WDCM: Process Module Scaling/migrate to Production
Open, NormalPublic

Description

WDCM Scaling/Productionizing

  • Puppetize WDCM
  • Will {sparklyr} ever work from stat1005, so that this system can rely on Apache Spark (currently we are using the best that R can offer in order to barely survive WDCM's data pre-processing and statistical modeling demands on the statboxes), or is pyspark simply the future of WDCM?

Related Objects

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jul 21 2017, 1:56 AM

Since stat1002 is replaced by stat1005, the WDCM Process Module will be deployed on the new statsbox.

Possible conflict (to discuss with Analytics): Sqoop jobs are advised to be run from stat1004, and there were some problems in running them from stat1005. The problems will hopefully be resolved soon.

So @GoranSMilovanovic I did some refactoring of our puppet stuff and added you to a draft patch @ https://gerrit.wikimedia.org/r/#/c/369902/ which is an example of where a class should be for the first thing that we want to add to prod, which is the sqoop job.

@Addshore Thanks, I'll take a look; however, we still don't have a Gerrit repo for WDCM: my request on https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests has been there for several days already.

@Addshore

The request on https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests is finally processed, there is a Gerrit repo for WDCM now:

"Done Created as analytics/wmde/WDCM --QChrisNonWMF (talk) 07:42, 15 August 2017 (UTC)"

@Addshore Thanks. I am still experiencing problems with Gerrit (for example, "git review -s" seems to be deprecated in favor of doing something else), but I don't think it will cause me too much trouble in the long run.

@Tobi_WMDE_SW @Lydia_Pintscher @Jan_Dittrich

The WDCM system is now (almost) completely scaled, meaning: we're handling Wikidata usage statistics across all sister projects. I will need to spin up a discussion with you on what still remains to be done during my stay in Berlin. However, I see nothing that could prevent us from having a full working version delivered and in production during September, as planned.

GoranSMilovanovic added a comment (edited). · Sep 3 2017, 10:54 PM

Status report:

  • Again, all WDCM dashboards will be developed and operational in September; however
  • they will run on the wikidataconcepts labs instance (their dev environment, hopefully soon properly puppetized) until Analytics helps me to
  • productionize everything :) - where by "productionize" I mean (1) running Spark from R there, (2) being able to manage an SQL database from the statboxes, and (3) having everything puppetized.

The current situation is tricky:

  • I can run Sqoop scripts from stat1004, but not from stat1005 (Phab ticket opened)
  • I have R on stat1005, but not on stat1004 (Phab ticket opened)
  • It may be possible to connect R to Apache Spark from stat1004, unlike from stat1005, but there is still no R installation on stat1004
  • etc.

Change 369902 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/puppet@production] WIP DNM Add ::statistics::wmde::wikidata_concepts

https://gerrit.wikimedia.org/r/369902

@GoranSMilovanovic the above patch now includes the checkout of the code; what is the next step to be added to puppet / to productionize this?

@Addshore Reviewed, +1. Please wait until the next step for us to get in touch (Hangouts, Data Analysis Weekly, something) first.

@Addshore @Tobi_WMDE_SW @mpopov

Here we go: I have defined - or at least I think I did - two WDCM Puppet profiles and one Puppet role. All of these refer to WDCM master branches exclusively (currently there are no separate dev/master branches for WDCM). They were modeled on the Discovery Dashboards Puppet manifests. I will list the manifests and their descriptions hierarchically, alongside the issues I currently see in relation to them:

  • wdcm_base.pp. This is the base WDCM profile, and it encompasses (a) the R packages to be installed, (b) a call to the r_lang module with a non-default value of the timeout parameter (for {dplyr} and {tidyr}, of which at least the first can take some time to install), and (c) setting up the Shiny Server portal (the index.html file).

Problems/unsolved/unclear in relation to this profile: (1) line 59, source => 'puppet:///modules/profile/wdcm/index.html' - I have no idea where our Puppet file server runs, or how to place a file (index.html, the start Shiny Server/portal page) there so that Puppet picks it up and copies it to '/srv/shiny-server/index.html'; (2) the RMySQL R package - I remember that there are some problems with this package and for some reason it is not puppetized (if I understand correctly), so it is not included in this profile; (3) there should probably be another version of this manifest in which the Shiny Server resource is removed, for the following reason: on our labs instance (wikidataconcepts.eqiad.wmflabs) we need a Shiny Server, but we don't need it on the statboxes where the WDCM pre-processing takes place (currently: stat1005); so we probably need one WDCM profile for the statboxes and a separate one for labs. This one is for labs.

  • wdcm_production.pp - this profile requires the wdcm_base.pp profile described above and does nothing else but clone the production-ready WDCM dashboards from GitHub.

  • wdcm_dashboards.pp - this role includes the wdcm_production.pp profile described above and should define a system::role called role::wdcm::dashboards.

Problems/unsolved/unclear in relation to this role: I am not sure that I clearly understand what system::role is - my assumption is that it is a role defined at some more general level of the WMF Puppet hierarchy. Please advise.
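Taken together, the three manifests could be sketched along these lines (a hypothetical outline only - class names, the timeout value, and resource details are assumptions, not the actual repository code):

```puppet
# Hypothetical outline of the three manifests and how they relate;
# class names, values and resource details are assumptions.
class profile::wdcm::base {
    # (a)+(b): R packages installed via the r_lang module, with a
    # non-default timeout for slow installs like {dplyr}; whether the
    # timeout sits here or elsewhere is an assumption.
    r_lang::cran { ['dplyr', 'tidyr']:
        timeout => 600,  # assumed value
    }
    # (c): the Shiny Server portal landing page.
    file { '/srv/shiny-server/index.html':
        ensure => 'present',
        source => 'puppet:///modules/profile/wdcm/index.html',
    }
}

class profile::wdcm::production {
    # Requires the base profile, then clones the production-ready
    # dashboards (git::clone declarations would go here).
    require ::profile::wdcm::base
}

class role::wdcm::dashboards {
    include ::profile::wdcm::production
}
```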

@Addshore @Tobi_WMDE_SW please bear in mind that I do not feel comfortable with Puppet at all. All three manifests presented here were produced more by analogy with the Discovery Dashboards puppetization (editing @mpopov's and the Discovery team's roles and profiles) than by any thorough understanding of Puppet. Please help me figure out *where* these manifests should be placed - if they are doing what they are supposed to do in the first place, of course.

Thanks.

mpopov added a subscriber: Gehel. · Oct 24 2017, 10:09 PM
  • wdcm_base.pp. This is the base WDCM profile, and it encompasses (a) the R packages to be installed, (b) a call to the r_lang module with a non-default value of the timeout parameter (for {dplyr} and {tidyr}, of which at least the first can take some time to install), and (c) setting up the Shiny Server portal (the index.html file).

You don't need to list the packages that are already installed as part of shiny_server: https://github.com/wikimedia/puppet/blob/production/modules/shiny_server/manifests/init.pp#L35--L62 (including dplyr, tidyr, and other tidyverse packages)

Problems/unsolved/unclear in relation to this profile: (1) line 59, source => 'puppet:///modules/profile/wdcm/index.html' - I have no idea where our Puppet file server runs, or how to place a file (index.html, the start Shiny Server/portal page) there so that Puppet picks it up and copies it to '/srv/shiny-server/index.html'

that would go into modules/profile/files/wdcm/ (e.g. https://github.com/wikimedia/puppet/blob/production/modules/profile/files/discovery_dashboards/index.html)

(2) the RMySQL R package - I remember that there are some problems with this package and for some reason it is not puppetized (if I understand correctly), so this package is not included in this profile

Here's what I get trying to install RMySQL on one of our dashboard-hosting VMs:

Using PKG_LIBS=-lmysqlclient
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libmysqlclient was not found. Try installing:
 * deb: libmariadb-client-lgpl-dev (Debian, Ubuntu 16.04)
        libmariadbclient-dev (Ubuntu 14.04)
 * rpm: mariadb-devel | mysql-devel (Fedora, CentOS, RHEL)
 * csw: mysql56_dev (Solaris)
 * brew: mariadb-connector-c (OSX)
If libmysqlclient is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libmysqlclient.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘RMySQL’

This tells us we need to have a connector.

There are two ways to go about it:

  1. (Preferred) Include require_package('r-cran-rmysql') like in statistics::packages (used on stat1005); this will install the necessary dependencies (see https://packages.ubuntu.com/trusty/r-cran-rmysql)
  2. (Shown as an educational example) You'd include require_package('libmariadbclient-dev') in wdcm_base.pp (e.g. after include ::shiny_server). Then you can declare:
r_lang::cran { 'RMySQL':
  require => Package['libmariadbclient-dev'],
}

(3) there should probably be another version of this manifest in which the Shiny Server resource is removed, for the following reason: on our labs instance (wikidataconcepts.eqiad.wmflabs) we need a Shiny Server, but we don't need it on the statboxes where the WDCM pre-processing takes place (currently: stat1005); so we probably need one WDCM profile for the statboxes and a separate one for labs. This one is for labs.

If you're talking about puppetizing the WDCM pre-processing, you might be blocked by T174110. My team tried to puppetize our daily executions of metric-calculating scripts, but for now we have to run them under my staff account until we can have system users with private data access.

  • wdcm_production.pp - this profile requires the wdcm_base.pp profile described above and does nothing else but clone the production-ready WDCM dashboards from GitHub.

You need to have a unique directory for each of those. You also need to specify an origin (see https://github.com/wikimedia/puppet/blob/production/modules/git/manifests/clone.pp) if you're cloning from GitHub (git::clone assumes Gerrit).
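For reference, a git::clone with an explicit GitHub origin might look like this (a sketch; the title, directory and repository URL are placeholders, not the actual values):

```puppet
# Hypothetical example: git::clone defaults to Gerrit, so an explicit
# origin is needed for GitHub; title, directory and URL are placeholders.
git::clone { 'wdcm-dashboard-example':
    ensure    => 'latest',
    directory => '/srv/shiny-server/wdcm-example',
    origin    => 'https://github.com/example/example-dashboard.git',
}
```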

  • wdcm_dashboards.pp - this role includes the wdcm_production.pp profile described above and should define a system::role called role::wdcm::dashboards.

    Problems/unsolved/unclear in relation to this role: I am not sure that I clearly understand what system::role is - my assumption is that it is a role defined at some more general level of the WMF Puppet hierarchy. Please advise.

@Gehel how would you explain system::role and its usage in roles?

Hope that helps!
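As a rough illustration of how a role class can declare a system::role resource (a hypothetical sketch; the description string is an assumption):

```puppet
# Hypothetical sketch: a role class declaring a system::role resource,
# which attaches a human-readable description to the host's role.
class role::wdcm::dashboards {
    system::role { 'wdcm::dashboards':
        description => 'WDCM dashboards host',  # assumed description
    }
    include ::profile::wdcm::production
}
```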

So, @GoranSMilovanovic & I sat down today and now I have a fairly good idea of how the whole system works.

I have requested an increase in resources for the cloud project so that we can provision a new instance to apply puppet roles to while we create them, and bring both instances in line before making the switch. T179307
There is also a very WIP patch for the shiny side of things, based on the structure of the discovery dashboards, @ https://gerrit.wikimedia.org/r/#/c/387211/

The work on R scripts in production can be seen at https://gerrit.wikimedia.org/r/#/c/369902/

I have requested a collection of new gerrit repos to split the code up a bit @ https://www.mediawiki.org/w/index.php?title=Gerrit/New_repositories/Requests/Entries&diff=prev&oldid=2604334
This will result in us having a repo for each dashboard setup, a repo for the landing page for the dashboards, and a repo for the R code needed on the stat machines.

All review of the gerrit patches linked would be greatly appreciated!

Addshore moved this task from Backlog to In Progress on the User-Addshore board. · Oct 30 2017, 4:30 PM
GoranSMilovanovic triaged this task as High priority. · Nov 1 2017, 10:07 AM

@Addshore The current situation in production (stat1005):

  • a simple R script was used to install all WDCM relevant R packages on stat1005 to:

/srv/analytics-wmde/r-library

run as sudo -u analytics-wmde Rscript <script_name> (where Rscript is a command line call to an R script).

The installation R script is found in /srv/analytics-wmde/installRlib and I hope you don't mind if we keep it there for further R package installations.

  • Now, to call the WDCM scripts in production "productionized" - given the current constraints (T174110) mentioned by @mpopov - my understanding is that we need
  1. to include something like this in our Puppet configuration (somewhere in the configuration, don't ask me where!), where the directory should be /srv/analytics-wmde/r-library, so that Puppet knows we have our own R library path, and
  2. to place a proper call to .libPaths() in all WDCM R scripts in production, which I will take care of.
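Step (1) would presumably amount to a single file resource, along these lines (a sketch; ownership and mode are assumptions):

```puppet
# Hypothetical sketch: declare the shared R library path so Puppet
# ensures the directory exists; owner, group and mode are assumptions.
$rlib_dir = '/srv/analytics-wmde/r-library'
file { $rlib_dir:
    ensure => 'directory',
    owner  => 'analytics-wmde',
    group  => 'analytics-wmde',
    mode   => '0775',
}
```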

Also, with this configuration, I guess we need neither the r_lang nor the r_lang::cran classes in production - I now tend to think that Discovery uses these only on Labs - because the installation R script on stat1005 takes care of everything, including the CRAN mirror to use.

Please tell me whether you can take care of (1), or whether I should help figure out where in our Puppet configuration we should place the declaration of analytics-wmde's R library (again, it is: /srv/analytics-wmde/r-library).

Hopefully, puppetizing on Labs shouldn't present any similar obstacles.

GoranSMilovanovic added a comment (edited). · Nov 11 2017, 11:22 AM

As for the WDCM scaling (at this point, basically running away from {maptpx}, which is not maintained anymore, and switching to LDA from MLlib): I've tested SparkR on our fresh Spark 2.1.2 installation from stat1005 (see T139487#3752443) and everything seems to work fine. I would highly appreciate any hints that would enable me to work with {sparklyr} (see T139487#3558525); however, if that takes too much time or is simply too complicated to fix, I will probably go for SparkR.

@Addshore The current situation in production (stat1005):

  • a simple R script was used to install all WDCM relevant R packages on stat1005 to:

    /srv/analytics-wmde/r-library

    run as sudo -u analytics-wmde Rscript <script_name> (where Rscript is a command line call to an R script).

    The installation R script is found in /srv/analytics-wmde/installRlib and I hope you don't mind if we keep it there for further R package installations.

So, we can probably add this script to the stat box as part of the puppet manifest, but then require the installation of packages to be done manually.
The 'correct' way of doing this would probably be to have another git repo called WDCM-packages that has a committed version of all packages that would then be cloned within puppet.

  • Now, to call the WDCM scripts in production "productionized" - given the current constraints (T174110) as mentioned by @mpopov - my understanding is that we need
  1. to include something like this (https://github.com/wikimedia/puppet/blob/b009784ef800dc80ecd11592da46ff283b321566/modules/statistics/manifests/discovery.pp#L12-L13) in our Puppet configuration (somewhere in the configuration, don't ask me where!), where the directory should be /srv/analytics-wmde/r-library so that Puppet knows we have our own R library path, and

As far as I can tell $rlib_dir isn't actually used anywhere really, other than making sure the directory has been created, which is fine, we can do that!

  1. to place a proper call to .libPaths() to all WDCM R scripts in production, which I will take care about.

Great, I guess this will be in the WDCM repo?

Also, with this configuration, I guess we need neither the r_lang nor the r_lang::cran classes in production - I now tend to think that Discovery uses these only on Labs - because the installation R script on stat1005 takes care of everything, including the CRAN mirror to use.

The puppet manifest should probably still require R to be installed, but not worry about any of the packages.

Please tell me whether you can take care of (1), or whether I should help figure out where in our Puppet configuration we should place the declaration of analytics-wmde's R library (again, it is: /srv/analytics-wmde/r-library).

I'll take another look at the current puppet patch today and work in everything discussed above.

Hopefully, puppetizing on Labs shouldn't present any similar obstacles.

Labs will be much easier :)

The 'correct' way of doing this would probably be to have another git repo called WDCM-packages that has a committed version of all packages that would then be cloned within puppet.

This could also just be in the WDCM repo.

So I just tried out the idea of installing all the libs and then shoving them in a git repo, but it turns out it's quite a large commit: 66MB and +2542486 lines...

https://gerrit.wikimedia.org/r/#/c/391069/

This is partly due to the fact that all library dependencies also get installed alongside the ones that are explicitly requested.
I guess there are also lots of files / paths that could be added to the gitignore file to make the commit smaller.

Perhaps the best route forward is to add the /srv/analytics-wmde/installRlib script to puppet, but require manual execution on the stat box to install the packages to a directory?

@Addshore A clear case of telepathy, I'm meditating the very same dilemma at the moment...

Yes, R packages are nasty: most of them are greedy with respect to their dependencies, and installing one or two might indeed end up installing a significant proportion of CRAN :)

I guess we go with your proposal:

... to add the /srv/analytics-wmde/installRlib script to puppet, but require manual execution on the stat box to install the packages ...

because that seems to be the only reasonable thing that we can do and still stay aligned with the constraints as described in T174110.

Next step: I'm adding the installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R to the WDCM repo. Patch upcoming.

Next step: I'm adding the installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R to the WDCM repo. Patch upcoming.

Great, and I guess you can modify the script so that it installs to CWD/libs or /packages or something like that?
And then also adjust all of the .R scripts to read the deps from that directory?

GoranSMilovanovic added a comment (edited). · Nov 13 2017, 7:55 PM

The script installs all necessary WDCM R packages to: /srv/analytics-wmde/r-library

And then also adjust all of the .R scripts to read the deps from that directory?

The deps == package dependencies? Well, no. The script runs a typical R package installation procedure, which means the dependencies are installed automatically during the installation of the desired package. No dependencies will reside in /srv/analytics-wmde/r-library or anywhere else on our systems.

Puppetization as I see it now: ensure /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde is present, and then run it manually.

I think you are proposing to: install a script via Puppet, but run it manually yourselves. This sounds fine.
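Sketched in Puppet, this "install via Puppet, run manually" arrangement might look like the following (file names, mode and source path are assumptions):

```puppet
# Hypothetical sketch: Puppet only ensures the install script is present;
# running it (the actual package installation) stays a manual step.
file { '/srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R':
    ensure => 'present',
    owner  => 'analytics-wmde',
    mode   => '0554',
    source => 'puppet:///modules/statistics/wmde/installRlib.R',  # assumed source path
}
# Deliberately no exec resource: the script is run manually, e.g.
# sudo -u analytics-wmde Rscript _installProduction_analytics-wmde.R
```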

The script installs all necessary WDCM R packages to: /srv/analytics-wmde/r-library

So this should not be hard-coded: the path should either be relative to where the script is run from, or be passed in as a parameter.

And then also adjust all of the .R scripts to read the deps from that directory?

The deps == package dependencies? Well, no. The script runs a typical R package installation procedure, which means the dependencies are installed automatically during the installation of the desired package. No dependencies will reside in /srv/analytics-wmde/r-library or anywhere else on our systems.

I mean, how do your R scripts in the WDCM repo know where to load the deps that you have in r-library from?
Such as https://github.com/wikimedia/analytics-wmde-WDCM/blob/master/WDCM_Collect_Items.R#L50 ?

If we clone the WDCM repo to /srv/analytics-wmde/WDCM/src, running the install scripts could install the R packages to /srv/analytics-wmde/WDCM/src/r-library or /srv/analytics-wmde/WDCM/r-library instead of /srv/analytics-wmde/r-library.
This keeps everything WDCM related under the /srv/analytics-wmde/WDCM directory.

I think you are proposing to: install a script via Puppet, but run it manually yourselves. This sounds fine.

Great, we will continue down this path then and abandon the idea of keeping them in git as they are too massive. :)

GoranSMilovanovic added a comment (edited). · Nov 13 2017, 8:05 PM

@Ottomata That's the idea.

@Addshore

I mean, how do your R scripts in the WDCM repo know where to load the deps that you have in r-library from?

By informing the R scripts in the WDCM repo (and the Dashboards) where the packages reside, via .libPaths() in the respective R code.

If we clone the WDCM repo to /srv/analytics-wmde/WDCM/src, running the install scripts could install the R packages to /srv/analytics-wmde/WDCM/src/r-library or /srv/analytics-wmde/WDCM/r-library instead of /srv/analytics-wmde/r-library.

The installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde explicitly defines the path where the packages should be installed, and that is: /srv/analytics-wmde/r-library.

Wait - well, hopefully R's install.packages() will follow the same path for the dependencies too; let me check!

Change 369902 merged by Ottomata:
[operations/puppet@production] Add ::statistics::wmde::wdcm

https://gerrit.wikimedia.org/r/369902

So, the code is now cloned on production machines, along with a script to install R libraries in the correct location to be used by scripts.

Over the coming days @GoranSMilovanovic will remove hardcoded paths from all of the scripts and write a single script that can be added to cron in puppet.

Once that is done the WDCM part of puppetization is done (as far as I am aware)

@Addshore I think that's it. It will not take me too much time to unify the scripts.

@Addshore Confirming that all necessary R packages in /srv/analytics-wmde/wdcm/src/r-library are in place.

The following steps as I see them:

  • the unification of the WDCM Engine scripts in production - hopefully by Thursday, Nov 16 (at the latest)
  • running one (last) manual WDCM Engine run (WDCM update) to ensure everything's in place (weekend, Nov 18 - 19)
  • putting WDCM on cron in production (at the beginning of the following week).
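The final cron step could then be a single resource along these lines (a sketch; the schedule, script name and path are assumptions):

```puppet
# Hypothetical sketch: a monthly cron entry for the unified engine
# script; the script name, path and schedule are assumptions.
cron { 'wdcm-engine':
    ensure   => 'present',
    user     => 'analytics-wmde',
    command  => '/usr/bin/Rscript /srv/analytics-wmde/wdcm/src/WDCM_Engine.R',
    minute   => 0,
    hour     => 0,
    monthday => 1,
}
```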

Sounds good!

@Addshore Currently running one, hopefully last, manual WDCM run, unifying the WDCM Engine scripts into one WDCM_Engine.R R script along the way.

@Lydia_Pintscher The quick fix mentioned in T174896 will take place during this operation.

GoranSMilovanovic added a comment (edited). · Nov 19 2017, 12:29 AM

@Ottomata @Addshore

(1) Could we please have the user analytics-wmde added to the analytics-research-client mySQL group?

(2) User analytics-wmde also needs a Hive database and to be able to Sqoop; analytics-wmde should be able to do exactly the same as the user goransm does on stat1004 in what follows:

sqoop import --connect jdbc:mysql://analytics-store.eqiad.wmnet/dewiki --password-file /user/goransm/mysql-analytics-research-client-pw.txt --username research -m 4 --query "select * from wbc_entity_usage where \$CONDITIONS" --split-by eu_row_id --as-avrodatafile --target-dir /user/goransm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=dewiki --delete-target-dir

Note: T171072 was opened in relation to whether or not we can run Sqoop from stat1005, because if we can't, we're in serious trouble here (on stat1004 we can do Sqoop, but there is no R for number crunching there).

OR we can split the WDCM system and then run a (1) regular Sqoop operation on cron from stat1004 and (2) regular WDCM updates from stat1005?

Nevertheless, user analytics-wmde needs access to MySQL and a Hive database. If the Sqoop operation will run from stat1004, it must be able to write to the analytics-wmde's Hive database.

@Ottomata @Addshore

(1) Could we please have the user analytics-wmde added to the analytics-research-client mySQL group?

There is already a MySQL user that you can use on stat1005 as the analytics-wmde user.
You can find the details @ /etc/mysql/conf.d/research-wmde-client.cnf per https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/statistics/manifests/wmde/graphite.pp;f75a67626f6c0f029ca0caa78987b35c238fde58$39
I have created T180900 for a tiny bit of puppet refactoring I should do regarding the use of this file, but you can use it as is for now!

(2) User analytics-wmde also needs a Hive database and to be able to Sqoop; analytics-wmde should be able to do exactly the same as the user goransm does on stat1004 in what follows:

sqoop import --connect jdbc:mysql://analytics-store.eqiad.wmnet/dewiki --password-file /user/goransm/mysql-analytics-research-client-pw.txt --username research -m 4 --query "select * from wbc_entity_usage where \$CONDITIONS" --split-by eu_row_id --as-avrodatafile --target-dir /user/goransm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=dewiki --delete-target-dir

Note: T171072 was opened in relation to whether or not we can run Sqoop from stat1005, because if we can't, we're in serious trouble here (on stat1004 we can do Sqoop, but there is no R for number crunching there).

OR we can split the WDCM system and then run a (1) regular Sqoop operation on cron from stat1004 and (2) regular WDCM updates from stat1005?

Nevertheless, user analytics-wmde needs access to MySQL and a Hive database. If the Sqoop operation will run from stat1004, it must be able to write to the analytics-wmde's Hive database.

Splitting it is definitely possible.
The user should already have access to Hive (as some of our scripts already access Hive); however, I don't know about the Sqoop situation or what permissions the user has in Hive.

GoranSMilovanovic added a comment (edited). · Nov 19 2017, 11:26 AM

@Addshore analytics-wmde can run beeline, but it does not have its own database. Two ways to go:

(1) make analytics-wmde use the goransm database (running 'use goransm' from beeline after 'sudo -u analytics-wmde beeline' works), or
(2) better, have a Hive database created for analytics-wmde.

I will test mySQL from analytics-wmde now.

As for the Sqoop situation: let's split it. I remember @Ottomata describing at length the difficulties of making Sqoop run from stat1005. Java 8 won't be there before Q1 2018, so... let's simply split WDCM across two servers and have two cron jobs running, one for the Sqoop update, another for the full system update.

@Addshore analytics-wmde can run beeline, but it does not have its own database. Two ways to go:

(1) make analytics-wmde use goransm database (use goransm from beeline after sudo -u analytics-wmde beeline works), or
(2) better, have a Hive database created for analytics-wmde.

(2) definitely sounds like the correct way to go there. I'm sure @Ottomata can help us there.

I will test mySQL from analytics-wmde now.

As for the Sqoop situation: let's split it. I remember @Ottomata describing at length the difficulties of making Sqoop run from stat1005. Java 8 won't be there before Q1 2018, so... let's simply split WDCM across two servers and have two cron jobs running, one for the Sqoop update, another for the full system update.

Thats fine :)
I believe I will have to do a few more puppet changes to have the wdcm manifest applied to both servers (with differing configs on each).
I guess this switching logic has to go in profile::statistics::private or higher. I'll talk with @Ottomata and @elukey :)
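The switching logic mentioned above could, in principle, be a per-host conditional in a shared profile (purely illustrative; the class names are placeholders, not real classes in the repo):

```puppet
# Hypothetical sketch: include different WDCM pieces depending on the
# host; both included class names are placeholders.
class profile::wdcm::statistics {
    if $::hostname == 'stat1004' {
        include ::statistics::wmde::wdcm_sqoop    # Sqoop cron job only
    } elsif $::hostname == 'stat1005' {
        include ::statistics::wmde::wdcm_engine   # R processing / full update
    }
}
```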

@Addshore Sorry for the typos... I have had a few really bad days (complicated dental intervention + more planned for the forthcoming week), first coffee still unfinished, and I'm already on this ticket - you can just imagine how much I'm in love with it.

@Addshore No user analytics-wmde on stat1004, of course. This is what I will do: I will run the Sqoop step now from stat1004 as goransm just in order to prepare everything for what needs to be done on stat1005, and then we can take care of stat1004 later on.

@Addshore No user analytics-wmde on stat1004, of course. This is what I will do: I will run the Sqoop step now from stat1004 as goransm just in order to prepare everything for what needs to be done on stat1005, and then we can take care of stat1004 later on.

I have created T180902 for that.
Now that we need this stuff running on multiple stat machines I'll have to do a little refactoring of our puppet stuff.

GoranSMilovanovic added a comment (edited). · Nov 19 2017, 12:47 PM

Tested mySQL access for analytics-wmde via /etc/mysql/conf.d/research-wmde-client.cnf - all fine.

However - and contrary to my previous statement, which was based on an incomplete test - analytics-wmde cannot use the goransm database in Hive on stat1005.

Thus, we have to kindly ask @Ottomata to provide a Hive database for the analytics-wmde user. I've opened a ticket for this: T180904

The Sqoop step will definitely migrate to stat1004 once we have the analytics-wmde user set there (T180902).

@Addshore Should we proceed to puppetize WDCM on labs until T180904 and other related tasks can be resolved?

GoranSMilovanovic added a comment (edited). · Nov 21 2017, 11:26 AM

@Lydia_Pintscher @Addshore @Tobi_WMDE_SW @Jan_Dittrich @Ottomata @mpopov WMDE-Analytics-Engineering

Status/Summary

  • it seems we cannot have WDCM completely in production on the statboxes before Q1 2018
  • reason: we must orchestrate the same user, analytics-wmde, across stat1004 (regular Sqoop job) and stat1005 (everything else)
  • that orchestration is rather complicated to deploy, and impossible at this point because of the standards currently maintained by Analytics and Operations (see: T174110, T174465, T180902); this will change, but not before Q1 2018
  • we will proceed to puppetize the labs component of the WDCM (Shiny Dashboards that are being run from the wikidataconcepts labs instance).

In the meantime

  • I will start running "regular" WDCM updates on a monthly basis, manually, from my goransm user on the statboxes
  • The labs part will be productionized (as this shouldn't be a problem)
  • Lowering the priority of this task since nothing can be done until Analytics and Operations find some time to provide the configuration that we need on the statboxes
  • @Addshore I will document the process up to this point on the WDCM Wikitech page
  • Scaling: I want to try {SparkR} w. Spark 2 on stat1005 to scale the LDA step and eliminate the {maptpx} package (rapid estimation of topic models) from the loop.

I wish to thank everyone who has contributed here in the previous weeks, especially @mpopov @Ottomata and @Addshore

GoranSMilovanovic lowered the priority of this task from High to Normal. · Nov 21 2017, 11:28 AM

I'm going to take another pass at this in the coming weeks and try to get this done one way or another.

@Addshore Let me know if I can help. I'm into other things (Wiktionary) right now, but if you need me - just let me know.