WDCM Scaling/Productionizing
- Puppetize WDCM:
- (1) the CloudVPS component (Shiny dashboards),
- (2) in production (stat100* number crunchers) - if possible.
GoranSMilovanovic | Jul 21 2017, 1:56 AM
F10342919: wdcm_production.pp | Oct 20 2017, 11:39 PM
F10342934: wdcm_dashboards.pp | Oct 20 2017, 11:39 PM
F10342911: wdcm_base.pp | Oct 20 2017, 11:39 PM
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add ::statistics::wmde::wdcm | operations/puppet | production | +52 -0
Status | Assigned | Task
---|---|---
Invalid | GoranSMilovanovic | T171258 WDCM: Puppetization
Resolved | GoranSMilovanovic | T180340 Split WDCM repo
Invalid | None | T180900 puppet statistics::wmde refactor mysql user / config creation for use by both 'graphite' and 'wdcm'
Resolved | GoranSMilovanovic | T180902 Create analytics-wmde on stat1004
Resolved | GoranSMilovanovic | T180904 Hive database for analytics-wmde user
Resolved | Andrew | T185430 Remove wikidataconcepts project once migrated to wmde-dashboards
Resolved | chasemp | T179307 Revert increased quota for wikidataconcepts Cloud VPS project
Resolved | chasemp | T185429 Request creation of wmde-dashboards VPS project
@Addshore Reviewed, +1. Please wait until the next step for us to get in touch (Hangouts, Data Analysis Weekly, something) first.
@Addshore @Tobi_WMDE_SW @mpopov
Here we go: I have defined - or at least I think I have - two WDCM Puppet profiles and one Puppet role. All of these refer exclusively to the WDCM master branches (currently there are no separate dev/master branches for WDCM). They were modeled on the Discovery Dashboards Puppet manifests. I will list the manifests and their descriptions hierarchically, along with the issues related to them as I currently see them:
Problems/unsolved/unclear in relation to this profile:
(1) Line 59, source => 'puppet:///modules/profile/wdcm/index.html' - I have no idea where our Puppet file server runs, or how to get a file (index.html, the Shiny Server/Portal start page) placed there so that Puppet can pick it up and copy it to '/srv/shiny-server/index.html';
(2) the RMySQL R package - I remember there are some problems with this package and that for some reason it is not puppetized (if I understand correctly), so this package is not included in this profile;
(3) there should probably be another version of this manifest with the Shiny Server resource removed, for the following reason: on our Labs instance (wikidataconcepts.eqiad.wmflabs) we need a Shiny Server, but we don't need one on any of the statboxes where the WDCM pre-processing takes place (currently: stat1005); so we probably need one WDCM profile for the statboxes and a separate one for Labs. This one is for Labs.
Problems/unsolved/unclear in relation to this role: I am not sure I clearly understand what system::role is - my assumption is that it is a role defined at some more general level of the WMF Puppet hierarchy. Please advise.
@Addshore @Tobi_WMDE_SW please take into consideration that I do not feel at all comfortable with Puppet. All three manifests presented here were produced more by analogy with the Discovery Dashboards puppetization process (editing @mpopov's and the Discovery team's roles and profiles) than by any thorough understanding of Puppet. Please help me figure out *where* these manifests should be placed - assuming they do what they are supposed to do in the first place, of course.
Thanks.
You don't need to list the packages that are already installed as part of shiny_server: https://github.com/wikimedia/puppet/blob/production/modules/shiny_server/manifests/init.pp#L35--L62 (including dplyr, tidyr, and other tidyverse packages)
Problems/unsolved/unclear in relation to this profile: (1) Line 59, source => 'puppet:///modules/profile/wdcm/index.html' - I have no idea where our Puppet file server runs, or how to get a file (index.html, the Shiny Server/Portal start page) placed there so that Puppet can pick it up and copy it to '/srv/shiny-server/index.html'
that would go into modules/profile/files/wdcm/ (e.g. https://github.com/wikimedia/puppet/blob/production/modules/profile/files/discovery_dashboards/index.html)
(2) the RMySQL R package - I remember there are some problems with this package and that for some reason it is not puppetized (if I understand correctly), so this package is not included in this profile
Here's what I get trying to install RMySQL on one of our dashboard-hosting VMs:
Using PKG_LIBS=-lmysqlclient
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libmysqlclient was not found. Try installing:
 * deb: libmariadb-client-lgpl-dev (Debian, Ubuntu 16.04)
        libmariadbclient-dev (Ubuntu 14.04)
 * rpm: mariadb-devel | mysql-devel (Fedora, CentOS, RHEL)
 * csw: mysql56_dev (Solaris)
 * brew: mariadb-connector-c (OSX)
If libmysqlclient is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libmysqlclient.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package 'RMySQL'
This tells us we need to have a connector.
There are two ways to go about it. One is to make the CRAN resource require the client library package:

r_lang::cran { 'RMySQL':
    require => Package['libmariadbclient-dev'],
}
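The other route is not spelled out above; a hedged guess is that it would mean skipping the CRAN compile entirely and installing the distribution's pre-built package instead (assuming an r-cran-rmysql package is available on the host's Debian release):

```puppet
# Hypothetical alternative: use the distro-packaged RMySQL build,
# which already carries the libmysqlclient linkage, instead of
# compiling the package from CRAN at provisioning time.
package { 'r-cran-rmysql':
    ensure => present,
}
```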
(3) there should probably be another version of this manifest with the Shiny Server resource removed, for the following reason: on our Labs instance (wikidataconcepts.eqiad.wmflabs) we need a Shiny Server, but we don't need one on any of the statboxes where the WDCM pre-processing takes place (currently: stat1005); so we probably need one WDCM profile for the statboxes and a separate one for Labs. This one is for Labs.
If you're talking about puppetizing the WDCM pre-processing, you might be blocked by T174110. My team tried to puppetize our daily executions of metric-calculating scripts, but for now we have to run them under my staff account until we can have system users with private data access.
- wdcm_production.pp - this profile requires the wdcm_base.pp profile described above and does nothing but clone the production-ready WDCM dashboards from GitHub.
You need a unique directory for each of those. You also need to specify an origin (see https://github.com/wikimedia/puppet/blob/production/modules/git/manifests/clone.pp) if you're cloning from GitHub (git::clone assumes Gerrit).
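For reference, a minimal sketch of what such a clone declaration might look like - the resource name, repository URL, and target directory here are placeholders, not the actual WDCM values:

```puppet
# Hypothetical example: cloning one dashboard repo from GitHub.
# git::clone assumes Gerrit unless origin is given explicitly.
git::clone { 'wdcm-dashboard-example':
    origin    => 'https://github.com/example/wdcm-dashboard.git',
    directory => '/srv/shiny-server/wdcm-dashboard-example',
}
```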
- wdcm_dashboards.pp - this role includes the wdcm_production.pp profile described above and should define a system::role called role::wdcm::dashboards.
Problems/unsolved/unclear in relation to this role: I am not sure I clearly understand what system::role is - my assumption is that it is a role defined at some more general level of the WMF Puppet hierarchy. Please advise.
@Gehel how would you explain system::role and its usage in roles?
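For context, my understanding (a hedged sketch, not a definitive description) is that system::role is mostly a declarative marker that records what a host does, e.g. for the login MOTD, rather than configuring anything itself. The description text below is an assumption:

```puppet
# system::role advertises the host's purpose; it does not
# install or configure any services by itself.
system::role { 'wdcm::dashboards':
    description => 'WDCM Shiny dashboards server',
}
```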
Hope that helps!
So, @GoranSMilovanovic & I sat down today and now I have a fairly good idea of how the whole system works.
I have requested an increase in resources for the cloud project so that we can provision a new instance to apply puppet roles to while we create them, and bring both instances in line before making the switch. T179307
There is also a very WIP patch for the Shiny side of things, based on the structure of the Discovery Dashboards: https://gerrit.wikimedia.org/r/#/c/387211/
The work on R scripts in production can be seen at https://gerrit.wikimedia.org/r/#/c/369902/
I have requested a collection of new gerrit repos to split the code up a bit @ https://www.mediawiki.org/w/index.php?title=Gerrit/New_repositories/Requests/Entries&diff=prev&oldid=2604334
This will result in us having a repo for each dashboard setup, a repo for the landing page for the dashboards, and a repo for the R code needed on the stat machines.
All review of the gerrit patches linked would be greatly appreciated!
@Addshore The current situation in production (stat1005):
/srv/analytics-wmde/r-library is where the R packages live;
scripts are run as sudo -u analytics-wmde Rscript <script_name> (Rscript is the command-line front end for executing an R script).
The installation R script is found in /srv/analytics-wmde/installRlib and I hope you don't mind if we keep it there for further R package installations.
Also, with this configuration, I guess we need neither the r_lang nor the r_lang::cran classes in production - I now tend to think Discovery uses these only on Labs - because the installation R script on stat1005 takes care of everything, including which CRAN mirror to use.
Please tell me whether you can take care of (1), or whether I should help figure out where in our Puppet configuration we should declare analytics-wmde's R library (again, it is: /srv/analytics-wmde/r-library).
Hopefully, puppetizing on Labs shouldn't present any similar obstacles.
As for the WDCM scaling (at this point, basically moving away from {maptpx}, which is not maintained anymore, and switching to LDA from MLlib): I've tested SparkR on our fresh Spark 2.1.2 installation on stat1005 (see T139487#3752443) and everything seems to work fine. I would highly appreciate any hints that would enable me to work with {sparklyr} (see T139487#3558525); however, if that takes too much time or is simply too complicated to fix, I will probably go with SparkR.
So, we can probably add this script to the stat box as part of the puppet manifest, but then require the installation of packages to be done manually.
The 'correct' way of doing this would probably be to have another git repo, called WDCM-packages, with a committed version of all the packages, which would then be cloned via puppet.
- Now, to call the WDCM scripts in production productionized - given the current constraints (T174110) mentioned by @mpopov - my understanding is that we need
- to include something like this in our Puppet configuration (somewhere in the configuration - don't ask me where!), where the directory should be /srv/analytics-wmde/r-library, so that Puppet knows we have our own R library path, and
As far as I can tell, $rlib_dir isn't actually used anywhere other than making sure the directory has been created, which is fine - we can do that!
- to place a proper call to .libPaths() in all WDCM R scripts in production, which I will take care of.
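A minimal sketch of what that .libPaths() call could look like at the top of each production script - the package loaded below is just an illustration, not one of the actual WDCM dependencies:

```r
# Prepend the custom analytics-wmde library to R's search path,
# so library() resolves packages installed there first.
.libPaths(c('/srv/analytics-wmde/r-library', .libPaths()))
library(data.table)  # hypothetical example package
```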
Great, I guess this will be in the WDCM repo?
Also, with this configuration, I guess we need neither the r_lang nor the r_lang::cran classes in production - I now tend to think Discovery uses these only on Labs - because the installation R script on stat1005 takes care of everything, including which CRAN mirror to use.
The puppet manifest should probably still require R to be installed, but not worry about any of the packages.
Please tell me whether you can take care of (1), or whether I should help figure out where in our Puppet configuration we should declare analytics-wmde's R library (again, it is: /srv/analytics-wmde/r-library).
I'll take another look at the current puppet patch today and work in everything discussed above.
Hopefully, puppetizing on Labs shouldn't present any similar obstacles.
Labs will be much easier :)
The 'correct' way of doing this would probably be to have another git repo, called WDCM-packages, with a committed version of all the packages, which would then be cloned via puppet.
This could also just be in the WDCM repo.
So I just tried out the idea of installing all the libs and then shoving them in a git repo, but it turns out it's quite a large commit: 66MB and +2542486 lines...
https://gerrit.wikimedia.org/r/#/c/391069/
This is partly due to the fact that all library dependencies also get installed alongside the ones that are explicitly requested.
I guess there are also lots of files / paths that could be added to the gitignore file to make the commit smaller.
Perhaps the best route forward is to add the /srv/analytics-wmde/installRlib script to puppet, but require manual execution on the stat box to install the packages to a directory?
@Addshore A clear case of telepathy, I'm meditating the very same dilemma at the moment...
Yes, R packages are nasty: most of them are greedy with respect to dependencies, and installing one or two might indeed end up installing a significant proportion of CRAN :)
I guess we go with your proposal:
... to add the /srv/analytics-wmde/installRlib script to puppet, but require manual execution on the stat box to install the packages ...
because that seems to be the only reasonable thing that we can do and still stay aligned with the constraints as described in T174110.
Next step: I'm adding the installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R to the WDCM repo. Patch upcoming.
Great, and I guess you can modify the script so that it installs to CWD/libs or /packages or something like that?
And then also adjust all of the .R scripts to read the deps from that directory?
The script installs all necessary WDCM R packages to: /srv/analytics-wmde/r-library
And then also adjust all of the .R scripts to read the deps from that directory?
The deps == package dependencies? Well, no. The script runs a typical R package installation procedure, which means the dependencies are installed automatically during the installation of the desired package. No dependencies will reside in /srv/analytics-wmde/r-library or anywhere else on our systems.
Puppetization as I see it now: ensure /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R is present, and then run it manually.
I think you are proposing to: install a script via Puppet, but run it manually yourselves. This sounds fine.
So this should not be hard-coded; either the path should be relative to where the script is run from, or it should be possible to pass it in as a parameter.
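One hedged way the install script could implement that - both the default path and the package list below are placeholders, not the actual WDCM set:

```r
# Take the target library as an optional command-line argument,
# defaulting to ./r-library relative to the current working directory.
args   <- commandArgs(trailingOnly = TRUE)
libDir <- if (length(args) >= 1) args[[1]] else file.path(getwd(), 'r-library')
dir.create(libDir, recursive = TRUE, showWarnings = FALSE)
install.packages(c('data.table', 'ggplot2'),         # hypothetical package list
                 lib   = libDir,
                 repos = 'https://cran.rstudio.com') # hypothetical CRAN mirror
```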
And then also adjust all of the .R scripts to read the deps from that directory?
The deps == package dependencies? Well, no. The script runs a typical R package installation procedure, which means the dependencies are installed automatically during the installation of the desired package. No dependencies will reside in /srv/analytics-wmde/r-library or anywhere else on our systems.
I mean, how do your R scripts in the WDCM repo know where to load the deps that you have in r-library from?
Such as https://github.com/wikimedia/analytics-wmde-WDCM/blob/master/WDCM_Collect_Items.R#L50 ?
If we clone the WDCM repo to /srv/analytics-wmde/WDCM/src, running the install scripts could install the R packages to /srv/analytics-wmde/WDCM/src/r-library or /srv/analytics-wmde/WDCM/r-library instead of /srv/analytics-wmde/r-library.
This keeps everything WDCM related under the /srv/analytics-wmde/WDCM directory.
Great, we will continue down this path then and abandon the idea of keeping them in git as they are too massive. :)
@Ottomata That's the idea.
I mean, how do your R scripts in the WDCM repo know where to load the deps that you have in r-library from?
By informing the R scripts in the WDCM repo (and the Dashboards) of where the packages reside, via .libPaths() in the respective R code.
If we clone the WDCM repo to /srv/analytics-wmde/WDCM/src, running the install scripts could install the R packages to /srv/analytics-wmde/WDCM/src/r-library or /srv/analytics-wmde/WDCM/r-library instead of /srv/analytics-wmde/r-library.
The installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R explicitly defines the path where the packages should be installed, and that is: /srv/analytics-wmde/r-library.
Wait - hopefully R's install.packages() will follow the same path for the dependencies too; let me check!
Change 369902 merged by Ottomata:
[operations/puppet@production] Add ::statistics::wmde::wdcm
So, the code is now cloned on the production machines, along with a script to install the R libraries in the correct location to be used by the scripts.
Over the coming days @GoranSMilovanovic will remove hardcoded paths from all of the scripts and write a single script that can be added to cron in puppet.
Once that is done the WDCM part of puppetization is done (as far as I am aware)
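If it helps, a rough sketch of what that cron entry in Puppet might look like once the single script exists - the resource name, script path, and schedule here are all assumptions:

```puppet
# Hypothetical cron resource for a monthly WDCM run as analytics-wmde.
cron { 'wdcm-engine-run':
    command  => '/usr/bin/Rscript /srv/analytics-wmde/wdcm/src/WDCM_Engine.R',
    user     => 'analytics-wmde',
    minute   => 0,
    hour     => 0,
    monthday => 1,
}
```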
@Addshore Confirming that all necessary R packages in /srv/analytics-wmde/wdcm/src/r-library are in place.
The following steps as I see them:
@Addshore Currently running one, hopefully last, manual WDCM run, unifying the WDCM Engine scripts into one WDCM_Engine.R R script along the way.
@Lydia_Pintscher The quick fix mentioned in T174896 will take place during this operation.
(1) Could we please have the user analytics-wmde added to the analytics-research-client mySQL group?
(2) User analytics-wmde also needs a Hive database and to be able to Sqoop; analytics-wmde should be able to do exactly the same as the user goransm does on stat1004 in what follows:
sqoop import \
  --connect jdbc:mysql://analytics-store.eqiad.wmnet/dewiki \
  --password-file /user/goransm/mysql-analytics-research-client-pw.txt \
  --username research -m 4 \
  --query "select * from wbc_entity_usage where \$CONDITIONS" \
  --split-by eu_row_id \
  --as-avrodatafile \
  --target-dir /user/goransm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=dewiki \
  --delete-target-dir
Note: T171072 was opened in relation to whether or not we can run Sqoop from stat1005; if we can't, we're in serious trouble here (on stat1004 we can run Sqoop, but there is no R and no number crunching there).
OR we can split the WDCM system and run (1) a regular Sqoop operation on cron from stat1004 and (2) regular WDCM updates from stat1005?
Nevertheless, user analytics-wmde needs access to MySQL and a Hive database. If the Sqoop operation will run from stat1004, it must be able to write to the analytics-wmde's Hive database.
There is already a MySQL user that you can use on stat1005 from the analytics-wmde user.
You can find the details @ /etc/mysql/conf.d/research-wmde-client.cnf per https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/statistics/manifests/wmde/graphite.pp;f75a67626f6c0f029ca0caa78987b35c238fde58$39
I have created T180900 for a tiny bit of puppet refactoring I should do regarding the use of this file, but you can use it as is for now!
(2) User analytics-wmde also needs a Hive database and to be able to Sqoop; analytics-wmde should be able to do exactly the same as the user goransm does on stat1004 in what follows:
sqoop import \
  --connect jdbc:mysql://analytics-store.eqiad.wmnet/dewiki \
  --password-file /user/goransm/mysql-analytics-research-client-pw.txt \
  --username research -m 4 \
  --query "select * from wbc_entity_usage where \$CONDITIONS" \
  --split-by eu_row_id \
  --as-avrodatafile \
  --target-dir /user/goransm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=dewiki \
  --delete-target-dir
Note: T171072 was opened in relation to whether or not we can run Sqoop from stat1005; if we can't, we're in serious trouble here (on stat1004 we can run Sqoop, but there is no R and no number crunching there).
OR we can split the WDCM system and then run a (1) regular Sqoop operation on cron from stat1004 and (2) regular WDCM updates from stat1005?
Nevertheless, user analytics-wmde needs access to MySQL and a Hive database. If the Sqoop operation will run from stat1004, it must be able to write to the analytics-wmde's Hive database.
Splitting it is definitely possible.
The user should already have access to Hive (as some of our scripts already access Hive); however, I don't know about the Sqoop situation or what permissions in Hive the user has.
@Addshore analytics-wmde can run beeline, but it does not have its own database. Two ways to go:
(1) make analytics-wmde use the goransm database (use goransm from beeline, after sudo -u analytics-wmde beeline, works), or
(2) better, have a Hive database created for analytics-wmde.
I will test mySQL from analytics-wmde now.
As for the Sqoop situation: let's split it. I remember @Ottomata extensively describing the difficulties of making Sqoop run from stat1005. Java 8 won't be there before Q1 2018, so... let's simply split WDCM across two servers and have two cron jobs running: one for the Sqoop update, another for the full system update.
(2) definitely sounds like the correct way to go there. I'm sure @Ottomata can help us there.
I will test mySQL from analytics-wmde now.
As for the Sqoop situation: let's split it. I remember @Ottomata extensively describing the difficulties of making Sqoop run from stat1005. Java 8 won't be there before Q1 2018, so... let's simply split WDCM across two servers and have two cron jobs running: one for the Sqoop update, another for the full system update.
That's fine :)
I believe I will have to make a few more puppet changes so that the wdcm manifest can be applied to both servers (with differing configs on each).
I guess this switching logic has to go in profile::statistics::private or higher. I'll talk with @Ottomata and @elukey :)
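A hedged sketch of one way that switching could look - the parameter name, class layout, and script paths below are assumptions, not taken from the actual patch:

```puppet
# Hypothetical parameterized class: stat1004 would run only the
# Sqoop update, while stat1005 runs the full WDCM engine update.
class statistics::wmde::wdcm (
    Boolean $sqoop_only = false,
) {
    if $sqoop_only {
        cron { 'wdcm-sqoop-update':
            command => '/srv/analytics-wmde/wdcm/src/wdcm_sqoop.sh',  # placeholder
            user    => 'analytics-wmde',
        }
    } else {
        cron { 'wdcm-engine-update':
            command => '/usr/bin/Rscript /srv/analytics-wmde/wdcm/src/WDCM_Engine.R',
            user    => 'analytics-wmde',
        }
    }
}
```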
@Addshore Sorry for the typos... I've had a few really bad days (complicated dental intervention + more planned for the coming week), first coffee still unfinished, and already on this ticket - you can just imagine how much I'm in love with it.
@Addshore No user analytics-wmde on stat1004, of course. This is what I will do: I will run the Sqoop step now from stat1004 as goransm just in order to prepare everything for what needs to be done on stat1005, and then we can take care of stat1004 later on.
I have created T180902 for that.
Now that we need this stuff running on multiple stat machines I'll have to do a little refactoring of our puppet stuff.
Tested mySQL access for analytics-wmde via /etc/mysql/conf.d/research-wmde-client.cnf - all fine.
However, and contrary to my previous statement based on an incomplete test, analytics-wmde cannot use goransm to do Hive on stat1005.
Thus, we have to kindly ask @Ottomata to provide a Hive database for the analytics-wmde user. I've opened a ticket for this: T180904
The Sqoop step will definitely migrate to stat1004 once we have the analytics-wmde user set there (T180902).
@Lydia_Pintscher @Addshore @Tobi_WMDE_SW @Jan_Dittrich @Ottomata @mpopov WMDE-Analytics-Engineering
Status/Summary
In the meantime
I wish to thank everyone who has contributed here in the previous weeks, especially @mpopov @Ottomata and @Addshore
I'm going to try and take another pass at this in the coming weeks and try to get this done one way or another.
@Addshore Let me know if I can help. I'm into other things (Wiktionary) right now, but if you need me - just let me know.
Next steps:
Once this is finished, the "puppetization" part of the ticket will be separated (i.e. the ticket will be split into an ETL-with-Spark part, resolved, and a Puppet part, unresolved).
Given that
(1) the CloudVPS component (Shiny dashboards) is being productionized/vertically scaled with shinyproxy/golem - see the test server; and,
(2) on the stat100* servers, the WDCM back-end component will be productionized either with Anaconda (now on all stat100* machines) or with packrat/renv in the next step,
we do not need this anymore.