WDCM Scaling/Productionizing
- Puppetize WDCM:
- (1) the CloudVPS component (Shiny dashboards),
- (2) in production (stat100* number crunchers) - if possible.
GoranSMilovanovic | Jul 21 2017, 1:56 AM
F10342919: wdcm_production.pp | Oct 20 2017, 11:39 PM
F10342934: wdcm_dashboards.pp | Oct 20 2017, 11:39 PM
F10342911: wdcm_base.pp | Oct 20 2017, 11:39 PM
Subject | Repo | Branch | Lines +/-
---|---|---|---
Add ::statistics::wmde::wdcm | operations/puppet | production | +52 -0
Status | Assigned | Task
---|---|---
Invalid | GoranSMilovanovic | T171258 WDCM: Puppetization
Resolved | GoranSMilovanovic | T180340 Split WDCM repo
Invalid | None | T180900 puppet statistics::wmde refactor mysql user / config creation for use by both 'graphite' and 'wdcm'
Resolved | GoranSMilovanovic | T180902 Create analytics-wmde on stat1004
Resolved | GoranSMilovanovic | T180904 Hive database for analytics-wmde user
Resolved | Andrew | T185430 Remove wikidataconcepts project once migrated to wmde-dashboards
Resolved | chasemp | T179307 Revert increased quota for wikidataconcepts Cloud VPS project
Resolved | chasemp | T185429 Request creation of wmde-dashboards VPS project
@Addshore Reviewed, +1. Please wait until the next step for us to get in touch (Hangouts, Data Analysis Weekly, something) first.
@Addshore @Tobi_WMDE_SW @mpopov
Here we go: I have defined - or at least I think I have - two WDCM Puppet profiles and one Puppet role. All of these refer exclusively to the WDCM master branches (currently there are no separate dev/master branches for WDCM). They were modeled on the Discovery Dashboards Puppet manifests. I will list the manifests and their descriptions hierarchically, along with the issues related to them as I currently see them:
Problems/unsolved/unclear in relation to this profile:
(1) Line 59, source => 'puppet:///modules/profile/wdcm/index.html' - I have no idea where our Puppet file server runs, or how to get a file (index.html, the Shiny Server/Portal start page) placed there so that Puppet can pick it up and copy it to '/srv/shiny-server/index.html';
(2) the RMySQL R package - I remember there are some problems with this package and that for some reason it is not puppetized (if I understand correctly), so this package is not included in this profile;
(3) there should probably be another version of this manifest with the Shiny Server resource removed, for the following reason: on our Labs instance (wikidataconcepts.eqiad.wmflabs) we need a Shiny Server, but we don't need one on any of the statboxes where the WDCM pre-processing takes place (currently: stat1005); so we probably need one WDCM profile for the statboxes and a separate one for Labs. This one is for Labs.
Problems/unsolved/unclear in relation to this role: I am not sure I clearly understand what system::role is - my assumption is that it is a role defined at some more general level of the WMF Puppet hierarchy. Please advise.
@Addshore @Tobi_WMDE_SW please take into consideration that I do not feel at all comfortable with Puppet. All three manifests presented here were produced more by analogy with the Discovery Dashboards puppetization process (editing @mpopov's and the Discovery team's roles and profiles) than by any thorough understanding of Puppet. Please help me figure out *where* these manifests should be placed - assuming they do what they are supposed to do in the first place, of course.
Thanks.
You don't need to list the packages that are already installed as part of shiny_server: https://github.com/wikimedia/puppet/blob/production/modules/shiny_server/manifests/init.pp#L35--L62 (including dplyr, tidyr, and other tidyverse packages)
Problems/unsolved/unclear in relation to this profile: (1) Line 59, source => 'puppet:///modules/profile/wdcm/index.html' - I have no idea where our Puppet file server runs, or how to get a file (index.html, the Shiny Server/Portal start page) placed there so that Puppet can pick it up and copy it to '/srv/shiny-server/index.html'
that would go into modules/profile/files/wdcm/ (e.g. https://github.com/wikimedia/puppet/blob/production/modules/profile/files/discovery_dashboards/index.html)
(2) the RMySQL R package - I remember there are some problems with this package and that for some reason it is not puppetized (if I understand correctly), so this package is not included in this profile
Here's what I get trying to install RMySQL on one of our dashboard-hosting VMs:
Using PKG_LIBS=-lmysqlclient
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libmysqlclient was not found. Try installing:
 * deb: libmariadb-client-lgpl-dev (Debian, Ubuntu 16.04)
        libmariadbclient-dev (Ubuntu 14.04)
 * rpm: mariadb-devel | mysql-devel (Fedora, CentOS, RHEL)
 * csw: mysql56_dev (Solaris)
 * brew: mariadb-connector-c (OSX)
If libmysqlclient is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libmysqlclient.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package 'RMySQL'
This tells us we need to have a connector.
There are two ways to go about it. One is to make the CRAN resource require the client library package:

r_lang::cran { 'RMySQL':
    require => Package['libmariadbclient-dev'],
}
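The other route is not spelled out above; a hedged guess is that it would mean skipping the CRAN compile entirely and installing the distribution's pre-built package instead (assuming an r-cran-rmysql package is available on the host's Debian release):

```puppet
# Hypothetical alternative: use the distro-packaged RMySQL build,
# which already carries the libmysqlclient linkage, instead of
# compiling the package from CRAN at provisioning time.
package { 'r-cran-rmysql':
    ensure => present,
}
```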
(3) there should probably be another version of this manifest with the Shiny Server resource removed, for the following reason: on our Labs instance (wikidataconcepts.eqiad.wmflabs) we need a Shiny Server, but we don't need one on any of the statboxes where the WDCM pre-processing takes place (currently: stat1005); so we probably need one WDCM profile for the statboxes and a separate one for Labs. This one is for Labs.
If you're talking about puppetizing the WDCM pre-processing, you might be blocked by T174110. My team tried to puppetize our daily executions of metric-calculating scripts, but for now we have to run them under my staff account until we can have system users with private data access.
- wdcm_production.pp - this profile requires the wdcm_base.pp profile described above and does nothing but clone the production-ready WDCM dashboards from GitHub.
You need a unique directory for each of those. You also need to specify an origin (see https://github.com/wikimedia/puppet/blob/production/modules/git/manifests/clone.pp) if you're cloning from GitHub (git::clone assumes Gerrit).
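For reference, a minimal sketch of what such a clone declaration might look like - the resource name, repository URL, and target directory here are placeholders, not the actual WDCM values:

```puppet
# Hypothetical example: cloning one dashboard repo from GitHub.
# git::clone assumes Gerrit unless origin is given explicitly.
git::clone { 'wdcm-dashboard-example':
    origin    => 'https://github.com/example/wdcm-dashboard.git',
    directory => '/srv/shiny-server/wdcm-dashboard-example',
}
```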
- wdcm_dashboards.pp - this role includes the wdcm_production.pp profile described above and should define a system::role called role::wdcm::dashboards.
Problems/unsolved/unclear in relation to this role: I am not sure I clearly understand what system::role is - my assumption is that it is a role defined at some more general level of the WMF Puppet hierarchy. Please advise.
@Gehel how would you explain system::role and its usage in roles?
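For context, my understanding (a hedged sketch, not a definitive description) is that system::role is mostly a declarative marker that records what a host does, e.g. for the login MOTD, rather than configuring anything itself. The description text below is an assumption:

```puppet
# system::role advertises the host's purpose; it does not
# install or configure any services by itself.
system::role { 'wdcm::dashboards':
    description => 'WDCM Shiny dashboards server',
}
```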
Hope that helps!
So, @GoranSMilovanovic & I sat down today and now I have a fairly good idea of how the whole system works.
I have requested an increase in resources for the cloud project so that we can provision a new instance to apply puppet roles to while we create them, and bring both instances in line before making the switch. T179307
There is also a very WIP patch for the Shiny side of things, based on the structure of the Discovery Dashboards: https://gerrit.wikimedia.org/r/#/c/387211/
The work on R scripts in production can be seen at https://gerrit.wikimedia.org/r/#/c/369902/
I have requested a collection of new gerrit repos to split the code up a bit @ https://www.mediawiki.org/w/index.php?title=Gerrit/New_repositories/Requests/Entries&diff=prev&oldid=2604334
This will result in us having a repo for each dashboard setup, a repo for the landing page for the dashboards, and a repo for the R code needed on the stat machines.
All review of the gerrit patches linked would be greatly appreciated!
@Addshore The current situation in production (stat1005):
/srv/analytics-wmde/r-library is where the R packages live;
scripts are run as sudo -u analytics-wmde Rscript <script_name> (Rscript is the command-line front end for executing an R script).
The installation R script is found in /srv/analytics-wmde/installRlib and I hope you don't mind if we keep it there for further R package installations.
Also, with this configuration, I guess we need neither the r_lang nor the r_lang::cran classes in production - I now tend to think Discovery uses these only on Labs - because the installation R script on stat1005 takes care of everything, including which CRAN mirror to use.
Please tell me whether you can take care of (1), or whether I should help figure out where in our Puppet configuration we should declare analytics-wmde's R library (again, it is: /srv/analytics-wmde/r-library).
Hopefully, puppetizing on Labs shouldn't present any similar obstacles.
As for the WDCM scaling (at this point, basically moving away from {maptpx}, which is not maintained anymore, and switching to LDA from MLlib): I've tested SparkR on our fresh Spark 2.1.2 installation on stat1005 (see T139487#3752443) and everything seems to work fine. I would highly appreciate any hints that would enable me to work with {sparklyr} (see T139487#3558525); however, if that takes too much time or is simply too complicated to fix, I will probably go with SparkR.
So, we can probably add this script to the stat box as part of the puppet manifest, but then require the installation of packages to be done manually.
The 'correct' way of doing this would probably be to have another git repo, called WDCM-packages, with a committed version of all the packages, which would then be cloned via puppet.
- Now, to call the WDCM scripts in production productionized - given the current constraints (T174110) mentioned by @mpopov - my understanding is that we need
- to include something like this in our Puppet configuration (somewhere in the configuration - don't ask me where!), where the directory should be /srv/analytics-wmde/r-library, so that Puppet knows we have our own R library path, and
As far as I can tell, $rlib_dir isn't actually used anywhere other than making sure the directory has been created, which is fine - we can do that!
- to place a proper call to .libPaths() in all WDCM R scripts in production, which I will take care of.
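A minimal sketch of what that .libPaths() call could look like at the top of each production script - the package loaded below is just an illustration, not one of the actual WDCM dependencies:

```r
# Prepend the custom analytics-wmde library to R's search path,
# so library() resolves packages installed there first.
.libPaths(c('/srv/analytics-wmde/r-library', .libPaths()))
library(data.table)  # hypothetical example package
```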
Great, I guess this will be in the WDCM repo?
Also, with this configuration, I guess we need neither the r_lang nor the r_lang::cran classes in production - I now tend to think Discovery uses these only on Labs - because the installation R script on stat1005 takes care of everything, including which CRAN mirror to use.
The puppet manifest should probably still require R to be installed, but not worry about any of the packages.
Please tell me whether you can take care of (1), or whether I should help figure out where in our Puppet configuration we should declare analytics-wmde's R library (again, it is: /srv/analytics-wmde/r-library).
I'll take another look at the current puppet patch today and work in everything discussed above.
Hopefully, puppetizing on Labs shouldn't present any similar obstacles.
Labs will be much easier :)
The 'correct' way of doing this would probably be to have another git repo, called WDCM-packages, with a committed version of all the packages, which would then be cloned via puppet.
This could also just be in the WDCM repo.
So I just tried out the idea of installing all the libs and then shoving them in a git repo, but it turns out it's quite a large commit: 66MB and +2542486 lines...
https://gerrit.wikimedia.org/r/#/c/391069/
This is partly due to the fact that all library dependencies also get installed alongside the ones that are explicitly requested.
I guess there are also lots of files / paths that could be added to the gitignore file to make the commit smaller.
Perhaps the best route forward is to add the /srv/analytics-wmde/installRlib script to puppet, but require manual execution on the stat box to install the packages to a directory?
@Addshore A clear case of telepathy, I'm meditating the very same dilemma at the moment...
Yes, R packages are nasty: most of them are greedy with respect to dependencies, and installing one or two might indeed end up installing a significant proportion of CRAN :)
I guess we go with your proposal:
... to add the /srv/analytics-wmde/installRlib script to puppet, but require manual execution on the stat box to install the packages ...
because that seems to be the only reasonable thing that we can do and still stay aligned with the constraints as described in T174110.
Next step: I'm adding the installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R to the WDCM repo. Patch upcoming.
Great, and I guess you can modify the script so that it installs to CWD/libs or /packages or something like that?
And then also adjust all of the .R scripts to read the deps from that directory?
The script installs all necessary WDCM R packages to: /srv/analytics-wmde/r-library
And then also adjust all of the .R scripts to read the deps from that directory?
The deps == package dependencies? Well, no. The script runs a typical R package installation procedure, which means the dependencies are installed automatically during the installation of the desired package. No dependencies will reside in /srv/analytics-wmde/r-library or anywhere else on our systems.
Puppetization as I see it now: ensure /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R is present, and then run it manually.
I think you are proposing to: install a script via Puppet, but run it manually yourselves. This sounds fine.
So this should not be hard-coded; either the path should be relative to where the script is run from, or it should be possible to pass it in as a parameter.
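One hedged way the install script could implement that - both the default path and the package list below are placeholders, not the actual WDCM set:

```r
# Take the target library as an optional command-line argument,
# defaulting to ./r-library relative to the current working directory.
args   <- commandArgs(trailingOnly = TRUE)
libDir <- if (length(args) >= 1) args[[1]] else file.path(getwd(), 'r-library')
dir.create(libDir, recursive = TRUE, showWarnings = FALSE)
install.packages(c('data.table', 'ggplot2'),         # hypothetical package list
                 lib   = libDir,
                 repos = 'https://cran.rstudio.com') # hypothetical CRAN mirror
```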
And then also adjust all of the .R scripts to read the deps from that directory?
The deps == package dependencies? Well, no. The script runs a typical R package installation procedure, which means the dependencies are installed automatically during the installation of the desired package. No dependencies will reside in /srv/analytics-wmde/r-library or anywhere else on our systems.
I mean, how do your R scripts in the WDCM repo know where to load the deps that you have in r-library from?
Such as https://github.com/wikimedia/analytics-wmde-WDCM/blob/master/WDCM_Collect_Items.R#L50 ?
If we clone the WDCM repo to /srv/analytics-wmde/WDCM/src, running the install scripts could install the R packages to /srv/analytics-wmde/WDCM/src/r-library or /srv/analytics-wmde/WDCM/r-library instead of /srv/analytics-wmde/r-library.
This keeps everything WDCM related under the /srv/analytics-wmde/WDCM directory.
Great, we will continue down this path then and abandon the idea of keeping them in git as they are too massive. :)
@Ottomata That's the idea.
I mean, how do your R scripts in the WDCM repo know where to load the deps that you have in r-library from?
By informing the R scripts in the WDCM repo (and the Dashboards) of where the packages reside, via .libPaths() in the respective R code.
If we clone the WDCM repo to /srv/analytics-wmde/WDCM/src, running the install scripts could install the R packages to /srv/analytics-wmde/WDCM/src/r-library or /srv/analytics-wmde/WDCM/r-library instead of /srv/analytics-wmde/r-library.
The installation script /srv/analytics-wmde/installRlib/_installProduction_analytics-wmde.R explicitly defines the path where the packages should be installed, and that is: /srv/analytics-wmde/r-library.
Wait - hopefully R's install.packages() will follow the same path for the dependencies too; let me check!
Change 369902 merged by Ottomata:
[operations/puppet@production] Add ::statistics::wmde::wdcm
So, the code is now cloned on the production machines, along with a script to install the R libraries in the correct location to be used by the scripts.
Over the coming days @GoranSMilovanovic will remove hardcoded paths from all of the scripts and write a single script that can be added to cron in puppet.
Once that is done the WDCM part of puppetization is done (as far as I am aware)
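If it helps, a rough sketch of what that cron entry in Puppet might look like once the single script exists - the resource name, script path, and schedule here are all assumptions:

```puppet
# Hypothetical cron resource for a monthly WDCM run as analytics-wmde.
cron { 'wdcm-engine-run':
    command  => '/usr/bin/Rscript /srv/analytics-wmde/wdcm/src/WDCM_Engine.R',
    user     => 'analytics-wmde',
    minute   => 0,
    hour     => 0,
    monthday => 1,
}
```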
@Addshore Confirming that all necessary R packages in /srv/analytics-wmde/wdcm/src/r-library are in place.
The following steps as I see them:
@Addshore Currently running one, hopefully last, manual WDCM run, unifying the WDCM Engine scripts into one WDCM_Engine.R R script along the way.
@Lydia_Pintscher The quick fix mentioned in T174896 will take place during this operation.
(1) Could we please have the user analytics-wmde added to the analytics-research-client mySQL group?
(2) User analytics-wmde also needs a Hive database and to be able to Sqoop; analytics-wmde should be able to do exactly the same as the user goransm does on stat1004 in what follows:
sqoop import \
  --connect jdbc:mysql://analytics-store.eqiad.wmnet/dewiki \
  --password-file /user/goransm/mysql-analytics-research-client-pw.txt \
  --username research -m 4 \
  --query "select * from wbc_entity_usage where \$CONDITIONS" \
  --split-by eu_row_id \
  --as-avrodatafile \
  --target-dir /user/goransm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=dewiki \
  --delete-target-dir
Note: T171072 was opened in relation to whether or not we can run Sqoop from stat1005; if we can't, we're in serious trouble here (on stat1004 we can run Sqoop, but there is no R and no number crunching there).
OR we can split the WDCM system and run (1) a regular Sqoop operation on cron from stat1004 and (2) regular WDCM updates from stat1005?
Nevertheless, user analytics-wmde needs access to MySQL and a Hive database. If the Sqoop operation will run from stat1004, it must be able to write to the analytics-wmde's Hive database.
There is already a MySQL user that you can use on stat1005 from the analytics-wmde user.
You can find the details @ /etc/mysql/conf.d/research-wmde-client.cnf per https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/statistics/manifests/wmde/graphite.pp;f75a67626f6c0f029ca0caa78987b35c238fde58$39
I have created T180900 for a tiny bit of puppet refactoring I should do regarding the use of this file, but you can use it as is for now!
(2) User analytics-wmde also needs a Hive database and to be able to Sqoop; analytics-wmde should be able to do exactly the same as the user goransm does on stat1004 in what follows:
sqoop import \
  --connect jdbc:mysql://analytics-store.eqiad.wmnet/dewiki \
  --password-file /user/goransm/mysql-analytics-research-client-pw.txt \
  --username research -m 4 \
  --query "select * from wbc_entity_usage where \$CONDITIONS" \
  --split-by eu_row_id \
  --as-avrodatafile \
  --target-dir /user/goransm/wdcmsqoop/wdcm_clients_wb_entity_usage/wiki_db=dewiki \
  --delete-target-dir
Note: T171072 was opened in relation to whether or not we can run Sqoop from stat1005; if we can't, we're in serious trouble here (on stat1004 we can run Sqoop, but there is no R and no number crunching there).
OR we can split the WDCM system and then run a (1) regular Sqoop operation on cron from stat1004 and (2) regular WDCM updates from stat1005?
Nevertheless, user analytics-wmde needs access to MySQL and a Hive database. If the Sqoop operation will run from stat1004, it must be able to write to the analytics-wmde's Hive database.
Splitting it is definitely possible.
The user should already have access to Hive (as some of our scripts already access Hive); however, I don't know about the Sqoop situation or what permissions in Hive the user has.
@Addshore analytics-wmde can run beeline, but it does not have its own database. Two ways to go:
(1) make analytics-wmde use the goransm database (use goransm from beeline, after sudo -u analytics-wmde beeline, works), or
(2) better, have a Hive database created for analytics-wmde.
I will test mySQL from analytics-wmde now.
As for the Sqoop situation: let's split it. I remember @Ottomata extensively describing the difficulties of making Sqoop run from stat1005. Java 8 won't be there before Q1 2018, so... let's simply split WDCM across two servers and have two cron jobs running: one for the Sqoop update, another for the full system update.
(2) definitely sounds like the correct way to go there. I'm sure @Ottomata can help us there.
I will test mySQL from analytics-wmde now.
As for the Sqoop situation: let's split it. I remember @Ottomata extensively describing the difficulties of making Sqoop run from stat1005. Java 8 won't be there before Q1 2018, so... let's simply split WDCM across two servers and have two cron jobs running: one for the Sqoop update, another for the full system update.
That's fine :)
I believe I will have to make a few more puppet changes so that the wdcm manifest can be applied to both servers (with differing configs on each).
I guess this switching logic has to go in profile::statistics::private or higher. I'll talk with @Ottomata and @elukey :)
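A hedged sketch of one way that switching could look - the parameter name, class layout, and script paths below are assumptions, not taken from the actual patch:

```puppet
# Hypothetical parameterized class: stat1004 would run only the
# Sqoop update, while stat1005 runs the full WDCM engine update.
class statistics::wmde::wdcm (
    Boolean $sqoop_only = false,
) {
    if $sqoop_only {
        cron { 'wdcm-sqoop-update':
            command => '/srv/analytics-wmde/wdcm/src/wdcm_sqoop.sh',  # placeholder
            user    => 'analytics-wmde',
        }
    } else {
        cron { 'wdcm-engine-update':
            command => '/usr/bin/Rscript /srv/analytics-wmde/wdcm/src/WDCM_Engine.R',
            user    => 'analytics-wmde',
        }
    }
}
```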
@Addshore Sorry for the typos... I've had a few really bad days (complicated dental intervention + more planned for the coming week), first coffee still unfinished, and already on this ticket - you can just imagine how much I'm in love with it.
@Addshore No user analytics-wmde on stat1004, of course. This is what I will do: I will run the Sqoop step now from stat1004 as goransm just in order to prepare everything for what needs to be done on stat1005, and then we can take care of stat1004 later on.
I have created T180902 for that.
Now that we need this stuff running on multiple stat machines I'll have to do a little refactoring of our puppet stuff.
Tested mySQL access for analytics-wmde via /etc/mysql/conf.d/research-wmde-client.cnf - all fine.
However, and contrary to my previous statement based on an incomplete test, analytics-wmde cannot use goransm to do Hive on stat1005.
Thus, we have to kindly ask @Ottomata to provide a Hive database for the analytics-wmde user. I've opened a ticket for this: T180904
The Sqoop step will definitely migrate to stat1004 once we have the analytics-wmde user set there (T180902).
@Lydia_Pintscher @Addshore @Tobi_WMDE_SW @Jan_Dittrich @Ottomata @mpopov WMDE-Analytics-Engineering
Status/Summary
In the meantime
I wish to thank everyone who has contributed here in the previous weeks, especially @mpopov @Ottomata and @Addshore
I'm going to try and take another pass at this in the coming weeks and try to get this done one way or another.
@Addshore Let me know if I can help. I'm into other things (Wiktionary) right now, but if you need me - just let me know.
Next steps:
Once this is finished, the "puppetization" part of the ticket will be separated (i.e. the ticket will be split into an ETL-with-Spark part, resolved, and a Puppet part, unresolved).
Given that
(1) the CloudVPS component (Shiny dashboards) is being productionized/vertically scaled with shinyproxy/golem - see the test server; and,
(2) on the stat100* servers, the WDCM back-end component will be productionized either with Anaconda (now on all stat100* machines) or with packrat/renv in the next step,
we do not need this anymore.