
Run WDCM (non-productionized) updates
Closed, Resolved · Public


Run WDCM Engine updates manually until T171258 is resolved.

Reporting back on the updates here.

This ticket will remain open until T171258 is resolved.

Event Timeline


Believe it or not, with the new setup of the stat1004, stat1005, and stat1006 statboxes, we cannot update WDCM even manually!

Please see T181094. I need a machine that provides all of the following: R, Apache Sqoop, beeline, and MySQL access to the MariaDB replicas. Currently I don't seem to have such a machine at my disposal.

The setup we needed on stat1004 was kindly provided by @Ottomata today.

I'm currently running the Sqoop step of the WDCM update there.

  • Sqoop phase completed from stat1004
  • Preparing and running the search and pre-process phase from stat1005 now
  • Re-running the Search and Pre-process phases from stat1005 now (see: T174896#3787409)
  • November 29: this is taking longer than expected (the engine scripts needed a lot of debugging, because they had previously been re-designed for production and are now being run manually again; however, that is a minor problem, while the major reason for the delay is that we now work with much more data).
  • Expected update: if everything goes well, and I predict that it will, on December 1 we will have a fresh update of the WDCM Dashboards, and then we will begin the regular monthly update cycle. The scripts will be run in their non-productionized version until the problems with running automated users (analytics-wmde, for example) on the statboxes are resolved.
  • Pre-processing (Hadoop) completed, machine learning phase running.
  • Machine learning phase completed, transfer to Labs
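For context, the Sqoop phase boils down to pulling tables from the MariaDB replicas into Hadoop. A minimal, purely illustrative sketch of one such import; the host, credentials path, database, table, and target directory here are hypothetical placeholders, not the actual WDCM configuration:

```shell
# Illustrative only: pull one table from a MariaDB replica into HDFS.
# Host, password file, database, table, and target path are hypothetical.
sqoop import \
  --connect "jdbc:mysql://db-replica.example.org:3306/wikidatawiki" \
  --username research \
  --password-file /user/goransm/.sqoop-password \
  --table wb_items_per_site \
  --target-dir /user/goransm/wdcm/wb_items_per_site \
  --num-mappers 4
```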

@Lydia_Pintscher @Tobi_WMDE_SW @Jan_Dittrich @Addshore

  • WDCM update is online
  • WDCM now works with 31,269,756 Wikidata items, more than twice as many as before
  • VERY GOOD NEWS: The interpretability of semantic models has improved drastically thanks to the increase in the number of items processed
  • We have learned something important: there are many influential Wikidata items that are not hierarchically structured through the subclass of (P279) relation
  • Regular monthly updates will be provided at the beginning of every month from now on
  • @Lydia_Pintscher New items will be added gradually as T174896 progresses
  • @Lydia_Pintscher There's still some immediate work to be done in relation to T174896; for example, it turns out that Architectural Structures are always classified as Geographical Objects
  • @Lydia_Pintscher Info on the current update displays in the upper right corner as soon as the dashboards load
  • @Jan_Dittrich your suggestions on the visualizations will be incorporated; I haven't yet had enough time to do it
  • @Jan_Dittrich the dashboard headers are re-designed as suggested
  • @Addshore I will get in touch with respect to puppetization on Labs, since we have to wait for T171258#3777474 before we can puppetize on the statboxes.
  • @Addshore @Lydia_Pintscher I will get in touch with the Analytics team to ask about making the data available via HTTP from stat1005, syncing the update with Labs, and running the whole thing from cron under my user account on the statboxes, until we can puppetize there.
  • Optimization (T180891) is very tedious; it will take me some time

I think that would be all for now.

  • T181871 is now open in order to sync the Labs component of WDCM with the updates on the statboxes run from my user account.
  • We have a new update online (December 4, 2017), where in line with T174896#3802768 Architectural Structure is now removed from Geographical Objects.
  • Next dilemma: should we remove the timezone concepts (UTC_something) from Geographical Objects and introduce a separate Timezone taxon in our WDCM Taxonomy? From the viewpoint of system pragmatics, this could further improve the interpretability of the semantic topic models. @Lydia_Pintscher
  • I will keep this ticket open until the review of the WDCM public datasets (T181871) is finished; then
  • the WDCM update will be fully automated from cron in production and synced with the (hopefully, by then) puppetized Labs WDCM Dashboards.
  • T181871 is resolved; we can publish the WDCM data from stat1005
  • Syncing with Labs starts now. The plan is to have a fully automated, synced test WDCM update performed no later than mid-December
  • As of January 2018, WDCM should run regular, semi-productionized (i.e. productionized on Labs, but running as the goransm user's scripts on the statboxes) monthly updates

Test-run with:

  • defensive programming against the HS2 "too many open connections" type of WDCM failure;
  • everything to /srv/published-datasets/wdcm from stat1005, where it will be available to Labs;
  • no DOM objects being built in the XML parsing step over the SPARQL result set.
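The defensive-programming point above could be as simple as wrapping the HS2 query in a retry loop with backoff. A hedged sketch, assuming a hypothetical `run_query` stand-in for the real beeline/HS2 invocation:

```shell
# Hedged sketch: retry a HiveServer2 query a few times before giving up,
# guarding against transient "too many open connections" failures.
# run_query is a hypothetical stand-in for the real beeline invocation.
run_hive_query_with_retry() {
  local sql="$1"
  local attempt=1
  local max_attempts=5
  while [ "$attempt" -le "$max_attempts" ]; do
    if run_query "$sql"; then
      return 0
    fi
    echo "Attempt $attempt failed; backing off before retrying..." >&2
    sleep "$attempt"
    attempt=$((attempt + 1))
  done
  echo "Query failed after $max_attempts attempts." >&2
  return 1
}
```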

@Lydia_Pintscher @Tobi_WMDE_SW @Addshore @Jan_Dittrich

  • test sync update (Production to Labs) completed;
  • as a result, a new version of the WDCM Dashboards (December 9) is online;
  • going for cron on Labs (WDCM Dashboards Update), then
  • some additional cosmetics for the WDCM Engine scripts in production and going for cron from there.


  • sqoop the data four times monthly, so that we can always have a fresh manual update if and when we need one;
  • WDCM Engine in production updates once monthly;
  • WDCM Dashboards update once monthly.

@Addshore @Tobi_WMDE_SW @Lydia_Pintscher @Jan_Dittrich

  • The sync update, running on an hourly crontab from Labs (wikidataconcepts.eqiad.wmflabs), is fully operational;
  • everything that changes in production (/srv/published-datasets/wdcm) is now copied to Labs (takes only approx. 20 seconds);
  • the full WDCM Engine run (three phases: CollectItems, SearchItems, and Pre-Process, resulting in: Hive database re-creation, ETL + .tsv output files for Labs) takes approx. 32-33 hours:

"21","CollectItems","2017-12-10 01:10:31"
"22","SearchItems","2017-12-11 04:48:44"
"23","Pre-Process","2017-12-11 08:19:51"

  • Once the Engine run is finished, Labs takes approximately 40 minutes to re-create all of its SQL tables and update the WDCM Dashboards.
  • To Do: for some reason, beeline calls from Rscript do not work from crontab on stat1005. I need to resolve this to fully automate the update. It could be as trivial as calling beeline with a specified username and password file, something that is not necessary when Rscript is run manually via nohup.
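The "copy only what changed" part of the hourly sync described above can be sketched as a timestamp comparison over the published files. This is a hedged illustration, not the actual WDCM sync script; in the real setup the source would be /srv/published-datasets/wdcm and the destination the Labs-side data directory:

```shell
# Hedged sketch: mirror files from src to dst, copying a file only when
# the source copy is newer than (or missing from) the destination.
# Directory names are illustrative, not the real WDCM paths.
sync_changed_files() {
  local src="$1"
  local dst="$2"
  local copied=0
  mkdir -p "$dst"
  for f in "$src"/*; do
    [ -f "$f" ] || continue
    local base target
    base=$(basename "$f")
    target="$dst/$base"
    # Copy when the destination is missing or older than the source.
    if [ ! -e "$target" ] || [ "$f" -nt "$target" ]; then
      cp -p "$f" "$target"   # -p preserves mtime, so unchanged files are skipped next run
      copied=$((copied + 1))
    fi
  done
  echo "$copied"
}
```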

If the To Do item doesn't turn out to involve something very nasty, I can say we will have WDCM updates fully automated very soon.

And then, Puppet.

For some reason, beeline calls from Rscript do not work from crontab on stat1005.

Solved after consulting with @Milimetric on IRC: trivial as it may sound, making a hive call instead of a beeline call works. Don't ask me why.
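For the record, the workaround amounts to routing HiveQL through the hive CLI instead of beeline. A hedged sketch of a small wrapper (`run_hql` is hypothetical, not part of WDCM) that prefers hive when it is available and falls back to beeline otherwise:

```shell
# Hedged sketch: route a HiveQL statement through the hive CLI when it is
# available, falling back to beeline otherwise. run_hql is hypothetical.
run_hql() {
  local sql="$1"
  if command -v hive >/dev/null 2>&1; then
    hive -e "$sql"
  else
    beeline -e "$sql"
  fi
}
```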

Thus, the WDCM Engine runs will soon be fully automated on the statboxes.

As soon as this is completed, I will be closing this ticket; everything else related to WDCM in production goes to T179286 and T171258.

  • Running first full Production to Labs WDCM update from crontab on stat1005.
  • When this run finishes, the WDCM Engine in Production (stat1005) will be set to regular updates, and the regular Sqoop updates from stat1004 will be scheduled.
  • Not without drama, of course, but thanks to @Ottomata and @JAllemandou the update is running from my crontab on stat1005.
  • Reporting back in a few days, as soon as the whole WDCM update is completed :)
  • Fresh WDCM update, December 16, 2017 (approx. 31-hour Engine run)
  • WDCM monthly update scheduled;
  • Putting WDCM_Sqoop.R on crontab on stat1004 now (weekly schedule).

That's it.

@Addshore @Lydia_Pintscher @Tobi_WMDE_SW @Jan_Dittrich

  • WDCM_Sqoop.R - refreshes our data sets; runs on the 7th, 14th, 21st, and 27th day of each month
  • WDCM_Engine.R - ETL and statistical modeling for WDCM; updates on the first day of each month
  • WDCM Dashboards - checks for changes in production hourly, updates immediately.
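Put together, the schedule above could be expressed in crontab terms roughly as follows; the times of day and script paths are illustrative placeholders of mine, only the day-of-month patterns come from this ticket:

```shell
# Hypothetical crontab sketch; times of day and paths are illustrative.

# stat1004: WDCM_Sqoop.R refreshes the data sets
0 2 7,14,21,27 * * Rscript /home/goransm/WDCM_Sqoop.R

# stat1005: WDCM_Engine.R runs ETL + statistical modeling
0 2 1 * * Rscript /home/goransm/WDCM_Engine.R

# Labs: check for changes in production hourly
0 * * * * /home/goransm/wdcmSync.sh
```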

And now, for what we all love the most: puppetization of the Labs component.

Closing the task.