- Mentioned In
  - T180571: Shiny Dashboard for usage of the AdvancedSearch extension
  - T174896: WDCM Structure Dashboard: Refining the WDCM Taxonomy
- Mentioned Here
  - T179286: WDCM Regular Updates
  - T181871: Please review the WDCM public datasets and allow them to access published datasets on stat1005
  - T180891: WDCM: Optimize WDCM Shiny Dashboards
  - P279 404 and 500 error pages
  - T174896: WDCM Structure Dashboard: Refining the WDCM Taxonomy
  - T181094: A statbox to update the WDCM system
  - T171258: WDCM: Puppetization
Believe it or not, with the new setup of the stat1004, stat1005, and stat1006 statboxes, we cannot update WDCM even manually!
Please see T181094. I need a machine that can run all of these: R, Apache Sqoop, beeline, and MySQL (to access the MariaDB replicas). Currently I don't seem to have such a machine at my disposal.
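To make the requirement concrete, a quick availability check could look like the sketch below. This is an assumption of mine, not an existing script; the tool names come from the list above, and `mysql` stands in for whichever client actually reaches the MariaDB replicas.

```shell
# Check whether this statbox has every tool the WDCM update needs.
# Tool names come from the task description; adjust to the real binaries.
missing=0
for tool in R sqoop beeline mysql; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
    missing=$((missing + 1))
  fi
done
echo "missing tools: $missing"
```

Running this on each of stat1004/stat1005/stat1006 would show at a glance which box, if any, satisfies all four requirements.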
- November 29: this is taking longer than expected. The engine scripts needed a lot of debugging, because they had previously been re-designed for production and are now being run manually again; that is a minor problem, though. The major reason for the delay is that we now work with much more data.
- Expected update: if everything goes well, and I predict that it will, on December 1 we will have a fresh update of the WDCM Dashboards, and then we will begin the regular monthly update cycle. The scripts will be run in their non-productionized version until the problems with running automated users (e.g. analytics-wmde) on the statboxes are resolved.
- WDCM update is online
- WDCM now works with 31,269,756 Wikidata items, more than twice as many as before
- VERY GOOD NEWS: The interpretability of semantic models has improved drastically thanks to the increase in the number of items processed
- We have learned something important: many influential Wikidata items are not hierarchically structured through the "subclass of" (P279) relation
- Regular monthly updates will be provided at the beginning of every month from now on
- @Lydia_Pintscher New items will be added gradually as T174896 progresses
- @Lydia_Pintscher There's still some immediate work to be done in relation to T174896; for example, it turns out that Architecture Structures are always Geographical Objects
- @Lydia_Pintscher Info on the current update displays in the upper right corner as soon as the dashboards load
- @Jan_Dittrich your suggestions on the visualizations will be incorporated; I haven't yet had enough time to do it
- @Jan_Dittrich the dashboard headers are re-designed as suggested
- @Addshore I will get in touch regarding puppetization on Labs, since we have to wait for T171258#3777474 before we can puppetize on the statboxes.
- @Addshore @Lydia_Pintscher I will get in touch with the Analytics team to ask about making data available via HTTP from stat1005, sync the update with Labs, and run the whole thing from cron under my user account on the statboxes - until we can puppetize on the statboxes.
- Optimization (T180891) is very tedious; it will take me some time.
I think that would be all for now.
- A new update is online (December 4, 2017): in line with T174896#3802768, Architectural Structure is now removed from Geographical Objects.
- Next dilemma: should we remove the timezone concepts (UTC_something) from Geographical Objects and introduce a separate Timezone taxon in the WDCM Taxonomy? From the viewpoint of system pragmatics, this could further improve the interpretability of the semantic topic models. @Lydia_Pintscher
- T181871 is resolved; we can publish the WDCM data from stat1005
- Syncing with Labs starts now. The plan is to have a fully automated, synced test WDCM update performed no later than mid-December
- As of January 2018, WDCM should run regular, semi-productionized (i.e. productionized on Labs, but running as the goransm user's scripts on the statboxes) monthly updates
- defensive programming against the HS2 "too many open connections" type of WDCM failure;
- everything is published to /srv/published-datasets/wdcm from stat1005, where it will be available to Labs;
- no DOM objects are built in the step that parses the XML of the SPARQL result set.
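The "defensive programming" item above can be illustrated with a bounded retry wrapper. This is a sketch under my own assumptions - the real engine is written in R, and the `retry` helper, attempt count, and back-off delays here are illustrative, not the actual WDCM code:

```shell
# Retry a command a bounded number of times, backing off between attempts,
# so that a transient HS2 "too many open connections" error does not abort
# the whole run.
retry() {
  max=$1; shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$attempt"          # simple linear back-off
    attempt=$((attempt + 1))
  done
  echo "succeeded on attempt $attempt"
}

# Hypothetical usage against HiveServer2 (query and table are illustrative):
# retry 5 beeline -e "SELECT COUNT(*) FROM wdcm_maintable"
```

The design choice is to fail loudly after a fixed budget of attempts rather than loop forever, so a genuinely stuck HiveServer2 surfaces in the logs instead of silently blocking the monthly run.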
- test sync update Labs to Production completed;
- as a result, new WDCM Dashboards (December 9) are online;
- going for cron on Labs (WDCM Dashboards Update), then
- some additional cosmetics for the WDCM Engine scripts in production, and then going for cron there as well.
- sqoop the data 4 times monthly, so that we always have a fresh manual update available if and when we need one;
- WDCM Engine in production updates once monthly;
- WDCM Dashboards update once monthly.
- Sync update running on hourly crontab from Labs (wikidataconcepts.eqiad.wmflabs) is fully operational;
- everything that changes in production (/srv/published-datasets/wdcm) is now copied to Labs (takes approx. 20 seconds only);
- the full WDCM Engine run (three phases: CollectItems, SearchItems, and Pre-Process, resulting in: Hive database re-creation, ETL + .tsv output files for Labs) takes approx. 32-33 hours;
- once the Engine run is finished, Labs takes approximately 40 minutes to re-create all of its SQL tables and update the WDCM Dashboards.
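The hourly sync step probably reduces, in essence, to mirroring the published directory. The sketch below is my own assumption of what such a step could look like - the function name, host, paths, and rsync flags are illustrative, not the actual sync script:

```shell
# Mirror the published WDCM datasets into a local directory, copying only
# files that changed and deleting files removed upstream. This "changed
# files only" behavior is what keeps the hourly check down to seconds
# when nothing new was published.
sync_wdcm() {
  src=$1
  dst=$2
  rsync -a --delete "$src"/ "$dst"/
}

# Hypothetical invocation from the Labs side:
# sync_wdcm stat1005.eqiad.wmnet:/srv/published-datasets/wdcm /srv/wdcm-data
```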
- To Do: for some reason, beeline calls from Rscript do not work from crontab on stat1005. I need to resolve this to fully automate the update. It could be as trivial as calling beeline with a specified username and password file, something that is not necessary when Rscript is run manually via nohup.
If the To Do item doesn't turn out to be related to something very nasty, I can say we will have WDCM updates fully automated very soon.
And then, Puppet.
For some reason, beeline calls from Rscript do not work from crontab on stat1005.
Solved after consulting with @Milimetric on IRC: as trivial as it might sound, doing a hive call instead of a beeline call works. Don't ask me why.
Thus, the WDCM Engine runs will soon be fully automated on the statboxes.
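For the record, the swap is a one-line change in how the query reaches Hive. The sketch below is illustrative (the query and table name are assumptions, not the actual engine code):

```shell
# Under cron, route the query through the hive CLI instead of beeline.
QUERY="SELECT COUNT(*) FROM wdcm_maintable"
# beeline -e "$QUERY"      # works when run manually (e.g. via nohup),
#                          # fails from crontab on stat1005
CMD="hive -e \"$QUERY\""   # works in both contexts
echo "$CMD"
```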
- Running first full Production to Labs WDCM update from crontab on stat1005.
- When this run finishes, the WDCM Engine in Production (stat1005) will be set to regular updates, and the regular Sqoop updates from stat1004 will be scheduled.
- WDCM_Sqoop.R - refreshes our data sets; runs on the 7th, 14th, 21st, and 27th of each month
- WDCM_Engine.R - ETL and statistical modeling for WDCM; updates on the first day of each month
- WDCM Dashboards - checks for changes in production hourly, updates immediately.
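A hypothetical crontab rendering of the schedule above. Only the dates and frequencies come from the task; the run times, script paths, and the name of the hourly sync script are assumptions of mine:

```shell
# m h dom         mon dow  command
0 3  7,14,21,27   *   *    Rscript /home/goransm/WDCM_Sqoop.R   # refresh the sqooped datasets
0 4  1            *   *    Rscript /home/goransm/WDCM_Engine.R  # monthly ETL + modeling run
0 *  *            *   *    /usr/local/bin/wdcm_sync.sh          # hourly production-to-Labs check
```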
And now, for what we all love the most: puppetization of the Labs component.
Closing the task.