
Skip Wikidata when loading XML dumps to the Data Lake
Closed, Resolved · Public · 5 Estimated Story Points

Description

As part of SDS 2.6.2, I've been investigating the data dependencies of the movement metrics. Our critical path takes around 25 days.

By far the longest portion (~19 days) is waiting for the XML dumps to be generated. But after the first 7 days (when the English Wikipedia dump arrives), we're waiting only on the Wikidata dump. I doubt that anyone is regularly using the Wikidata XML dump, since wmf.wikidata_entity (which comes from the JSON dump) is much better and faster. The XML dump is apparently the only one that contains non-current data, but that's probably a very rare need.

Can we skip loading the Wikidata XML dump altogether? Other strategies, like splitting it out into a separate job, would also work, but simply skipping it would be much easier and, since no one appears to use the data, probably harmless. A sketch of the idea follows.
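For illustration, here is a minimal Python sketch of the skip-list idea. Every name here (skip_wikis, wikis_to_import, the dump directory layout) is a hypothetical stand-in, not the actual refinery code; the real change is tracked in the Gerrit patches below.

```python
from pathlib import Path

# Hypothetical skip list; in the real script this would come from a CLI option.
SKIP_WIKIS = {"wikidatawiki"}

def wikis_to_import(dump_root: str, skip_wikis: set[str]) -> list[str]:
    """Return the wikis whose XML dumps should be loaded, minus the skip list."""
    all_wikis = [p.name for p in Path(dump_root).iterdir() if p.is_dir()]
    return [wiki for wiki in sorted(all_wikis) if wiki not in skip_wikis]

if __name__ == "__main__":
    # With wikidatawiki skipped, the import no longer has to wait the extra
    # ~12 days for the Wikidata XML dump to finish generating.
    for wiki in wikis_to_import("/example/path/to/xmldumps", SKIP_WIKIS):
        print(f"would import {wiki}")
```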

Event Timeline

Excellent! Don't forget to announce the plan first, just in case there is someone unexpectedly using the data; I recommend the working-with-data Slack channel and the analytics-announce mailing list.

Change 1006957 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Add deny-list option to import_mediawiki_dumps

https://gerrit.wikimedia.org/r/1006957

Change 1006957 merged by Joal:

[analytics/refinery@master] Add skip-list option to import_mediawiki_dumps

https://gerrit.wikimedia.org/r/1006957
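As a rough sketch of how such a skip-list option might be wired up (the flag name and parsing below are assumptions for illustration; see the merged patch above for the actual implementation):

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Import MediaWiki XML dumps")
    # Hypothetical flag name; the real option is defined in the refinery patch.
    parser.add_argument(
        "--skip-wikis",
        default="",
        help="Comma-separated list of wikis whose dumps should not be "
             "imported, e.g. --skip-wikis wikidatawiki",
    )
    return parser.parse_args()

args = parse_args()
skip_wikis = {w for w in args.skip_wikis.split(",") if w}
```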

Change 1007301 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update analytics mediawiki_dumps_import

https://gerrit.wikimedia.org/r/1007301

Change 1007301 merged by Btullis:

[operations/puppet@production] Update analytics mediawiki_dumps_import

https://gerrit.wikimedia.org/r/1007301