Import siteinfo dumps onto HDFS
Closed, Resolved | Public | 5 Estimated Story Points

Description

Siteinfo dumps contain, for each wiki, magic-word aliases, that is, the localized keywords of wikitext (for instance, REDIRECTION is the French alias of REDIRECT). Importing these files, and possibly transforming them into an easier-to-query (Parquet-based) structure, is a necessary step toward productionizing historical redirect extraction.
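
To illustrate the kind of transformation meant here, the following is a minimal sketch that flattens magic-word aliases into one row per alias, a shape that would be straightforward to store as Parquet. It assumes the siteinfo file is JSON shaped like a MediaWiki `meta=siteinfo` API response, with a `magicwords` list under `query`; the sample document, the function name `flatten_magic_words` and the row layout are hypothetical and not taken from the actual importer.

```python
import json

# Hypothetical sample, assumed to mirror the layout of a siteinfo dump
# (a "magicwords" list under "query"); the real files may differ.
SAMPLE = json.dumps({
    "query": {
        "magicwords": [
            {"name": "redirect", "aliases": ["#REDIRECTION", "#REDIRECT"]},
        ]
    }
})


def flatten_magic_words(wiki_db, siteinfo_json):
    """Return one (wiki_db, magic_word, alias) tuple per alias."""
    doc = json.loads(siteinfo_json)
    magic_words = doc.get("query", {}).get("magicwords", [])
    rows = []
    for entry in magic_words:
        for alias in entry.get("aliases", []):
            rows.append((wiki_db, entry["name"], alias))
    return rows


rows = flatten_magic_words("frwiki", SAMPLE)
# Aliases that a redirect extractor would have to recognize on this wiki.
redirect_aliases = [alias for _, name, alias in rows if name == "redirect"]
```

Keeping one alias per row means a redirect-extraction job could join or filter on `(wiki_db, magic_word)` without parsing nested JSON at query time.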

Event Timeline

Change 540124 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add site-info dump type to importer

https://gerrit.wikimedia.org/r/540124

Change 540124 merged by Nuria:
[analytics/refinery@master] Add site-info dump type to importer

https://gerrit.wikimedia.org/r/540124

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Change 546966 had a related patch set uploaded (by Joal; owner: Joal):
[operations/puppet@production] [WIP] Refactor profile::analytics::refinery::job::import_mediawiki_dumps

https://gerrit.wikimedia.org/r/546966

Change 547169 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update oozie datasets to match dumps import change

https://gerrit.wikimedia.org/r/547169

Change 546966 merged by Elukey:
[operations/puppet@production] Refactor profile::analytics::refinery::job::import_mediawiki_dumps

https://gerrit.wikimedia.org/r/546966

Change 547169 merged by Joal:
[analytics/refinery@master] Update oozie datasets to match dumps import change

https://gerrit.wikimedia.org/r/547169

Checked this morning's logs: they look good (nothing to import yet, but no errors).

Did we document that this data is available?

ping @JAllemandou to see if any docs need to be corrected/added

Nuria set the point value for this task to 5. Nov 22 2019, 4:22 PM