Fix Import of Dumps 1.0 XML into HDFS
Closed, Resolved | Public | 1 Estimated Story Points

Description

Epic: T347994

While testing our XML publishing job, we compared wmf_dumps.wikitext_raw_rc1 to the current XML dumps output and found differences. The backfilling process that reads from wmf.mediawiki_wikitext_history into wmf_dumps.wikitext_raw_rc1 does not transform the data at all, so the current theory is that we miss some nuances when importing the XML into HDFS. Some of these differences can be seen here as hacks to the query that correct them. This task is to track down such differences, fix the ones that originate in the Dumps 1.0 import job, and catalog any others in a separate task.
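
To illustrate the kind of comparison described above, here is a minimal sketch in plain Python. It is not the query or job used for this task. The function name diff_revisions, the key revision_id, and the field names in the sample rows are assumptions made for illustration, and the trailing-space mismatch is invented sample data rather than a difference actually found in the dumps.

```python
from typing import Dict, List

Row = Dict[str, object]


def diff_revisions(imported: List[Row], published: List[Row],
                   key: str = "revision_id") -> Dict[str, list]:
    """Compare two sets of revision rows and report how they differ.

    `imported` stands in for rows read from the HDFS import,
    `published` for rows parsed from the XML dumps output.
    Both names and the default key are hypothetical.
    """
    imported_by_id = {row[key]: row for row in imported}
    published_by_id = {row[key]: row for row in published}

    # Revisions present on only one side.
    missing = sorted(set(published_by_id) - set(imported_by_id))
    extra = sorted(set(imported_by_id) - set(published_by_id))

    # Field-level mismatches for revisions present on both sides.
    mismatched = []
    for rev_id in sorted(set(imported_by_id) & set(published_by_id)):
        a, b = imported_by_id[rev_id], published_by_id[rev_id]
        for field in sorted((set(a) | set(b)) - {key}):
            if a.get(field) != b.get(field):
                mismatched.append((rev_id, field, a.get(field), b.get(field)))

    return {
        "missing_from_imported": missing,
        "extra_in_imported": extra,
        "mismatched": mismatched,
    }


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    imported_rows = [
        {"revision_id": 1, "text": "a"},
        {"revision_id": 2, "text": "b "},
    ]
    published_rows = [
        {"revision_id": 1, "text": "a"},
        {"revision_id": 2, "text": "b"},
        {"revision_id": 3, "text": "c"},
    ]
    print(diff_revisions(imported_rows, published_rows))
```

Grouping mismatches by field makes it easier to decide which ones belong to the import job and which should be cataloged separately. At the scale of the real tables, the same idea would more likely be expressed as a join in the query engine than as in-memory dictionaries.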

Acceptance Criteria

Required

  • Update MediaWikiXMLParserSpec integration tests
  • Announce changes to the XML input job so everyone is aware of the fixes. People are using the imported dumps as part of other pipelines, and the changes may affect them, so check and coordinate the deployment accordingly.

Event Timeline

Change 965792 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery/source@master] Improve fidelity of dumps import

https://gerrit.wikimedia.org/r/965792

Change 966914 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] Update schema of mediawiki_wikitext_*

https://gerrit.wikimedia.org/r/966914

WDoranWMF set the point value for this task to 1. (Oct 30 2023, 12:15 PM)

Change 966914 merged by Milimetric:

[analytics/refinery@master] Update schema of mediawiki_wikitext_*

https://gerrit.wikimedia.org/r/966914

Change 965792 merged by Milimetric:

[analytics/refinery/source@master] Improve fidelity of dumps import

https://gerrit.wikimedia.org/r/965792