Epic: T347994
Description
While testing our XML publishing job, we noticed that comparing wmf_dumps.wikitext_raw_rc1 against the current XML dumps output shows differences. The backfilling process that reads from wmf.mediawiki_wikitext_history into wmf_dumps.wikitext_raw_rc1 does not transform the data at all, so the current theory is that we miss some nuances when importing the XML into HDFS. Some of these differences can be seen here as hacks to the query that correct them. This task is to track down such differences, fix the ones that originate in the Dumps 1.0 import job, and catalog any other differences in a separate task.
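To make "tracking down differences" concrete, here is a minimal sketch of a per-revision, column-by-column diff between rows parsed from the XML dumps and rows read from the table. It is an illustration only, not the existing job or test code: the function name diff_revisions, the column names (revision_id, revision_comment, revision_text) and the sample mismatch (empty string vs. null) are all assumptions made up for the example.

```python
from typing import Any

Row = dict[str, Any]


def diff_revisions(
    expected: list[Row], actual: list[Row], key: str = "revision_id"
) -> dict[Any, dict[str, tuple[Any, Any]]]:
    """Return, for each key, the columns whose values differ.

    Each entry maps a column name to an (expected, actual) pair. Rows
    missing from one side show up with None for every value on that side.
    """
    expected_by_key = {row[key]: row for row in expected}
    actual_by_key = {row[key]: row for row in actual}

    diffs: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for k in sorted(expected_by_key.keys() | actual_by_key.keys()):
        left = expected_by_key.get(k, {})
        right = actual_by_key.get(k, {})
        changed = {
            col: (left.get(col), right.get(col))
            for col in sorted(left.keys() | right.keys())
            if left.get(col) != right.get(col)
        }
        if changed:
            diffs[k] = changed
    return diffs


# Hypothetical sample rows; the mismatch below is invented for illustration.
xml_rows = [
    {"revision_id": 1, "revision_comment": "", "revision_text": "a"},
    {"revision_id": 2, "revision_comment": "x", "revision_text": "b"},
]
table_rows = [
    {"revision_id": 1, "revision_comment": None, "revision_text": "a"},
    {"revision_id": 2, "revision_comment": "x", "revision_text": "b"},
]

for rev_id, columns in diff_revisions(xml_rows, table_rows).items():
    print(rev_id, columns)
```

Grouping the result by column rather than by revision would give a quick list of which fields need a fix in the import job versus which belong in the follow-up task.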
Acceptance Criteria
- MediawikiDumperSpec integration test runs without any query modification
- MediaWikiXMLParserSpec integration test has test cases covering the problems found and fixed to make the above pass
Required
- Update MediaWikiXMLParserSpec integration tests
- Announce changes to the XML input job so everyone is aware of the fixes. Other pipelines consume the imported dumps and the changes may affect them, so check with their owners and coordinate the deployment accordingly.