Current code from T335862 (available at https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941) utilizes data from wmf_dumps.wikitext_raw_rc0.
Let's switch it to wmf_dumps.wikitext_raw_rc1.
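A minimal sketch of what the switch amounts to, assuming the job reads the table by name through Spark (this is not the actual MediawikiDumper code; the object, method, and `wiki_db` partition column are assumptions for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object WikitextRawReader {
  // Hypothetical default; in practice the job would take this as a parameter,
  // so moving from wikitext_raw_rc0 to wikitext_raw_rc1 is a one-line change.
  val DefaultTable = "wmf_dumps.wikitext_raw_rc1"

  def readRevisions(spark: SparkSession,
                    wikiDb: String,
                    table: String = DefaultTable): DataFrame =
    spark.read
      .table(table)
      .where(col("wiki_db") === wikiDb) // assumed partition column
}
```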
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Milimetric | T330296 Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark |
| Open | | None | T346147 Generate XML dumps for simplewiki |
| Resolved | | Milimetric | T335862 Implement job to generate Dump XML files |
| Resolved | | Milimetric | T346378 Update XML dump generation code to use wmf_dumps.wikitext_raw_rc1 schema. |
Big progress here, with one long but useful caveat, captured in this diff: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/12..13/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikidumper/MediawikiDumper.scala. Basically, our data doesn't match what the XML output expects, and I forced it to work by changing the query. This serves as a useful guide to what we need to fix in the input or the transformation (probably the input pipeline from XML to HDFS).
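For context, a hypothetical sketch of the kind of query adjustment described above (not the actual patch): making the Iceberg data line up with what the XML dump expects by deduplicating revisions and ordering them per page. All column names (`wiki_db`, `page_id`, `revision_id`, `revision_timestamp`) are assumptions about the rc1 schema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object XmlNormalization {
  def normalizeForXml(revisions: DataFrame): DataFrame = {
    // Keep one row per revision in case the event stream delivered duplicates.
    val latestPerRevision = Window
      .partitionBy(col("wiki_db"), col("revision_id"))
      .orderBy(col("revision_timestamp").desc)

    revisions
      .withColumn("rn", row_number().over(latestPerRevision))
      .where(col("rn") === 1)
      .drop("rn")
      // XML dumps list revisions in order within each page.
      .orderBy(col("page_id"), col("revision_timestamp"))
  }
}
```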
Just to wrap this task up: the code that's merged now uses the rc1 schema. This was mostly done by Antoine. Any remaining work on XML publishing has been broken up into separate tasks, all of which are part of epic T347994. This task can be considered done.