This is some knowledge transfer from Antoine, and maybe Joseph. @Milimetric has some context but should probably understand the details more deeply.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | VirginiaPoundstone | T345988 [Epic] XML MediaWiki data dumps for right to fork |
| Resolved | | Milimetric | T330296 Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark |
| Open | | None | T346147 Generate XML dumps for simplewiki |
| Resolved | | Milimetric | T335862 Implement job to generate Dump XML files |
| Resolved | | Milimetric | T344691 [Spike] Understand how "large" pages (with lots of revisions) are problematic when writing XML to Hadoop |
Event Timeline
I spoke to Antoine, and it turns out this was not really the biggest issue; some Spark tuning resolved the problem. There are lots of other super interesting details in the XML publishing machinery built as part of T335862: Implement job to generate Dump XML files. See the code here: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/
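The comment doesn't record which Spark settings were actually changed, so purely as an illustration (every value and the job class placeholder below are hypothetical, not taken from the task): skew from "large" pages with many revisions is typically handled with knobs like these.

```shell
# Hypothetical spark-submit tuning for a skewed "large page" workload.
# None of these values come from the task; they only sketch the kind
# of settings usually involved.

spark-submit \
  --conf spark.sql.shuffle.partitions=4096 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.skewJoin.enabled=true \
  --conf spark.executor.memory=16g \
  --conf spark.executor.memoryOverhead=4g \
  --class <dumper-job-main-class> \
  refinery-job.jar

# - more, smaller shuffle partitions spread a page's revisions out
# - adaptive execution + skew-join handling splits oversized partitions
# - extra executor memory gives headroom for large revision text blobs
```

Walking through the actual Gerrit change above would confirm which of these (if any) were the real fix.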
As part of this task, maybe I should walk someone else through the code. That would also help answer the question "which of Antoine's TODOs should we prioritize now?" (TODO lists in T335862#9145218 and T335862#9150443).