
[Spike] Understand how "large" pages (with lots of revisions) are problematic when writing XML to Hadoop
Closed, Resolved | Public | 8 Estimated Story Points

Description

This is some knowledge transfer from Antoine and maybe Joseph. @Milimetric has some context but should probably understand the details more deeply.

Event Timeline

WDoranWMF set the point value for this task to 8. (Aug 24 2023, 2:16 PM)

I spoke to Antoine, and it turns out this was not really the biggest issue: some Spark tuning took care of the problem. There are lots of other super interesting details in the XML publishing machinery that's built as part of T335862: Implement job to generate Dump XML files. See the code here: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/

As part of this task, maybe I should walk someone else through the code. That would also help answer the question "which of Antoine's TODOs should we prioritize now?" (TODO lists in T335862#9145218 and T335862#9150443)