This is some knowledge transfer from Antoine, and maybe Joseph. @Milimetric has some context but should probably understand the details more deeply.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | VirginiaPoundstone | T345988 [Epic] XML MediaWiki data dumps for right to fork |
| Resolved | | Milimetric | T330296 Dumps 2.0 Phase I: Proof of concept for MediaWiki XML content dump via Event Platform, Iceberg and Spark |
| Open | | None | T346147 Generate XML dumps for simplewiki |
| Resolved | | Milimetric | T335862 Implement job to generate Dump XML files |
| Resolved | | Milimetric | T344691 [Spike] Understand how "large" pages (with lots of revisions) are problematic when writing XML to Hadoop |
Event Timeline
I spoke to Antoine, and it turns out this was not really the biggest issue; some Spark tuning resolved the problem. There are lots of other super interesting details in the XML publishing machinery built as part of T335862: Implement job to generate Dump XML files. See the code here: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/938941/
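The comment doesn't record which Spark settings were actually changed, so purely as an illustration (every value and the job class placeholder below are hypothetical, not taken from the task): skew from "large" pages with many revisions is typically handled with knobs like these.

```shell
# Hypothetical spark-submit tuning for a skewed "large page" workload.
# None of these values come from the task; they only sketch the kind
# of settings usually involved.

spark-submit \
  --conf spark.sql.shuffle.partitions=4096 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.skewJoin.enabled=true \
  --conf spark.executor.memory=16g \
  --conf spark.executor.memoryOverhead=4g \
  --class <dumper-job-main-class> \
  refinery-job.jar

# - more, smaller shuffle partitions spread a page's revisions out
# - adaptive execution + skew-join handling splits oversized partitions
# - extra executor memory gives headroom for large revision text blobs
```

Walking through the actual Gerrit change above would confirm which of these (if any) were the real fix.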
As part of this task, maybe I should walk someone else through the code. That would also help answer the question "which of Antoine's TODOs should we prioritize now?" (TODO lists in T335862#9145218 and T335862#9150443).