
[Spike] Figure out what are good indicators for dumps data quality
Closed, Resolved · Public · 5 Estimated Story Points

Description

In this spike, we want to figure out what are good indicators/metrics to know that we are dumping correct data.

We should get familiar with how existing dumps look, perhaps by studying the latest dump of simplewiki.

Examples of things we may be interested in:

  • Compare a new dump with an existing dump, for, say, simplewiki, and figure out: what is the size difference? What is the revision range difference? Is a random content item equal? Are the SHA1 hashes equal? (See the sketch after this list.)
  • Are the new dumps valid XML?
  • Given two consecutive dumps, how do they differ? You'd expect the size to go up and the revision range to grow, right?
  • Is a random item that is supposed to be suppressed actually suppressed?
  • Does a random page_id have a 'complete' revision set?
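
A rough sketch of how the size and revision-range comparisons could be scripted. This is a sketch only, under assumptions: the file names are placeholders, the dumps are bz2-compressed, and simplewiki is small enough to scan in one streaming pass.

```python
import bz2
import os
import xml.etree.ElementTree as ET

def size_and_rev_range(path):
    """Stream a bz2-compressed XML dump and return (bytes, min rev id, max rev id)."""
    min_rev = max_rev = None
    with bz2.open(path, 'rb') as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag.rsplit('}', 1)[-1] == 'revision':
                for child in elem:
                    if child.tag.rsplit('}', 1)[-1] == 'id':
                        rev_id = int(child.text)
                        min_rev = rev_id if min_rev is None else min(min_rev, rev_id)
                        max_rev = rev_id if max_rev is None else max(max_rev, rev_id)
                        break  # the first direct <id> child is the revision id
                elem.clear()  # keep memory roughly flat while streaming
    return os.path.getsize(path), min_rev, max_rev

old = size_and_rev_range('simplewiki-old-pages-meta-history.xml.bz2')  # placeholder file names
new = size_and_rev_range('simplewiki-new-pages-meta-history.xml.bz2')
print('size:      %d -> %d bytes' % (old[0], new[0]))
print('rev range: %s -> %s' % (old[1:], new[1:]))
```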

Event Timeline

Just a note that sometimes size/page count/visible rev count might go down, if a large batch of pages are deleted for e.g. copyvio (more likely to occur on a small wiki).

Thanks @ArielGlenn. Please add to the list anything you think is worth measuring. The long-term idea is to run a set of data quality checks before cutting a dumps release.

xcollazo set the point value for this task to 5. Sep 19 2023, 1:44 PM

XML Schema Validation (Dan is already doing this using IntelliJ):

  • If your XML files adhere to a predefined XML schema (XSD), you can validate them against the schema to identify structural differences.
  • Any non-conformance with the schema will be flagged as a difference. (A sketch of automating this check follows below.)
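
As one possible way to automate the schema check beyond IntelliJ, the validation could be scripted with lxml. This is a sketch under assumptions: the schema file name (export-0.10.xsd) and the dump file name are placeholders, and the XSD must match the schema version the dump declares.

```python
from lxml import etree

# 'export-0.10.xsd' is a placeholder; it must match the schema version the
# dump declares in its <mediawiki> root element.
schema = etree.XMLSchema(etree.parse('export-0.10.xsd'))

try:
    # Passing schema= makes iterparse validate while streaming, so even a
    # full-history dump does not have to fit in memory.
    for _, elem in etree.iterparse('simplewiki-pages-meta-history.xml', schema=schema):
        elem.clear()
    print('dump conforms to the schema')
except etree.LxmlError as err:
    print('schema violation or malformed XML:', err)
```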

Size and Visual Comparison of the XML:

  • Open the two XML files in a text editor or XML viewer that supports syntax highlighting for easier readability.
  • Manually review the size of the files side by side.

Size and Visual Random Spot Comparison of the tables in HDFS:

  • Use a diff or a SQL MINUS-style comparison between the Hive table (MediaWiki wikitext history) and the Iceberg table (wikitext_raw_rc1); see the sketch after this list.
  • Manually review the sizes of the same partitions (if they exist).
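
One way to automate the "minus" idea is a small PySpark job using exceptAll(). This is a sketch only: the fully qualified table names, the snapshot/wiki_db filters, and the shared columns below are assumptions and would need to be adjusted to the real schemas.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dumps-table-diff').getOrCreate()

# Limit both sides to one small wiki (and one Hive snapshot) to keep the job cheap.
hive_df = (spark.table('wmf.mediawiki_wikitext_history')          # assumed table name
           .where("snapshot = '2023-09' AND wiki_db = 'simplewiki'"))
iceberg_df = (spark.table('wmf_dumps.wikitext_raw_rc1')           # assumed table name
              .where("wiki_db = 'simplewiki'"))

cols = ['page_id', 'revision_id', 'revision_sha1']                # assumed shared columns
only_in_hive = hive_df.select(cols).exceptAll(iceberg_df.select(cols))
only_in_iceberg = iceberg_df.select(cols).exceptAll(hive_df.select(cols))

print('rows only in the Hive table:   ', only_in_hive.count())
print('rows only in the Iceberg table:', only_in_iceberg.count())
```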

Stream Parsing:

  • Compare/parse both files with a streaming XML processor (see the sketch below).
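
A possible shape for that streaming comparison, assuming bz2-compressed files with placeholder names and that both dumps emit revisions in the same order: walk the two dumps in lockstep and report the first (rev_id, sha1) mismatch.

```python
import bz2
import itertools
import xml.etree.ElementTree as ET

def revision_stream(path):
    """Yield (rev_id, sha1) pairs from a bz2-compressed XML dump without
    loading the whole file into memory."""
    with bz2.open(path, 'rb') as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag.rsplit('}', 1)[-1] == 'revision':
                rev_id = sha1 = None
                for child in elem:
                    name = child.tag.rsplit('}', 1)[-1]
                    if name == 'id' and rev_id is None:
                        rev_id = child.text
                    elif name == 'sha1':
                        sha1 = child.text
                yield rev_id, sha1
                elem.clear()  # keep memory roughly flat

pairs = itertools.zip_longest(revision_stream('simplewiki-old.xml.bz2'),   # placeholder names
                              revision_stream('simplewiki-new.xml.bz2'))
for old_rev, new_rev in pairs:
    if old_rev != new_rev:
        print('first divergence:', old_rev, '!=', new_rev)
        break
else:
    print('both dumps yield identical (rev_id, sha1) streams')
```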

@JEbe-WMF we can do manual checks now that we are still in development, but whenever we are ready to do a real dump, we'd like to have these quality checks done automatically.

For the checks you mention above, how do you think we could automate them?

@JEbe-WMF - I'm sorry I had this comment but forgot to Submit! Your plan looks good to me, thank you for putting it together.

In response to Xabriel's question on automation, I think we can figure that out after a round or two of applying these checks manually, so we can see what's relevant to check regularly.