
[Spike] Figure out what are good indicators for dumps data quality
Closed, Resolved · Public · 5 Estimated Story Points

Description

In this spike, we want to figure out what are good indicators/metrics to know that we are dumping correct data.

We should get familiar with how existing dumps look, perhaps by studying the latest dump of simplewiki.

Examples of things we may be interested in:

  • Compare a new dump with an existing dump, for, say, simplewiki, and figure out: what is the size difference? What is the revision range difference? Is a random content item equal? Are the SHA1 hashes equal? (See the sketch after this list.)
  • Are the new dumps valid XML?
  • Given two consecutive dumps, how do they differ? You'd expect the size to go up and the revision range to grow, right?
  • Is a random item that is supposed to be suppressed actually suppressed?
  • Does a random page_id have a 'complete' revision set?
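
A rough sketch of how the size and revision-range comparisons could be scripted. This is a sketch only, under assumptions: the file names are placeholders, the dumps are bz2-compressed, and simplewiki is small enough to scan in one streaming pass.

```python
import bz2
import os
import xml.etree.ElementTree as ET

def size_and_rev_range(path):
    """Stream a bz2-compressed XML dump and return (bytes, min rev id, max rev id)."""
    min_rev = max_rev = None
    with bz2.open(path, 'rb') as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag.rsplit('}', 1)[-1] == 'revision':
                for child in elem:
                    if child.tag.rsplit('}', 1)[-1] == 'id':
                        rev_id = int(child.text)
                        min_rev = rev_id if min_rev is None else min(min_rev, rev_id)
                        max_rev = rev_id if max_rev is None else max(max_rev, rev_id)
                        break  # the first direct <id> child is the revision id
                elem.clear()  # keep memory roughly flat while streaming
    return os.path.getsize(path), min_rev, max_rev

old = size_and_rev_range('simplewiki-old-pages-meta-history.xml.bz2')  # placeholder file names
new = size_and_rev_range('simplewiki-new-pages-meta-history.xml.bz2')
print('size:      %d -> %d bytes' % (old[0], new[0]))
print('rev range: %s -> %s' % (old[1:], new[1:]))
```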

Event Timeline

Just a note that sometimes size/page count/visible rev count might go down, if a large batch of pages are deleted for e.g. copyvio (more likely to occur on a small wiki).

Thanks @ArielGlenn. Please add to the list anything you think is worth measuring. The long-term idea is to run a set of data quality checks before cutting a dumps release.

xcollazo set the point value for this task to 5. Sep 19 2023, 1:44 PM

XML Schema Validation (Dan is already doing this using IntelliJ):

  • If your XML files adhere to a predefined XML schema (XSD), you can validate them against the schema to identify structural differences.
  • Any non-conformance with the schema will be flagged as a difference. (A sketch of automating this check follows below.)
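
As one possible way to automate the schema check beyond IntelliJ, the validation could be scripted with lxml. This is a sketch under assumptions: the schema file name (export-0.10.xsd) and the dump file name are placeholders, and the XSD must match the schema version the dump declares.

```python
from lxml import etree

# 'export-0.10.xsd' is a placeholder; it must match the schema version the
# dump declares in its <mediawiki> root element.
schema = etree.XMLSchema(etree.parse('export-0.10.xsd'))

try:
    # Passing schema= makes iterparse validate while streaming, so even a
    # full-history dump does not have to fit in memory.
    for _, elem in etree.iterparse('simplewiki-pages-meta-history.xml', schema=schema):
        elem.clear()
    print('dump conforms to the schema')
except etree.LxmlError as err:
    print('schema violation or malformed XML:', err)
```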

Size and Visual Comparison of the XML:

  • Open the two XML files in a text editor or XML viewer that supports syntax highlighting for easier readability.
  • Manually review the size of the files side by side.

Size and Visual Random Spot Comparison of the tables in HDFS:

  • Use a diff or a SQL MINUS-style comparison between the Hive table (MediaWiki wikitext history) and the Iceberg table (wikitext_raw_rc1); see the sketch after this list.
  • Manually review the sizes of the same partitions (if they exist).
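
One way to automate the "minus" idea is a small PySpark job using exceptAll(). This is a sketch only: the fully qualified table names, the snapshot/wiki_db filters, and the shared columns below are assumptions and would need to be adjusted to the real schemas.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dumps-table-diff').getOrCreate()

# Limit both sides to one small wiki (and one Hive snapshot) to keep the job cheap.
hive_df = (spark.table('wmf.mediawiki_wikitext_history')          # assumed table name
           .where("snapshot = '2023-09' AND wiki_db = 'simplewiki'"))
iceberg_df = (spark.table('wmf_dumps.wikitext_raw_rc1')           # assumed table name
              .where("wiki_db = 'simplewiki'"))

cols = ['page_id', 'revision_id', 'revision_sha1']                # assumed shared columns
only_in_hive = hive_df.select(cols).exceptAll(iceberg_df.select(cols))
only_in_iceberg = iceberg_df.select(cols).exceptAll(hive_df.select(cols))

print('rows only in the Hive table:   ', only_in_hive.count())
print('rows only in the Iceberg table:', only_in_iceberg.count())
```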

Stream Parsing:

  • Compare/parse both files with a streaming XML processor (see the sketch below).
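
A possible shape for that streaming comparison, assuming bz2-compressed files with placeholder names and that both dumps emit revisions in the same order: walk the two dumps in lockstep and report the first (rev_id, sha1) mismatch.

```python
import bz2
import itertools
import xml.etree.ElementTree as ET

def revision_stream(path):
    """Yield (rev_id, sha1) pairs from a bz2-compressed XML dump without
    loading the whole file into memory."""
    with bz2.open(path, 'rb') as fh:
        for _, elem in ET.iterparse(fh):
            if elem.tag.rsplit('}', 1)[-1] == 'revision':
                rev_id = sha1 = None
                for child in elem:
                    name = child.tag.rsplit('}', 1)[-1]
                    if name == 'id' and rev_id is None:
                        rev_id = child.text
                    elif name == 'sha1':
                        sha1 = child.text
                yield rev_id, sha1
                elem.clear()  # keep memory roughly flat

pairs = itertools.zip_longest(revision_stream('simplewiki-old.xml.bz2'),   # placeholder names
                              revision_stream('simplewiki-new.xml.bz2'))
for old_rev, new_rev in pairs:
    if old_rev != new_rev:
        print('first divergence:', old_rev, '!=', new_rev)
        break
else:
    print('both dumps yield identical (rev_id, sha1) streams')
```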

@JEbe-WMF we can do manual checks now that we are still in development, but whenever we are ready to do a real dump, we'd like to have these quality checks done automatically.

For the checks you mention above, how do you think we could automate them?

@JEbe-WMF - I'm sorry I had this comment but forgot to Submit! Your plan looks good to me, thank you for putting it together.

In response to Xabriel's question on automation, I think we can figure that out after a round or two of applying these checks manually, so we can see what's relevant to check regularly.