In this spike, we want to figure out which indicators/metrics are good signals that we are dumping correct data.
We should get familiar with how existing dumps look, perhaps by studying the latest dump of simplewiki.
Examples of things we may be interested in (rough sketches of each check follow the list):
- Compare a new dump with an existing dump of, say, simplewiki and determine: what is the size difference? What is the revision range difference? Is a randomly chosen content item identical? Are the SHA1 hashes equal?
- Are the new dumps valid XML?
- Given two consecutive dumps, how do they differ? You'd expect the size to go up and the revision range to grow.
- Is a random item that is supposed to be suppressed actually suppressed?
- Does a random page_id have a 'complete' revision set?
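For the dump-to-dump comparison, a minimal sketch in Python, assuming both dumps are bz2-compressed pages-meta-history files (the file names are placeholders): it streams each dump, records the file size, the revision id range, and the per-revision SHA1s, and then diffs the two.

```python
# Assumption: both dumps are bz2-compressed pages-meta-history XML files;
# the file names below are placeholders for real simplewiki dumps.
import bz2
import os
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop the export namespace, e.g. '{...}revision' -> 'revision'."""
    return tag.rsplit('}', 1)[-1]

def dump_stats(path):
    """Stream one dump and collect size, revision range, and per-revision SHA1s."""
    min_rev = max_rev = None
    sha1_by_rev = {}  # fine at simplewiki scale; sample instead for larger wikis
    with bz2.open(path, 'rb') as fh:
        for _event, elem in ET.iterparse(fh, events=('end',)):
            tag = local_name(elem.tag)
            if tag == 'revision':
                rev_id, sha1 = None, None
                for child in elem:
                    name = local_name(child.tag)
                    if name == 'id' and rev_id is None:
                        rev_id = int(child.text)
                    elif name == 'sha1':
                        sha1 = (child.text or '').strip()
                if rev_id is not None:
                    sha1_by_rev[rev_id] = sha1
                    min_rev = rev_id if min_rev is None else min(min_rev, rev_id)
                    max_rev = rev_id if max_rev is None else max(max_rev, rev_id)
            elif tag == 'page':
                elem.clear()  # keep memory bounded while streaming
    return {'size': os.path.getsize(path),
            'rev_range': (min_rev, max_rev),
            'sha1': sha1_by_rev}

old = dump_stats('simplewiki-old-pages-meta-history.xml.bz2')
new = dump_stats('simplewiki-new-pages-meta-history.xml.bz2')
print('size difference (bytes):', new['size'] - old['size'])
print('revision range:', old['rev_range'], '->', new['rev_range'])
changed = [r for r, s in old['sha1'].items()
           if r in new['sha1'] and new['sha1'][r] != s]
print('revisions whose SHA1 differs between dumps:', len(changed))
```

The same numbers answer the consecutive-dump question: the size and the upper end of the revision range should normally go up, so a shrink is a red flag (barring heavy suppression).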
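For the XML question, the cheapest check is well-formedness, which we get by streaming the whole file through a parser; validating against the export XSD would additionally need lxml plus the schema file, which this sketch leaves out.

```python
# Well-formedness check: stream the whole file and see whether the parser
# raises. This does not validate against the MediaWiki export schema.
import bz2
import xml.etree.ElementTree as ET

def is_well_formed(path):
    try:
        with bz2.open(path, 'rb') as fh:
            for _event, elem in ET.iterparse(fh, events=('end',)):
                elem.clear()
        return True
    except ET.ParseError as err:
        print(f'{path}: not well-formed XML ({err})')
        return False

print(is_well_formed('simplewiki-new-pages-meta-history.xml.bz2'))
```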
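For the suppression check, a possible sketch: given a revision id we believe is suppressed (the id below is a made-up placeholder), confirm the dump either omits the revision entirely or carries it without text. The `deleted="deleted"` attribute is how the export format usually marks hidden fields, but treat that as an assumption to verify against a real dump.

```python
# Assumption: revision-deleted fields appear as e.g. <text deleted="deleted"/>
# in the export format, and fully suppressed revisions may be absent entirely.
import bz2
import xml.etree.ElementTree as ET

SUPPRESSED_REV_ID = 123456  # hypothetical revision known to be suppressed

def text_is_hidden(path, rev_id):
    """True if the dump hides or omits the text of the given revision."""
    with bz2.open(path, 'rb') as fh:
        for _event, elem in ET.iterparse(fh, events=('end',)):
            if elem.tag.rsplit('}', 1)[-1] != 'revision':
                continue
            children = {c.tag.rsplit('}', 1)[-1]: c for c in elem}
            if 'id' in children and int(children['id'].text) == rev_id:
                text = children.get('text')
                return (text is None
                        or text.get('deleted') == 'deleted'
                        or not (text.text or '').strip())
            elem.clear()
    return True  # revision not in the dump at all

print(text_is_hidden('simplewiki-new-pages-meta-history.xml.bz2',
                     SUPPRESSED_REV_ID))
```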
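For revision-set completeness, one option is to compare the revision ids the dump holds for a page_id against what the live API reports for the same page (suppressed revisions may legitimately differ between the two). The page id below is a placeholder and `requests` is assumed to be available.

```python
# Assumption: the dump is compared against the public simplewiki API;
# the page id is a placeholder, and suppressed revisions may differ.
import bz2
import requests
import xml.etree.ElementTree as ET

API = 'https://simple.wikipedia.org/w/api.php'
PAGE_ID = 9696  # hypothetical page id to spot-check

def revs_in_dump(path, page_id):
    """Return the revision ids the dump contains for one page."""
    with bz2.open(path, 'rb') as fh:
        for _event, elem in ET.iterparse(fh, events=('end',)):
            if elem.tag.rsplit('}', 1)[-1] != 'page':
                continue
            kids = [(c.tag.rsplit('}', 1)[-1], c) for c in elem]
            page_ids = [c for name, c in kids if name == 'id']
            if page_ids and int(page_ids[0].text) == page_id:
                return {int(next(g.text for g in rev
                                 if g.tag.rsplit('}', 1)[-1] == 'id'))
                        for name, rev in kids if name == 'revision'}
            elem.clear()
    return set()

def revs_in_api(page_id):
    """Return the revision ids the MediaWiki API reports for the page."""
    revs = set()
    params = {'action': 'query', 'prop': 'revisions', 'pageids': page_id,
              'rvprop': 'ids', 'rvlimit': 'max', 'format': 'json'}
    while True:
        data = requests.get(API, params=params).json()
        page = data['query']['pages'][str(page_id)]
        revs.update(r['revid'] for r in page.get('revisions', []))
        if 'continue' not in data:
            return revs
        params.update(data['continue'])

dump_revs = revs_in_dump('simplewiki-new-pages-meta-history.xml.bz2', PAGE_ID)
api_revs = revs_in_api(PAGE_ID)
print('in API but missing from dump:', sorted(api_revs - dump_revs))
print('in dump but not in API:', sorted(dump_revs - api_revs))
```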