Meta-statistics on MediaWiki history reconstruction process
The idea behind this task is to calculate, store, and display meta-statistics on MediaWiki history reconstruction, augmentation, and denormalization:

  1. How many MediaWiki records we discard or ignore, and why
  2. In the final output, for each field in the denormalized table:
    • How many records have a correct and truthful value for this field
    • How many records have an unknown/null value for this field because it is missing in the original data
    • How many records were given an 'artificial' value for this field because we didn't know the original value, but could infer a likely one following a design decision

This would help:

  1. Users of the data better understand the results of their queries/slices/dices.
  2. Us detect and troubleshoot changes in the data that break our MediaWiki history reconstruction algorithm.
  3. Us improve the algorithm in the future.

A possible implementation could be:

  1. For discarded/ignored records, output one row per discarded record to an error table with the following schema: origin_table, record_id, reason_of_discard
  2. For fields with null or artificial values, add a field to the denormalized table called 'issues' (or similar) of type array<string>. Every time the code infers a value from a design decision or detects a missing value, it should add an issue code (e.g. 452) to that field.
  3. Then we could have a wiki page explaining all issue codes and reasons for discard.
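
The steps above could be sketched roughly as follows. This is a minimal in-memory illustration in plain Python, not the actual reconstruction code: the input record layout, the "missing_page_id" discard reason, the fallback timestamp, and the issue codes other than the example 452 are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical issue codes; the real ones would be listed on the wiki page.
ISSUE_MISSING_USER_TEXT = "452"   # value missing in the original data
ISSUE_INFERRED_TIMESTAMP = "453"  # value inferred following a design decision


@dataclass
class DiscardedRecord:
    # Mirrors the proposed error table schema.
    origin_table: str
    record_id: int
    reason_of_discard: str


@dataclass
class DenormalizedRow:
    # Only two illustrative fields; the real table has many more.
    record_id: int
    user_text: Optional[str]
    event_timestamp: Optional[str]
    issues: List[str] = field(default_factory=list)


def reconstruct(raw_records, fallback_timestamp):
    """Split raw records into denormalized rows and discarded-record rows."""
    rows, discarded = [], []
    for rec in raw_records:
        # Assumed discard rule: a record without a page_id cannot be linked.
        if rec.get("page_id") is None:
            discarded.append(
                DiscardedRecord(rec["origin_table"], rec["id"], "missing_page_id")
            )
            continue
        row = DenormalizedRow(rec["id"], rec.get("user_text"), rec.get("timestamp"))
        if row.user_text is None:
            # Missing in the original data: keep null, record the issue.
            row.issues.append(ISSUE_MISSING_USER_TEXT)
        if row.event_timestamp is None:
            # Artificial value chosen by a design decision: record the issue.
            row.event_timestamp = fallback_timestamp
            row.issues.append(ISSUE_INFERRED_TIMESTAMP)
        rows.append(row)
    return rows, discarded


def summarize(rows, discarded):
    """Compute the meta-statistics from the 'issues' field and the error rows."""
    issue_counts = Counter(code for row in rows for code in row.issues)
    discard_counts = Counter(
        (d.origin_table, d.reason_of_discard) for d in discarded
    )
    clean_rows = sum(1 for row in rows if not row.issues)
    return {
        "clean_rows": clean_rows,
        "issue_counts": dict(issue_counts),
        "discard_counts": dict(discard_counts),
    }


raw = [
    {"origin_table": "revision", "id": 1, "page_id": 10,
     "user_text": "Alice", "timestamp": "2017-01-01T00:00:00Z"},
    {"origin_table": "revision", "id": 2, "page_id": 10,
     "user_text": None, "timestamp": "2017-01-02T00:00:00Z"},
    {"origin_table": "revision", "id": 3, "page_id": 11,
     "user_text": "Bob", "timestamp": None},
    {"origin_table": "logging", "id": 4, "page_id": None,
     "user_text": "Carol", "timestamp": "2017-01-03T00:00:00Z"},
]
rows, discarded = reconstruct(raw, fallback_timestamp="1970-01-01T00:00:00Z")
stats = summarize(rows, discarded)
```

Because each issue code is tied to one field and one cause, counting codes in 'issues' gives the per-field breakdown of missing versus artificial values, and rows with an empty 'issues' array are the ones whose values are all taken as-is from the original data.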

