
Meta-statistics on MediaWiki history reconstruction process
Closed, Resolved · Public · 13 Estimated Story Points


The idea behind this task is to calculate, store and display meta statistics on the MediaWiki history reconstruction, augmentation and denormalization:

  1. How many mediawiki records we discard/ignore and why
  2. In the final output, for each field in the denormalized table:
    • How many records have a correct and truthful value for this field
    • How many records have an unknown/null value for this field because it is missing in the original data
    • How many records we gave an 'artificial' value for this field because we didn't know the original value, but could infer a likely one following a design decision
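
The per-field tally above could be sketched as follows (a minimal sketch; the field names, the `inferred_fields` marker, and the classification helper are all hypothetical, not part of the actual reconstruction job):

```python
from collections import Counter

# Classification categories for each field value, per point 2 above.
CORRECT = "correct"        # truthful value from the original data
MISSING = "missing"        # null/unknown in the original data
ARTIFICIAL = "artificial"  # inferred via a design decision

def classify(record, field):
    """Classify one field of a denormalized record (hypothetical logic)."""
    if record.get(field) is None:
        return MISSING
    # Assume inferred values are flagged in a parallel 'inferred_fields' set.
    if field in record.get("inferred_fields", set()):
        return ARTIFICIAL
    return CORRECT

def field_stats(records, fields):
    """Count correct / missing / artificial values for each field."""
    stats = {f: Counter() for f in fields}
    for record in records:
        for f in fields:
            stats[f][classify(record, f)] += 1
    return stats
```

In a real deployment this aggregation would run as part of the Spark denormalization job rather than over in-memory dicts, but the counting logic is the same.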

This would help in:

  1. Letting users of the data better understand the results of their queries/slices/dices.
  2. Detecting and troubleshooting changes in the data that break our MediaWiki history reconstruction algorithm.
  3. Improving the algorithm in the future.

A possible implementation could be:

  1. For discarded/ignored records, output a row for each discarded record to an error table with the following schema: origin_table, record_id, reason_of_discard
  2. For fields with null or artificial values, add a field to the denormalized table called 'issues' (or similar) of type array<string>. Every time the code infers a value from a design decision or detects a missing value, it should add an issue code (e.g. 452) to that field.
  3. Then we could have a wiki page explaining all issue codes and reasons for discard.
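
The two data shapes proposed above could look like this (a sketch only; the class names and the meaning attached to code '452' are hypothetical, and the real tables would live in Hive, not Python objects):

```python
from dataclasses import dataclass, field

@dataclass
class DiscardedRecord:
    """One row of the proposed error table (schema from point 1 above)."""
    origin_table: str
    record_id: int
    reason_of_discard: str

@dataclass
class DenormalizedRow:
    """A denormalized-history row carrying the proposed 'issues' array."""
    event_type: str
    issues: list = field(default_factory=list)

    def add_issue(self, code):
        """Append an issue code when a value is missing or inferred."""
        self.issues.append(code)

# Example: a revision row whose value had to be inferred by a design decision.
row = DenormalizedRow(event_type="revision")
row.add_issue("452")  # hypothetical issue code, documented on the wiki page
```

Keeping the codes as opaque strings in an array<string> column lets new issue types be added without schema changes; the wiki page from point 3 then serves as the single lookup for their meanings.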

Event Timeline

Nuria triaged this task as Medium priority. Mar 20 2017, 4:24 PM
Nuria raised the priority of this task from Medium to High. Jun 15 2017, 4:27 PM
Nuria moved this task from Wikistats to Operational Excellence Future on the Analytics board.
JAllemandou set the point value for this task to 13.