Marcel's email (minus salutations etc.):
Doing the big review (and previous discussions we had) made me think we need a way to count and report on:
- How many mediawiki records we discard/ignore and why
In the final output, for each field in the denormalized table:
- How many records have a correct and truthful value for this field
- How many records have an unknown/null value for this field because it is missing in the original data
- How many records we gave an 'artificial' value for this field because we didn't know the original value, but could infer a likely one following a design decision
This would help:
- Users of the data better understand the results of their queries/slices/dices.
- Us detect and troubleshoot changes in the data that break our MediaWiki history reconstruction algorithm.
- Us improve the algorithm in the future.
A possible implementation could be:
- For discarded/ignored records, output a row for each discarded record to an error table with the following schema: origin_table, record_id, reason_of_discard
- For fields with null or artificial values, have a field in the denormalized table called 'issues' (or similar) of type array<string>. Every time the code infers a value from a design decision or detects a missing value, it should add an issue code (e.g. 452) to that field. We could then have a wiki page explaining all issue codes and discard reasons.
Well, that was the idea. I didn't write it in the review because I think it is too big a change to make now, but I think it makes a good follow-up task for the immediate future.
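
To make the proposal above concrete, here is a minimal sketch in plain Python of how the error table and the 'issues' field could feed the requested counts. It is only an illustration under stated assumptions, not the actual reconstruction code: the column names origin_table, record_id, reason_of_discard and the 'issues' field come from the email, while the class names, the issue registry, every issue code other than the example 452, the field names and the discard reasons are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DiscardedRecord:
    """One row of the proposed error table (schema taken from the email)."""
    origin_table: str
    record_id: int
    reason_of_discard: str


@dataclass
class DenormalizedRow:
    """A row of the denormalized table, reduced to its values and 'issues'."""
    values: dict
    issues: list = field(default_factory=list)  # array<string> of issue codes


# Hypothetical registry: issue code -> (field name, kind of issue).
# Only "452" appears in the email, and its meaning here is an assumption.
ISSUE_REGISTRY = {
    "452": ("page_title", "missing"),
    "453": ("page_title", "artificial"),
    "460": ("user_text", "missing"),
}


def field_report(rows, field_name):
    """Count correct / missing / artificial values for one field."""
    counts = {"correct": 0, "missing": 0, "artificial": 0}
    for row in rows:
        # Kinds of issue recorded against this field on this row.
        kinds = {
            ISSUE_REGISTRY[code][1]
            for code in row.issues
            if code in ISSUE_REGISTRY and ISSUE_REGISTRY[code][0] == field_name
        }
        if "artificial" in kinds:
            counts["artificial"] += 1
        elif "missing" in kinds:
            counts["missing"] += 1
        else:
            counts["correct"] += 1
    return counts


def discard_report(discarded):
    """Count discarded records per (origin table, reason)."""
    return Counter((d.origin_table, d.reason_of_discard) for d in discarded)


# Made-up sample data, for illustration only.
rows = [
    DenormalizedRow({"page_title": "Main_Page", "user_text": "Alice"}),
    DenormalizedRow({"page_title": None, "user_text": "Bob"}, ["452"]),
    DenormalizedRow({"page_title": "Guessed_Title", "user_text": None}, ["453", "460"]),
]
discarded = [
    DiscardedRecord("logging", 17, "unparseable parameters"),
    DiscardedRecord("logging", 42, "unparseable parameters"),
    DiscardedRecord("revision", 99, "no matching page"),
]

print(field_report(rows, "page_title"))
print(field_report(rows, "user_text"))
print(discard_report(discarded))
```

In this sketch each issue code is tied to exactly one field, so the per-field counts can be derived from the 'issues' array alone. Whether codes should be per field or shared across fields is a design choice the email leaves open.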