
Add statistics and error logging to mediawiki history reconstruction job
Closed, Duplicate · Public

Description

Marcel's email (minus salutations, etc.):

Doing the big review (and previous discussions we had) made me think we need a way to count and report on:

  • How many mediawiki records we discard/ignore and why

In the final output, for each field in the denormalized table:

  • How many records have a correct and truthful value for this field
  • How many records have an unknown/null value for this field because it is missing in the original data

  • How many records we gave an 'artificial' value for this field because we didn't know the original value, but could infer a likely one following a design decision (a counting sketch follows this list)
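
Since the reconstruction job runs on Spark, these per-field counts could be computed directly on the denormalized output. Below is a minimal Scala sketch, with hypothetical function and column names, that counts present vs. null/unknown values for a list of fields:

```scala
import org.apache.spark.sql.{DataFrame, functions => F}

// Hypothetical helper: for each listed field of the denormalized history
// DataFrame, count how many rows carry a value and how many are null/unknown.
def fieldValueStats(denormalizedHistory: DataFrame, fields: Seq[String]): DataFrame = {
  val aggs = fields.flatMap { field =>
    Seq(
      F.count(F.when(F.col(field).isNotNull, true)).alias(s"${field}_present"),
      F.count(F.when(F.col(field).isNull, true)).alias(s"${field}_null_or_unknown")
    )
  }
  // agg needs at least one expression, so fields is assumed non-empty here.
  denormalizedHistory.agg(aggs.head, aggs.tail: _*)
}
```

A plain null check cannot tell genuinely missing values apart from 'artificial' ones we inferred, which is exactly what the issue codes proposed below would make countable.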

This would help in:

  • Users of the data better understanding the results of their queries/slices/dices.
  • Us detecting and troubleshooting changes in the data that break our MediaWiki history reconstruction algorithm.
  • Us improving the algorithm in the future.

A possible implementation could be:

  • For discarded/ignored records, output a row for each discarded record to an error table with the following schema: origin_table, record_id, reason_of_discard
  • For fields with null or artificial values, have a field in the denormalized table called 'issues' (or similar) of type array<string>; every time the code infers a value from a design decision or detects a missing value, it should add an issue code (e.g. 452) to that field. Then we could have a wiki page explaining all issue codes and discard reasons (both ideas are sketched below).
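
To make the proposal concrete, here is a rough Scala sketch, with all names hypothetical, of a row type for the error table and of the 'issues' array accumulating codes on denormalized records:

```scala
// Hypothetical row type for the error table of discarded/ignored records.
case class DiscardedRecord(
  originTable: String,    // e.g. "revision", "logging", "archive"
  recordId: Long,
  reasonOfDiscard: String
)

// Hypothetical denormalized event carrying the proposed 'issues' field.
case class DenormalizedEvent(
  eventEntity: String,    // placeholder for the existing denormalized fields
  issues: Seq[String]     // issue codes such as "452", documented on a wiki page
)

// Every time the code infers a value from a design decision or detects a
// missing value, it would append the corresponding issue code.
def withIssue(event: DenormalizedEvent, issueCode: String): DenormalizedEvent =
  event.copy(issues = event.issues :+ issueCode)
```

An array<string> column maps directly to a Spark/Hive ArrayType(StringType) field, and the wiki page would then be the single place documenting what each issue code and discard reason means.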

Well, that was the idea. I didn't write it in the review because I think it is too large a change to be done now. But I think it makes a good follow-up task for the immediate future.