
Meta-statistics on MediaWiki history reconstruction process
Closed, Resolved · Public · 13 Estimated Story Points

Description

The idea behind this task is to calculate, store and display meta-statistics on the MediaWiki history reconstruction, augmentation and denormalization:

  1. How many MediaWiki records we discard/ignore, and why
  2. In the final output, for each field in the denormalized table (a counting sketch follows this list):
    • How many records have a correct and truthful value for this field
    • How many records have an unknown/null value for this field because it is missing in the original data
    • How many records were given an 'artificial' value for this field because we didn't know the original value but could infer a likely one following a design decision
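As a starting point for point 2, here is a minimal Spark sketch that counts null values per field. The table name wmf.mediawiki_history is an assumption; distinguishing 'artificial' from truthful values would additionally require the issue-tagging mechanism proposed in the implementation section below.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

object FieldNullStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("field-null-stats").getOrCreate()

    // Table name is an assumption; substitute the actual denormalized table.
    val history = spark.table("wmf.mediawiki_history")

    // One sum(...) column per field: number of rows where that field is null.
    val nullCounts = history.select(
      history.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).as(c)): _*
    )
    nullCounts.show(truncate = false)
  }
}
```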

This would:

  1. Help users of the data better understand the results of their queries/slices/dices.
  2. Help us detect and troubleshoot changes in the data that break our MediaWiki history reconstruction algorithm.
  3. Help us improve the algorithm in the future.

A possible implementation could be:

  1. For discarded/ignored records, output a row for each discarded record to an error table with the following schema: origin_table, record_id, reason_of_discard (see the first sketch below)
  2. For fields with null or artificial values, have a field in the denormalized table called 'issues' (or similar) of type array<string>; every time the code infers a value from a design decision or detects a missing value, it should add an issue code (e.g. 452) to that field (see the second sketch below)
  3. Then we could have a wiki page explaining all issue codes and reasons for discard.
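A minimal sketch of idea 1, assuming a Scala/Spark job; the row values, discard reasons, and error-table name are illustrative only, not the actual schema contents:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative row type matching the proposed error-table schema.
case class DiscardedRecord(origin_table: String, record_id: Long, reason_of_discard: String)

object DiscardedRecordsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("discarded-records").getOrCreate()
    import spark.implicits._

    // In the real job these rows would be produced during reconstruction;
    // the values and the target table name here are made up for illustration.
    val discarded = Seq(
      DiscardedRecord("revision", 12345L, "missing_page_id"),
      DiscardedRecord("logging", 67890L, "unparseable_log_params")
    ).toDS()

    discarded.write.mode("append").saveAsTable("wmf.mediawiki_history_errors")
  }
}
```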
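And a minimal sketch of idea 2: adding an 'issues' array<string> column and appending issue codes to it. The user_id column, the tagging rule, and the meaning of code 452 are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, array_union, col, lit, size, typedLit, when}

object IssuesColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("issues-column").getOrCreate()

    // Start from the denormalized table (name assumed) with an empty issues array.
    val history = spark.table("wmf.mediawiki_history")
      .withColumn("issues", typedLit(Seq.empty[String]))

    // Illustrative rule: tag rows with a null user_id using issue code "452".
    val withIssues = history.withColumn(
      "issues",
      when(col("user_id").isNull, array_union(col("issues"), array(lit("452"))))
        .otherwise(col("issues"))
    )

    // Peek at a few tagged rows.
    withIssues.filter(size(col("issues")) > 0).select("issues").show(5, truncate = false)
  }
}
```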

Event Timeline

Nuria triaged this task as Medium priority. Mar 20 2017, 4:24 PM
Nuria raised the priority of this task from Medium to High. Jun 15 2017, 4:27 PM
Nuria moved this task from Wikistats to Operational Excellence Future on the Analytics board.
JAllemandou set the point value for this task to 13.