Meta-statistics on MediaWiki history reconstruction process
Closed, Resolved · Public · 13 Story Points

Description

The idea behind this task is to calculate, store and display meta statistics on the MediaWiki history reconstruction, augmentation and denormalization:

  1. How many MediaWiki records we discard/ignore and why
  2. In the final output, for each field in the denormalized table:
    • How many records have a correct and truthful value for this field
    • How many records have an unknown/null value for this field because it is missing in the original data
    • How many records we gave an 'artificial' value for this field because we didn't know the original value, but could infer a likely one following a design decision

This would help:

  1. Users of the data better understand the results of their queries/slices/dices.
  2. Us detect and troubleshoot changes in the data that break our MediaWiki history reconstruction algorithm.
  3. Us improve the algorithm in the future.

A possible implementation could be:

  1. For discarded/ignored records, output a row for each discarded record to an error table with the following schema: origin_table, record_id, reason_of_discard
  2. For fields with null or artificial values, add a field to the denormalized table called 'issues' (or similar) of type array<string>. Every time the code infers a value from a design decision or detects a missing value, it should append an issue code (e.g. 452) to that field.
  3. Then we could have a wiki page explaining all issue codes and reasons for discard.
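
The proposal above could be sketched roughly as follows. This is only an illustration in plain Python, not the actual reconstruction job: the issue codes (other than the example value 452), the row fields, the discard reason `missing_page_id` and the placeholder timestamp are all hypothetical. Only the error-table columns (origin_table, record_id, reason_of_discard) and the 'issues' array come from the description.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical issue codes; the task only mentions 452 as an example value.
ISSUE_MISSING_USER_TEXT = "452"   # value missing in the original data
ISSUE_INFERRED_TIMESTAMP = "453"  # value inferred from a design decision


@dataclass
class DiscardedRecord:
    """One row of the error table proposed in the task."""
    origin_table: str
    record_id: int
    reason_of_discard: str


@dataclass
class DenormalizedRow:
    """A simplified denormalized row carrying an 'issues' array<string>."""
    page_id: int
    user_text: Optional[str]
    event_timestamp: Optional[str]
    issues: List[str] = field(default_factory=list)


def process(raw_records):
    """Split raw records into denormalized rows and error-table rows."""
    rows, errors = [], []
    for rec in raw_records:
        if rec.get("page_id") is None:
            # Discarded records go to the error table instead of the output.
            errors.append(DiscardedRecord(rec["table"], rec["id"], "missing_page_id"))
            continue
        row = DenormalizedRow(rec["page_id"], rec.get("user_text"), rec.get("timestamp"))
        if row.user_text is None:
            row.issues.append(ISSUE_MISSING_USER_TEXT)
        if row.event_timestamp is None:
            # Assumed design decision: fall back to a fixed placeholder value.
            row.event_timestamp = "19700101000000"
            row.issues.append(ISSUE_INFERRED_TIMESTAMP)
        rows.append(row)
    return rows, errors


def meta_statistics(rows, errors):
    """Count issue codes across output rows and discard reasons across errors."""
    issue_counts = Counter(code for row in rows for code in row.issues)
    discard_counts = Counter(e.reason_of_discard for e in errors)
    return {
        "rows": len(rows),
        "issues": dict(issue_counts),
        "discards": dict(discard_counts),
    }


raw = [
    {"table": "revision", "id": 1, "page_id": 10, "user_text": "Alice", "timestamp": "20170117174000"},
    {"table": "revision", "id": 2, "page_id": None, "user_text": "Bob", "timestamp": "20170117174100"},
    {"table": "logging", "id": 3, "page_id": 11, "user_text": None, "timestamp": None},
]
rows, errors = process(raw)
stats = meta_statistics(rows, errors)
```

Because a row with no entries in 'issues' is assumed to hold only original values, the per-field counts requested in the description could be derived from these tallies, provided each issue code maps to exactly one field.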

Event Timeline

mforns created this task. Jan 17 2017, 5:40 PM
Restricted Application added a subscriber: Aklapper. Jan 17 2017, 5:40 PM
Nuria triaged this task as Normal priority. Mar 20 2017, 4:24 PM
Nuria raised the priority of this task from Normal to High. Jun 15 2017, 4:27 PM
JAllemandou set the point value for this task to 13.
JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria closed this task as Resolved. Apr 17 2018, 9:24 PM