
Decide how to handle metric corrections
Closed, ResolvedPublic

Description

Right now, we have an intermediate pageviews_corrected table which we use to apply a single correction to the page view data (relating to Internet Explorer traffic from Pakistan).

We also apply corrections in Wikicharts for the traffic data loss and unique devices data loss.

We should adopt a standard approach for handling these corrections.

Options:

  • use an intermediate table.
  • integrate the corrections into the SQL queries. This is simpler to maintain, but the corrected data won't be available for other uses (e.g. a Superset dashboard?)
  • just give up on the corrections altogether and go back to using unmodified page view data. That isn't workable given the importance of correcting for the traffic and unique devices data loss.

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf moved this task from Incoming to Planned Next 2 weeks on the Movement-Insights board.
nshahquinn-wmf renamed this task from Decide what to do with the pageviews_corrected table to Decide how to handle metric corrections. Mar 29 2024, 2:49 AM
nshahquinn-wmf updated the task description.

My new idea is: we apply these corrections as part of calculating metrics and storing them in Data Lake tables.

For example, right now we have the wmf_product.content_interactions table which we use only for the Superset dashboard and which does not include any of the corrections. We will rebuild it with the corrections applied and start using it to calculate our metric values. In this case, the corrections will include putting null values for time periods affected by the 2021-22 traffic data loss.
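To make the idea concrete, here is a rough sketch of applying corrections while calculating a metric, before the result is stored. It is only an illustration under stated assumptions: the `Correction` record, the `CORRECTIONS` list, the `apply_corrections` helper, and the date range are all hypothetical placeholders, not the actual pipeline code or the real affected period.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Correction:
    """A date range whose values for one metric are considered unreliable."""
    metric: str
    start: date  # inclusive
    end: date    # inclusive
    reason: str


# Hypothetical placeholder entry. The real affected range has to come from
# the incident documentation, not from this sketch.
CORRECTIONS = [
    Correction(
        metric="pageviews",
        start=date(2021, 6, 1),
        end=date(2022, 1, 31),
        reason="traffic data loss (placeholder dates)",
    ),
]


def apply_corrections(metric, rows, corrections=CORRECTIONS):
    """Return (day, value) rows with value set to None inside affected ranges."""
    relevant = [c for c in corrections if c.metric == metric]
    corrected = []
    for day, value in rows:
        affected = any(c.start <= day <= c.end for c in relevant)
        corrected.append((day, None if affected else value))
    return corrected


if __name__ == "__main__":
    monthly = [
        (date(2021, 5, 1), 100),
        (date(2021, 6, 1), 80),
        (date(2022, 2, 1), 120),
    ]
    for day, value in apply_corrections("pageviews", monthly):
        print(day.isoformat(), value)
```

Keeping the corrections in one list like this would mean every metric table is built from the same definitions, so a dashboard and a metrics notebook could not drift apart on which periods count as affected.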

Storing all our metric results in Data Lake tables will make our "refined" metrics more accessible (shifting us slightly towards having a true data warehouse) and will be a good way to explore one strategy for building a single pipeline for metrics in the future (i.e. maybe these tables could one day power Wikistats).

nshahquinn-wmf added a subscriber: Hghani.

I consulted @Hghani and he agrees with the approach above.

In today's team meeting we discussed the approach: creating a prototype table with the corrections applied, and narrowing the scope so that we roll it out one metric area at a time, starting with pageviews or unique devices.