
Research Recommendations for Data Loss Estimation
Closed, Resolved · Public

Description

We asked @MGerlach from the Research team to review our Pageview Data Loss Estimation Recommendations and answer a few questions about them:

  • Are the estimation techniques we recommend to account for the pageview data loss reasonable? Do you have any objections to our making this recommendation for future use of the pageview data?
  • Are you comfortable with us making a more formal recommendation to staff in the Foundation (to avoid using the uncorrected data, and to use corrected data only when a correction is possible)?

In this task, we are documenting @MGerlach's recommendations so we can prioritize and act on them, breaking them down into other tasks if needed.

Are the estimation techniques we recommend to account for the pageview data loss reasonable? Do you have any objections to our making this recommendation for future use of the pageview data?

  • Yes (the estimation techniques are reasonable). No (no objections to the recommendation).
  • The approach to estimate the data loss makes sense to me from what I understand. You calculate the fraction of traffic carried by the affected nodes in a “normal” month and assume that those nodes would have carried the same fraction of traffic during the data-loss period. You provide evidence for this assumption with a test-retest comparison against another “normal” month. You then calculate the average as well as the min/max range across two different periods (in which different nodes were affected). (A sketch of this calculation appears after this list.)
  • I agree with the set of recommendations:
    • Avoiding data from that period, if possible, is a good first recommendation.
    • If using that data, one should correct by an average factor.
  • I would suggest adding an explicit comment that this only captures an average and that the data loss is not completely homogeneous.
    • In this context, I consider Method 1 (the range) not an alternative method but more in-depth information about the “error bar” associated with the average correction factor.
    • More importantly, perhaps, we should explicitly point out that the data loss is not homogeneous with respect to geographical region (and by extension the project). As I understand it, the loss can vary from 0% to almost 20%, which is a much stronger effect than the range in Method 1 would suggest.
    • Are there other factors along which the average data loss varies strongly that we should call out?
  • I like the dashboards in the Resources on slide 13 for estimating the data loss. Could one join the two dashboards into a single dashboard that allows the same custom filters and then returns the average for the two periods, as you describe for Method 2 (and potentially also the min/max range)?
  • You mention some of the affected datasets (e.g. webrequest). If possible, we should try to call out all affected datasets. For example, someone using the unique-devices dataset might not anticipate that it is probably affected too (I assume, but am not 100% sure). What I can think of:
  • I hope you are not going to remove the public pageviews data for that period. While the numbers are not 100% reliable, I would rather see us spend effort on publicly communicating the data loss and the strategies to correct for it (as you present above). I don't know to what extent the numbers about the average/aggregated data loss (similar to what is in the dashboards) could be shared publicly.
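
To make the approach above concrete, here is a minimal sketch of the calculation in Python. All traffic shares and counts are hypothetical placeholders (the real analysis runs against internal webrequest data), so this illustrates the arithmetic only, not the actual numbers.

```
# Hypothetical sketch of the correction approach described above.
# All traffic shares and counts are invented placeholders.

# Share of total traffic carried by the affected nodes, measured in two
# different "normal" (unaffected) months (the test-retest comparison).
affected_share_month_a = 0.040  # hypothetical
affected_share_month_b = 0.045  # hypothetical

# Assumption: the affected nodes would have carried the same share of
# traffic during the data-loss period.
avg_share = (affected_share_month_a + affected_share_month_b) / 2

# Pageviews actually recorded during the data-loss period.
observed = 1_000_000  # hypothetical

# Average correction factor (as in Method 2): scale observed counts up
# by the share of traffic assumed lost.
corrected = observed / (1 - avg_share)

# Min/max range (as in Method 1): the "error bar" around the average.
low = observed / (1 - min(affected_share_month_a, affected_share_month_b))
high = observed / (1 - max(affected_share_month_a, affected_share_month_b))

print(f"corrected ~ {corrected:,.0f} (range {low:,.0f} to {high:,.0f})")
```

Note that a single average factor hides the regional heterogeneity called out above: where the loss varies from 0% to almost 20% by region, per-region correction factors would give noticeably different results.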

Are you comfortable with us making a more formal recommendation to staff in the Foundation (to avoid using the uncorrected data, and to use corrected data only when a correction is possible)?

  • Yes. I believe an explicit recommendation is necessary because i) pageviews is one of the few metrics we have and is thus very commonly used; ii) the impact of the data loss is so severe that one might draw wrong conclusions; and iii) the underlying technical issue is not trivial (particular nodes serving particular traffic during a particular time), so it might not be clear how to correct for it (and different people could end up using different methodologies, leading to inconsistencies). The methodology you used is sound and, from what I understand, is the best we can do to correct for it.

In one of the threads you also asked: Would you want to share any of this externally with researchers?

  • Yes, definitely. For example, via the wiki-research-l mailing list. Wikipedia's pageviews are used extensively by researchers, and I don't think there is sufficient awareness about this issue (and how to potentially correct for it).

Event Timeline

mpopov triaged this task as High priority. Nov 29 2022, 6:09 PM

From T323182#8485334:

@odimitrijevic and I have drafted a data quality report about the pageview data loss, and Olja is working on publishing it to Wikitech.

For the Wikitech documentation, we are not providing recommendations for correcting the data, because the data we use to estimate the impact is not publicly available. There is a request for a dataset with corrected pageview data (https://phabricator.wikimedia.org/T310732), for which we would use the approach recommended in Pageview Data Loss Estimation Recommendations (internal doc, T314197).

We are recommending that people avoid using data from the data loss period. The Wikitech documentation also lists the other known affected datasets (including unique devices, as @MGerlach mentioned). We also flag the regional differences in the impact of the data loss (to Martin's point about the loss not being completely homogeneous); a hypothetical illustration of why per-region factors matter follows below.
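
To illustrate that point about regional differences, here is a hypothetical sketch of applying per-region correction factors to an aggregated pageview table, again in Python (pandas). The regions, loss fractions, and column names are invented for illustration and do not reflect the actual corrected dataset proposed in T310732.

```
import pandas as pd

# Hypothetical per-region loss fractions during the data-loss period,
# illustrating the 0% to ~20% spread mentioned above (values invented).
loss_by_region = pd.DataFrame({
    "region": ["region_a", "region_b", "region_c"],
    "loss_fraction": [0.00, 0.08, 0.19],
})

# Hypothetical observed pageviews aggregated by region.
pageviews = pd.DataFrame({
    "region": ["region_a", "region_b", "region_c"],
    "observed_views": [500_000, 300_000, 200_000],
})

# Join and scale each region's counts by its own correction factor,
# rather than by one global average.
corrected = pageviews.merge(loss_by_region, on="region")
corrected["corrected_views"] = (
    corrected["observed_views"] / (1 - corrected["loss_fraction"])
)

print(corrected[["region", "observed_views", "corrected_views"]])
```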

@MGerlach - @odimitrijevic and I have published information about the data loss on Wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues/2021-06-04_Traffic_Data_Loss. This is part of a new space on Wikitech where we have published reports & recommendations about what to do with the data: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues