
Research Recommendations for Data Loss Estimation
Closed, Resolved · Public

Description

We asked @MGerlach from the Research team to review our Pageview Data Loss Estimation Recommendations and answer a few questions about them:

  • Are the estimation techniques we recommend to account for the pageview data loss reasonable? Do you have any objections to our making this recommendation for future use of the pageview data?
  • Are you comfortable with us making a more formal recommendation to staff in the Foundation (to avoid using the uncorrected data, and to use corrected data only when a correction is possible)?

In this task, we are documenting @MGerlach's recommendations so we can prioritize and act on them, breaking them down into other tasks if needed.

Are the estimation techniques we recommend to account for the pageview data loss reasonable? Do you have any objections to our making this recommendation for future use of the pageview data?

  • Yes (the estimation techniques are reasonable). No (no objections to the recommendation).
  • The approach to estimate the data loss makes sense to me from what I understand. You calculate the fraction of traffic carried by the affected nodes in a “normal” month and assume that those nodes would have carried the same fraction of traffic during the data-loss period. You provide evidence for this assumption with a test-retest comparison against another “normal” month. You then calculate the average as well as the min/max range across two different periods (in which different nodes were affected). (A sketch of this calculation appears after this list.)
  • I agree with the set of recommendations:
    • Avoiding data from that period, if possible, is a good first recommendation.
    • If using that data, one should correct by an average factor.
  • I would suggest adding an explicit comment that this only captures an average and that the data loss is not completely homogeneous.
    • In this context, I consider Method 1 (the range) not an alternative method but more in-depth information about the “error bar” associated with the average correction factor.
    • More importantly, perhaps, we should explicitly point out that the data loss is not homogeneous with respect to geographical region (and by extension the project). As I understand it, the loss can vary from 0% to almost 20%, which is a much stronger effect than the range in Method 1 would suggest.
    • Are there other factors along which the average data loss varies strongly that we should call out?
  • I like the dashboards in the Resources on slide 13 for estimating the data loss. Could one join the two dashboards into a single dashboard that allows the same custom filters and then returns the average for the two periods, as you describe for Method 2 (and potentially also the min/max range)?
  • You mention some of the affected datasets (e.g. webrequest). If possible, we should try to call out all affected datasets. For example, someone using the unique-devices dataset might not anticipate that it is probably affected too (I assume, but am not 100% sure). What I can think of:
  • I hope you are not going to remove the public pageviews data for that period. While the numbers are not 100% reliable, I would rather see us spend effort on publicly communicating the data loss and the strategies to correct for it (as you present above). I don't know to what extent the numbers about the average/aggregated data loss (similar to what is in the dashboards) could be shared publicly.
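
To make the approach above concrete, here is a minimal sketch of the calculation in Python. All traffic shares and counts are hypothetical placeholders (the real analysis runs against internal webrequest data), so this illustrates the arithmetic only, not the actual numbers.

```
# Hypothetical sketch of the correction approach described above.
# All traffic shares and counts are invented placeholders.

# Share of total traffic carried by the affected nodes, measured in two
# different "normal" (unaffected) months (the test-retest comparison).
affected_share_month_a = 0.040  # hypothetical
affected_share_month_b = 0.045  # hypothetical

# Assumption: the affected nodes would have carried the same share of
# traffic during the data-loss period.
avg_share = (affected_share_month_a + affected_share_month_b) / 2

# Pageviews actually recorded during the data-loss period.
observed = 1_000_000  # hypothetical

# Average correction factor (as in Method 2): scale observed counts up
# by the share of traffic assumed lost.
corrected = observed / (1 - avg_share)

# Min/max range (as in Method 1): the "error bar" around the average.
low = observed / (1 - min(affected_share_month_a, affected_share_month_b))
high = observed / (1 - max(affected_share_month_a, affected_share_month_b))

print(f"corrected ~ {corrected:,.0f} (range {low:,.0f} to {high:,.0f})")
```

Note that a single average factor hides the regional heterogeneity called out above: where the loss varies from 0% to almost 20% by region, per-region correction factors would give noticeably different results.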

Are you comfortable with us making a more formal recommendation to staff in the Foundation (to avoid using the uncorrected data, and to use corrected data only when a correction is possible)?

  • Yes. I believe an explicit recommendation is necessary because i) pageviews is one of the few metrics we have and is thus very commonly used; ii) the impact of the data loss is so severe that one might draw wrong conclusions; and iii) the underlying technical issue is not trivial (particular nodes serving particular traffic during a particular time), so it might not be clear how to correct for it (and different people could end up using different methodologies, leading to inconsistencies). The methodology you used is sound and, from what I understand, is the best we can do to correct for it.

In one of the threads you also asked: Would you want to share any of this externally with researchers?

  • Yes, definitely. For example, via the wiki-research-l mailing list. Wikipedia's pageviews are used extensively by researchers, and I don't think there is sufficient awareness about this issue (and how to potentially correct for it).

Event Timeline

mpopov triaged this task as High priority. Nov 29 2022, 6:09 PM

From T323182#8485334:

@odimitrijevic and I have drafted a data quality report about the pageview data loss, and Olja is working on publishing it to Wikitech.

For the Wikitech documentation, we are not providing recommendations for correcting the data, because the data we use to estimate the impact is not publicly available. There is a request for a dataset with corrected pageview data (https://phabricator.wikimedia.org/T310732), for which we would use the approach recommended in Pageview Data Loss Estimation Recommendations (internal doc, T314197).

We are recommending that people avoid using data from the data loss period. The Wikitech documentation also lists the other known affected datasets (including unique devices, as @MGerlach mentioned). We also flag the regional differences in the impact of the data loss (to Martin's point about the loss not being completely homogeneous); a hypothetical illustration of why per-region factors matter follows below.
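
To illustrate that point about regional differences, here is a hypothetical sketch of applying per-region correction factors to an aggregated pageview table, again in Python (pandas). The regions, loss fractions, and column names are invented for illustration and do not reflect the actual corrected dataset proposed in T310732.

```
import pandas as pd

# Hypothetical per-region loss fractions during the data-loss period,
# illustrating the 0% to ~20% spread mentioned above (values invented).
loss_by_region = pd.DataFrame({
    "region": ["region_a", "region_b", "region_c"],
    "loss_fraction": [0.00, 0.08, 0.19],
})

# Hypothetical observed pageviews aggregated by region.
pageviews = pd.DataFrame({
    "region": ["region_a", "region_b", "region_c"],
    "observed_views": [500_000, 300_000, 200_000],
})

# Join and scale each region's counts by its own correction factor,
# rather than by one global average.
corrected = pageviews.merge(loss_by_region, on="region")
corrected["corrected_views"] = (
    corrected["observed_views"] / (1 - corrected["loss_fraction"])
)

print(corrected[["region", "observed_views", "corrected_views"]])
```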

@MGerlach - @odimitrijevic and I have published information about the data loss on Wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues/2021-06-04_Traffic_Data_Loss. This is part of a new space on Wikitech where we have published reports & recommendations about what to do with the data: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Data_Issues