Investigate wikimedia and wikidata unique devices per-project-family overcount offset
Closed, Resolved | Public | 5 Estimated Story Points

Description

While investigating T299559 I found that the unique-devices counts for the wikidata and wikimedia project-families are a lot bigger than the sum of the per-domain unique-devices counts over all sub-domains of wikidata/wikimedia, when the contrary is expected.
This seems to be due to the offset part of the unique-devices metric, which accounts for users who made a single request to the domain and therefore have no last-visited cookie.
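
For readers unfamiliar with the offset, here is a minimal sketch of an estimate built as a cookie-based count plus an offset. The `Request` record, its fields and the function name are hypothetical, and the actual job's filtering rules are not reproduced; the sketch only assumes that every request without a last-visited cookie lands in the offset, which is why a misclassification of such requests can inflate the total.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical, simplified request record (not the real webrequest schema)."""
    domain: str
    last_access: Optional[date]  # date stored in the last-visited cookie, None if absent
    day: date                    # day on which the request was made


def estimate_unique_devices(
    requests: Iterable[Request], domain: str, day: date
) -> Dict[str, int]:
    """Toy estimate for one domain and day: cookie-based count plus offset."""
    cookie_based = 0
    offset = 0
    for r in requests:
        if r.domain != domain or r.day != day:
            continue
        if r.last_access is None:
            # Assumption: no last-visited cookie means a device seen only once.
            offset += 1
        elif r.last_access < day:
            # Cookie from an earlier day: first request of this device today.
            cookie_based += 1
        # Cookie dated today: the device was already counted, so skip it.
    return {
        "cookie_based": cookie_based,
        "offset": offset,
        "total": cookie_based + offset,
    }
```

In this toy model, any request wrongly treated as cookie-less and countable adds one device to the offset, so the offset is the part most sensitive to classification errors.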

Event Timeline

@JAllemandou thank you for finding this! What do you have in mind for Product Analytics to investigate? I don't think we have much understanding of the inner workings of unique device counting, so I'm not sure we will be able to help much.

odimitrijevic subscribed.

We need to document the unique devices metrics and establish ownership.

ldelench_wmf moved this task from Triage to Tracking on the Product-Analytics board.

@odimitrijevic and I discussed the priority for this and do not think it should be prioritized above current work. We're not regularly reporting on Wikidata unique devices, in part because we are aware that the data and definitions should be further explored.

Data Engineering are the stewards for the existing unique devices definition. At some point in the future, we would like to revisit the definition and measurement of unique devices to account for changes in technology that may be impacting the measurement of unique devices. Product Analytics would lead the process in partnership with Data Engineering and become the stewards for future definitions. However, we do not currently have the capacity to take this on.

@kzimmerman let's discuss prioritizing. A significantly larger overcount may exist for the wikimedia project family.

JAllemandou renamed this task from wikidata unique devices per-project-family overcount offset to wikimedia and wikidata unique devices per-project-family overcount offset. Dec 6 2022, 7:28 PM
JAllemandou updated the task description.
JArguello-WMF raised the priority of this task from Low to High.
JArguello-WMF set the point value for this task to 5. Dec 7 2022, 5:06 PM

Investigation results:

The overcount affecting unique_devices_per_project_family when compared with unique_devices_per_domain is due to an issue with how we check whether webrequests are Special: pages, which gets mixed up with the choice between using only pageviews and using pageviews + redirect-to-pageviews.
The impact on wikidata is not huge: per_domain and per_project_family values have the same order of magnitude (a lot of Special:CentralAutoLogin pages).
The impact on the wikimedia project-family is huge: multiplied by 20 between per_domain daily and per_project_family daily (a BIG lot of banners on metawiki).
BUT: The wikimedia project family is not relevant as is, and should only be provided through per-domain for projects such as commonswiki - we remove the wikimedia project-family from the numbers we publish to the public.
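
The comparison behind these numbers can be sketched as a simple consistency check: a project-family count is expected to be no larger than the sum of its per-domain counts, so any family exceeding that sum is suspicious. The function name and the input shapes below are assumptions for illustration, not the actual tables or job code.

```python
from typing import Dict, List, Tuple


def find_overcounted_families(
    per_family: Dict[str, int],
    per_domain: Dict[str, Dict[str, int]],
) -> List[Tuple[str, int, int, float]]:
    """Return (family, family_count, domain_sum, ratio) for suspicious families.

    per_family maps a project family to its unique-devices count.
    per_domain maps a project family to {domain: unique-devices count}.
    """
    flagged = []
    for family, family_count in per_family.items():
        domain_sum = sum(per_domain.get(family, {}).values())
        # A device seen on several domains counts once per domain but only
        # once per family, so family_count should not exceed domain_sum.
        if domain_sum > 0 and family_count > domain_sum:
            flagged.append(
                (family, family_count, domain_sum, family_count / domain_sum)
            )
    return flagged


# Made-up numbers, for illustration only.
example_family = {"wikimedia": 2000, "wikipedia": 900}
example_domain = {
    "wikimedia": {"commons.wikimedia.org": 70, "meta.wikimedia.org": 30},
    "wikipedia": {"en.wikipedia.org": 600, "fr.wikipedia.org": 400},
}
suspicious = find_overcounted_families(example_family, example_domain)
```

With these invented inputs only the wikimedia family is flagged, since the wikipedia family count stays below the sum of its domains.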

For the Special: pages classification problem a new task has been created: https://phabricator.wikimedia.org/T325544

Thank you @JAllemandou! That explains things clearly. I have added the follow up work to the planning board.

JAllemandou renamed this task from wikimedia and wikidata unique devices per-project-family overcount offset to Investigate wikimedia and wikidata unique devices per-project-family overcount offset. Dec 19 2022, 4:01 PM

I believe this is the same problem discussed in https://phabricator.wikimedia.org/T276472. Can they both be closed at the same time?

I'm merging the older one in this one.

Another finding: the WMF-Last-Access-Global cookies are not set for wikimedia projects. So not only do we have wrong numbers for the offsets due to the Special: pages, but we also have wrong numbers for cookie-computed values.
This makes it even clearer: we shouldn't use project-family numbers for the wikimedia family!
I suggest we remove the row altogether from the data with a comment.

Thanks @JAllemandou for getting to the bottom of this! So our top-level metrics then would be wikipedia, wiktionary, commonswiki etc.? And none of the alternatives make sense:

  • Filter out the projects lacking WMF-Last-Access-Global cookies from the top-level wikimedia metric -- but maybe this would make the metric misleading because it'd be missing major wikimedia projects?
  • Add support for WMF-Last-Access-Global on those wikis? But maybe this is more complicated than I think or wasn't set for a good reason on those wikis?

I'm assuming the challenge with Special: pages is fixable. For the record, I think it's reasonable to drop wikimedia if the technical issues aren't easily surmountable. It's tempting to report a wikimedia unique devices metric, and I feel like I often see requests from Comms etc. for this sort of number. But in reality I assume that if you're directly visiting a wikimedia site, you're likely viewing Wikipedia at some point that month too, so it doesn't really give us much new information to also add in wiktionary, commonswiki, etc. And, as you're pointing out, it opens up the opportunity for bugs and misleading data as we double-count devices.