Final Vetting of Family Wide unique devices data
Closed, ResolvedPublic

Event Timeline

@Tbayer : Do you have a status update on this? Thanks!

I had been looking into this from various angles before Wikimania: reading through the intricate investigations at T143928 (and the bugs that were uncovered, also regarding the existing per-domain uniques) to understand how we ended up with the final version of the queries, reading through the new documentation (fixing various things there myself and leaving some notes on the talk page), and doing some plausibility checks on the data itself. The monthly numbers for Wikipedia in particular look roughly plausible and consistent with the lower-bound estimates we have been using previously (derived from the per-domain data), so we have started quoting them as preliminary data for public purposes. I noticed a bug affecting the data for some sister sites (not Wikipedia), which I just filed as T174640.
I still plan to do some further consistency checks before closing this task. In particular, check that for all project families, countries and months/days,

unique_devices_project_wide_monthly.uniques_estimate  >= MAX(unique_devices_per_domain_monthly.uniques_estimate)

with the maximum taken over all language versions (also for the sum of desktop+mobile). Presumably there will be some smaller countries and families where this is violated just due to some corner cases, but in general it should hold if the method overall is valid. From @JAllemandou I understand that we haven't run consistency checks of this kind so far, so it seems worthwhile.
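A minimal sketch of how this check could be run once both tables are exported as rows. Only the two table names and the `uniques_estimate` column come from the inequality above; the row layouts (family, language, site, country, month), the function names and the `sum_sites` switch are assumptions made for illustration, not the actual schema or query.

```python
from collections import defaultdict


def largest_language_estimate(per_domain_rows, sum_sites=False):
    """Largest per-domain estimate for each (family, country, month).

    per_domain_rows: (family, language, site, country, month, uniques_estimate),
    an assumed layout for rows of unique_devices_per_domain_monthly.
    With sum_sites=True, desktop and mobile are summed per language first.
    """
    totals = defaultdict(int)
    for family, language, site, country, month, estimate in per_domain_rows:
        if sum_sites:
            totals[(family, country, month, language)] += estimate
        else:
            totals[(family, country, month, language, site)] = estimate

    best = {}
    for key, value in totals.items():
        group = key[:3]  # (family, country, month)
        best[group] = max(best.get(group, 0), value)
    return best


def find_violations(project_wide_rows, per_domain_rows, sum_sites=False):
    """Return groups where the family-wide estimate is below the per-domain bound.

    project_wide_rows: (family, country, month, uniques_estimate), an assumed
    layout for rows of unique_devices_project_wide_monthly.
    """
    best = largest_language_estimate(per_domain_rows, sum_sites)
    violations = []
    for family, country, month, estimate in project_wide_rows:
        bound = best.get((family, country, month))
        if bound is not None and estimate < bound:
            violations.append(((family, country, month), estimate, bound))
    return violations
```

Running `find_violations` once with `sum_sites=False` and once with `sum_sites=True` covers both variants of the inequality; the returned list would then be inspected to see whether violations are confined to small countries and families.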

Ping @Tbayer what is the status of this vetting?

Is there any work to be done here before we can close this task?

I'm still planning to conduct at least the steps outlined above, and will then close this task. As mentioned, the monthly numbers for Wikipedia look roughly plausible, so the task has become less timely, at least from the perspective of the core metrics we report every month. There are still some open questions, though, some of them more relevant now that we have started to also report year-over-year changes. I also found an interesting academic paper, the relevant aspects of which I'm going to summarize here.

Apropos, an answer to the question asked here recently would be useful: T167005#4238082

ping @Tbayer, do you think you could get to this task in the next month?

(Also discussed this in person with @Nuria:) Yes, I'll try to complete this by the end of this month. Sorry that this has been taking a while; it has been a high-importance but low-urgency task.

Still working on this. Following up on the discussion at T167005#4238082, I am interested in the contribution of devices whose "fresh" session includes more than 1 hit, i.e. those that were added to the project-specific uniques in the correction of T167005. It seems this correction increased the "offset" part of (project-specific) monthly uniques on enwiki by about 60%:

enwiki monthly uniques 2016-01.31..2018-11-01 Turnilo.png

@Tbayer: do you have some more comments related to vetting of this metric or is this the only one?

By now I have made several other observations that I need to flesh out and post, but it looks like most of them are about the unique devices metric in general (i.e. also apply to the per-project version). So as not to block the rollout work further, I'll just run the query envisaged in T169550#3568823 and then close this task.