Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie
Closed, Resolved | Public | 3 Estimated Story Points

Description

The value "wikimedia" in the project_family field [1] of these tables does not correspond to an actual project family. Rather, it measures uniques for an accidental agglomerate of multilingual wikis including Commons, Wikispecies, Meta, Incubator, various chapter wikis etc., mixing content and non-content domains.[2]

The reason appears to be that in T138027 the WMF-Last-Access-Global cookie was indiscriminately attached to the second-level domain for all projects, instead of to commons.wikimedia.org, species.wikimedia.org, meta.wikimedia.org etc. in the case of these non-language domains.

To resolve this bug, we should scope the WMF-Last-Access-Global cookie to the third-level domain in these cases and adjust the queries accordingly.

(Alternatively, we could remove the invalid "wikimedia" data from the tables by excluding the *.wikimedia.org domains in the queries and instead rely on the sum of desktop+mobile per-domain uniques as a substitute for these projects. But to avoid setting unnecessary cookies, we would then want to remove the WMF-Last-Access-Global cookies entirely, so either way VCL changes are unavoidable.)

[1]
e.g.:
SELECT year, month, project_family,
SUM(uniques_estimate) AS uniques
FROM wmf.unique_devices_per_project_family_monthly
WHERE year=2017 AND month = 7
GROUP BY year, month, project_family
ORDER BY year, month, uniques DESC LIMIT 1000; 

year	month	project_family	uniques
2017	7	wikipedia	1361541637
2017	7	wiktionary	42151695
2017	7	wikibooks	10034165
2017	7	wikimedia	8246011
2017	7	wikiquote	7673965
2017	7	wikisource	5342708
2017	7	wikivoyage	1823467
2017	7	wikidata	1649678
2017	7	wikiversity	1527796
2017	7	wikimediafoundation	1194439
2017	7	mediawiki	633302
2017	7	wikinews	512712
12 rows selected (110.212 seconds)
[2]
cf.:

SELECT DISTINCT domain 
FROM wmf.unique_devices_per_domain_monthly
WHERE year=2017 AND month = 7
AND domain LIKE '%wikimedia.org';

domain
ar.m.wikimedia.org
ar.wikimedia.org
bd.m.wikimedia.org
bd.wikimedia.org
be.m.wikimedia.org
be.wikimedia.org
br.m.wikimedia.org
br.wikimedia.org
ca.m.wikimedia.org
ca.wikimedia.org
cn.m.wikimedia.org
cn.wikimedia.org
co.m.wikimedia.org
co.wikimedia.org
commons.m.wikimedia.org
commons.wikimedia.org
dk.m.wikimedia.org
dk.wikimedia.org
ec.wikimedia.org
ee.m.wikimedia.org
ee.wikimedia.org
fi.m.wikimedia.org
fi.wikimedia.org
il.wikimedia.org
incubator.m.wikimedia.org
incubator.wikimedia.org
meta.m.wikimedia.org
meta.wikimedia.org
mk.m.wikimedia.org
mk.wikimedia.org
mx.m.wikimedia.org
mx.wikimedia.org
nl.m.wikimedia.org
nl.wikimedia.org
no.m.wikimedia.org
no.wikimedia.org
nz.m.wikimedia.org
nz.wikimedia.org
outreach.m.wikimedia.org
outreach.wikimedia.org
pl.m.wikimedia.org
pl.wikimedia.org
pt.m.wikimedia.org
pt.wikimedia.org
rs.m.wikimedia.org
rs.wikimedia.org
ru.m.wikimedia.org
ru.wikimedia.org
se.m.wikimedia.org
se.wikimedia.org
species.m.wikimedia.org
species.wikimedia.org
tr.m.wikimedia.org
tr.wikimedia.org
ua.m.wikimedia.org
ua.wikimedia.org
wb.m.wikimedia.org
wb.wikimedia.org
58 rows selected (28.176 seconds)

Event Timeline

Regarding prioritization: while this is a clear bug, it does not affect what is, from the Readers team's perspective, the most important part of the global uniques data, i.e. the numbers for Wikipedia. On the traffic side, I guess the downside of setting some unnecessary cookies on views to a number of smaller projects can be tolerated for some time.

The model's a bit different in the wikimedia.org case; I'm not even sure there's a rational answer here. Can we get some clear (e.g. pseudo-code level?) guidance on what the desired behavior would be in the wikimedia.org case?

What we're doing today for Set-Cookie purposes is stripping everything left of the 2LD (basically, taking the last two labels of the Host header) to create unique sets, and it's important that your processing of the data use the same rule (e.g. counting all cookies seen for Host: en.m.wikipedia.org to be in the wikipedia.org bucket). So the Set-Cookie processing is setting the scope of the cookie, and the analytics processing takes the same matching scopes into account (if the scopes differed, it would really mess up the stats, as singular WMF-LAG cookies would leak between analytics buckets, or an analytics bucket would contain multiple independent sets of WMF-LAG cookies from different cookie scopes).
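The "last two labels of the Host header" rule described above can be sketched in Python as follows. This is only an illustration: the function name `cookie_scope` is hypothetical, and the production logic lives in VCL, which is not shown in this task.

```python
def cookie_scope(host: str) -> str:
    """Return the last two DNS labels of a Host header value.

    Hypothetical helper illustrating the scoping rule; not the real VCL.
    """
    # Drop an optional port and a trailing dot, and normalise case.
    hostname = host.split(":", 1)[0].rstrip(".").lower()
    labels = hostname.split(".")
    return ".".join(labels[-2:])


for h in (
    "en.m.wikipedia.org",
    "de.wikipedia.org",
    "species.m.wikimedia.org",
    "commons.wikimedia.org",
):
    print(h, "->", cookie_scope(h))
```

Under this rule species.m.wikimedia.org and commons.wikimedia.org land in the same wikimedia.org bucket, which is the agglomerate the task description complains about. The analytics side would need to apply the same function to the Host value so that cookie scope and bucket agree.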

While analytics processing could contain arbitrarily-complex scoping for the wikimedia.org case, the cookie Domain attribute cannot. The scope of a Set-Cookie header has to be a singular domain in the DNS sense. So things like wikimedia.org, wikipedia.org, or species.wikimedia.org are all acceptable values, but there's no way to slice up the cookie domain-sets to have one set that covers species.m.wikimedia.org+species.wikimedia.org, and another set that covers "all the projects in wikimedia.org other than species+species.m".

We could today set the cookie scopes at the full Host header value for Set-Cookie+Analytics purposes within wikimedia.org, and you'd end up with separate data sets for all 58 domains listed in the second paste of the ticket description. That's probably the best we can do easily, and there might not be any statistically-valid way to re-combine those results at the end into desktop+mobile per-project. I do wonder, though, given how low-traffic some of the wikimedia.org projects are, if this data wouldn't become "effectively PII" and should be discarded anyways.

As with many such things, the inconsistent use of the m-dot subdomain (well, also any use of this m-dot subdomain pattern) is a major thorn here. If we truly mapped the en.m.wikipedia.org model faithfully into the wikimedia.org space of projects, the mobile species site is currently incorrectly named and should actually be m.species.wikimedia.org (ditto for commons). But that aside, in general the m-dot hostnames are bad practice. There probably should be, orthogonal to this ticket, a proposal/design idea somewhere in the org to work on eliminating the m-dot concept. UA detection (+ cookie overrides as today) and cache-splitting on the main desktop domain should suffice with far less complexity.
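To make the m-dot naming issue concrete, here is a small Python sketch that pairs each mobile hostname with its desktop counterpart. It assumes the m-dot label always sits directly left of the 2LD in hostnames of four or more labels (as in en.m.wikipedia.org and species.m.wikimedia.org); the helper name `desktop_host` is hypothetical, and the sample is a handful of hostnames taken from the second paste in the description. It only groups hostnames and makes no claim that per-domain uniques can be validly summed across each pair.

```python
from collections import defaultdict


def desktop_host(host: str) -> str:
    """Map a mobile hostname to its desktop form by dropping the m-dot label.

    Assumption: the "m" label, when present, is the third label from the
    right and the hostname has at least four labels.
    """
    labels = host.lower().split(".")
    if len(labels) >= 4 and labels[-3] == "m":
        del labels[-3]
    return ".".join(labels)


# A few hostnames from the per-domain listing in the task description.
domains = [
    "commons.m.wikimedia.org",
    "commons.wikimedia.org",
    "species.m.wikimedia.org",
    "species.wikimedia.org",
    "ec.wikimedia.org",
    "il.wikimedia.org",
]

# Group desktop and mobile variants under the desktop hostname.
variants = defaultdict(list)
for d in domains:
    variants[desktop_host(d)].append(d)

for project, hosts in sorted(variants.items()):
    print(project, hosts)
```

Some projects in the listing (ec.wikimedia.org and il.wikimedia.org, for example) appear with no mobile variant for that month, so any recombination step has to tolerate groups of size one.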

I think we should not compute global unique devices for wikimedia.org domains at all and should rely on per-domain uniques instead. We will remove the *.wikimedia.org domains from the computation, as the combined figure is meaningless. The traffic team therefore shouldn't need to change anything other than removing the Global cookie setting from the *.wikimedia.org domains.
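As pseudo-code-level guidance, the behavior proposed here might look like the following Python sketch. The names `global_cookie_scope` and `NO_GLOBAL_COOKIE_SCOPES` are hypothetical, and this is not the VCL that was eventually merged; it only illustrates "same 2LD rule everywhere, except no global cookie for wikimedia.org".

```python
from typing import Optional

# Assumption: wikimedia.org is the only 2LD excluded from the global cookie.
NO_GLOBAL_COOKIE_SCOPES = {"wikimedia.org"}


def global_cookie_scope(host: str) -> Optional[str]:
    """Return the Domain for WMF-Last-Access-Global, or None to skip it."""
    # Drop an optional port and a trailing dot, and normalise case.
    hostname = host.split(":", 1)[0].rstrip(".").lower()
    # Last two labels of the Host header, as in the existing rule.
    scope = ".".join(hostname.split(".")[-2:])
    if scope in NO_GLOBAL_COOKIE_SCOPES:
        return None  # rely on the per-domain cookie only
    return scope


for h in ("en.m.wikipedia.org", "commons.wikimedia.org", "www.mediawiki.org"):
    print(h, "->", global_cookie_scope(h))
```

The analytics queries would apply the same exclusion, so that no "wikimedia" project family is computed from hosts that never receive the global cookie.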

Thanks @BBlack for the detailed explanations :)
As for using the full host header value for wikimedia.org, this is already available as part of the per-domain last-access-uniques cookies I think.

ema triaged this task as Medium priority. Sep 28 2017, 2:52 PM

Change 383171 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Fix oozie jobs loading druid proj-family uniques

https://gerrit.wikimedia.org/r/383171

Change 383171 merged by Nuria:
[analytics/refinery@master] Fix oozie jobs loading druid proj-family uniques

https://gerrit.wikimedia.org/r/383171

The change above doesn't change the behavior of cookies, but it at least removes the wikimedia project family from those available in Druid. The only place it will still be visible is in Hive (it was already removed from the to-be-externally-published datasets).

Change 383353 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] WMF-Last-Access-Global: not for wikimedia.org

https://gerrit.wikimedia.org/r/383353

Nuria set the point value for this task to 3. Oct 10 2017, 3:11 PM

Change 383353 merged by BBlack:
[operations/puppet@production] WMF-Last-Access-Global: not for wikimedia.org

https://gerrit.wikimedia.org/r/383353

I'm assuming there's nothing left to do here, re-open otherwise!