
Unique devices data uses non-standard domains for Wikidata, Wikifunctions, and MediaWiki.org
Closed, Resolved | Public

Description

The backfill of unique device data for T401666 changed the domains used to identify Wikidata, Wikifunctions, and MediaWiki.org in the Data Lake datasets (wmf_readership.unique_devices_per_*).

Previously, the canonical main and mobile domains were used (e.g. www.wikidata.org and m.wikidata.org). Now, non-canonical versions of the main domains are used (e.g. wikidata.org, without the leading www.).

This adds new friction for folks querying these datasets:

  • They must learn and then remember that unique devices data now uses these non-canonical domains.
  • If they wish to join the data with other datasets using the domain as the wiki identifier (which will be increasingly common, as the data modelling guidelines recommend it as the primary wiki identifier), they must manually handle these cases, which will not join correctly. The canonical wiki dataset could make this a little easier by adding a new unique_device_domain column, but even so it will be cumbersome.
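
As a rough illustration of the manual handling described above, the sketch below maps the three non-canonical domains back to their canonical forms before joining. The mapping, the helper name `to_canonical_domain`, and the sample rows are all made up for illustration; none of them come from the actual datasets.

```python
# Hypothetical hand-maintained mapping for the three affected wikis.
NON_CANONICAL_TO_CANONICAL = {
    "wikidata.org": "www.wikidata.org",
    "wikifunctions.org": "www.wikifunctions.org",
    "mediawiki.org": "www.mediawiki.org",
}


def to_canonical_domain(domain: str) -> str:
    """Return the canonical form of a domain, leaving other domains unchanged."""
    return NON_CANONICAL_TO_CANONICAL.get(domain, domain)


# Invented sample rows: (domain, estimate) from unique devices, and a second
# dataset keyed by canonical domain.
unique_devices = [("wikidata.org", 100), ("en.wikipedia.org", 200)]
other_dataset = {
    "www.wikidata.org": "Wikidata",
    "en.wikipedia.org": "English Wikipedia",
}

# A naive join on the raw domain silently drops the Wikidata row.
naive = [(d, n, other_dataset[d]) for d, n in unique_devices if d in other_dataset]

# Normalizing the key first keeps every row.
fixed = [
    (to_canonical_domain(d), n, other_dataset[to_canonical_domain(d)])
    for d, n in unique_devices
    if to_canonical_domain(d) in other_dataset
]
```

A hard-coded mapping like this is exactly the kind of per-query workaround the task is trying to avoid; it is shown only to make the friction concrete.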

Details

Related Changes in Gerrit:
Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| Fix unique_devices canonical domain | repos/data-engineering/airflow-dags!1743 | joal | update_unique_devices_per_domain | main |

Event Timeline

While I understand the non-canonical concern, we've been using the non-www domain for those domains since the beginning of unique devices, at least in Cassandra.
The fix is not cheap: not only would we have to change and backfill unique devices again, but we'd also need to change Wikistats and reload Cassandra, deleting the old rows (not easy).
I suggest we keep it as is, even if not canonical. Let me know if that feels OK to you :)

I'm not concerned about Cassandra at all: it's never used directly by analysts, nothing has changed, and there's no friction for AQS users as the API already accepts "wikidata", "wikidata.org", or "www.wikidata.org".

But I do think this is something that should be fixed eventually in the Data Lake tables, although I can understand if it needs to be deferred indefinitely (as so many bugs always are 😁) because the fix is expensive. I've tried to explain my reasoning more in the description.

How difficult would it be simply to alter the Data Lake tables and change the Cassandra loading to remove the "www." in the future? We must have already been doing something like that as previously the Data Lake data did have the "www." for the main domain, but it never did in Cassandra.

> How difficult would it be simply to alter the Data Lake tables and change the Cassandra loading to remove the "www." in the future? We must have already been doing something like that as previously the Data Lake data did have the "www." for the main domain, but it never did in Cassandra.

We have changed the source-field used for unique-devices computation when adapting to the .m subdomain removal. We now use the project field used in pageviews, which doesn't have the www for the projects you mention. The fix you're demanding would impact not only unique-devices but also pageviews and related datasets. We could hard-code adding www. as a prefix for the domain you mentioned when loading unique-devices, but this would not be future-proof for new domains formatted this way.
Let me know what you think, but to me the cost of change is too high in this case.

Ottomata subscribed.

Grooming: Joseph, we assigned this to you, feel free to resolve or decline as you see fit. Thank you!

> We have changed the source-field used for unique-devices computation when adapting to the .m subdomain removal. We now use the project field used in pageviews, which doesn't have the www for the projects you mention.

Thank you, this is helpful context!

> The fix you're demanding

Just to be clear, I'm not demanding anything. I'm requesting a fix I think would be useful, and as I understood it, you were kindly taking the time to discuss it and explain your reasoning, even though you have the authority to make the final decision and are already very inclined to say no.

Please let me know if you feel I have overstepped a bound here.

> would impact not only unique-devices but also pageviews and related datasets.

I certainly agree we should not change the pageview datasets; their use of a unique code scheme is not ideal, but it has been that way for a long time and the costs of the change would be greater than the benefits. That is different from the case here, where I'm suggesting staying consistent with the previous identification scheme (although it may be, as you say, too expensive to do so).

> We could hard-code adding www. as a prefix for the domain you mentioned when loading unique-devices, but this would not be future-proof for new domains formatted this way.

Yes, hard-coding does not feel like a great solution. I can offer the alternative that canonical_data.wikis (which my team is committed to maintaining) has both the pageview code and the canonical domain name, so it could be used to get the correct domain even for wikis which are added in the future.

> Let me know what you think, but to me the cost of change is too high in this case.

I would like to see this be done for the reasons I mentioned, but I agree your time is valuable and you are the expert on how much work this would require.

To my naive eye, it seems like it might be relatively simple to (1) leave pageview data unchanged, (2) use canonical data to translate between the pageview code and the canonical domain, (3) take advantage of Iceberg to alter the existing unique device data relatively easily, and (4) strip any leading www. when loading to Cassandra. But, obviously, I could be wildly wrong about that.
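
Steps (2) and (4) of this plan could look roughly like the sketch below, assuming a lookup from pageview project code to canonical domain has already been read from canonical data. The dictionary contents and both function names are hypothetical; the real table's column names are not given in this thread.

```python
# Hypothetical extract of canonical data: pageview project code -> canonical domain.
CANONICAL_DOMAINS = {
    "wikidata": "www.wikidata.org",
    "en.wikipedia": "en.wikipedia.org",
}


def canonical_domain(project: str) -> str:
    """Step (2): translate a pageview project code to its canonical domain."""
    return CANONICAL_DOMAINS[project]


def cassandra_domain(domain: str) -> str:
    """Step (4): strip a leading "www." before loading to Cassandra."""
    prefix = "www."
    return domain[len(prefix):] if domain.startswith(prefix) else domain


# The Data Lake would hold the canonical form, while Cassandra keeps the
# non-www form it has always had.
data_lake_value = canonical_domain("wikidata")
cassandra_value = cassandra_domain(data_lake_value)
```

Because the lookup is data rather than code, a wiki added to canonical data later would be handled without touching the job, which is the advantage over hard-coding a "www." prefix.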

I don't have any other potentially-useful input I can provide, so please feel free to make your decision now. I won't be outraged if you decline this.

> > The fix you're demanding
>
> Just to be clear, I'm not demanding anything. I'm requesting a fix I think would be useful, and as I understood it, you were kindly taking the time to discuss it and explain your reasoning, even though you have the authority to make the final decision and are already very inclined to say no.

I'm very sorry for the misunderstanding here. I made a vocabulary mistake with the French false-friend "demander". I was really not meaning that you were forcing for the change, I'll be careful when using the word "demand" in future.

> Please let me know if you feel I have overstepped a bound here.

Absolutely no overstepping a boundary here. It's great you ask for changes, and that we can talk about them :)

> To my naive eye, it seems like it might be relatively simple to (1) leave pageview data unchanged, (2) use canonical data to translate between the pageview code and the canonical domain, (3) take advantage of Iceberg to alter the existing unique device data relatively easily, and (4) strip any leading www. when loading to Cassandra. But, obviously, I could be wildly wrong about that.

With the use of the canonical table it's way better (no hard-coding). I'll apply the plan you're mentioning just above :)

> I'm very sorry for the misunderstanding here. I made a vocabulary mistake with the French false-friend "demander". I was really not meaning that you were forcing for the change, I'll be careful when using the word "demand" in future.

Now I'm very sorry too! I should have checked with you before assuming so much based on a single word, especially when I've known you for a long time and you've been nothing but kind 😊

> > To my naive eye, it seems like it might be relatively simple to (1) leave pageview data unchanged, (2) use canonical data to translate between the pageview code and the canonical domain, (3) take advantage of Iceberg to alter the existing unique device data relatively easily, and (4) strip any leading www. when loading to Cassandra. But, obviously, I could be wildly wrong about that.
>
> With the use of the canonical table it's way better (no hard-coding). I'll apply the plan you're mentioning just above :)

Oh, wonderful! I'm so glad it's helpful.

FWIW, I would implement this by taking the domain from canonical data when found, and falling back to just adding ".org" otherwise. The main reason is that we have unique device data for wikimediafoundation.org (where the canonical version in the address bar has no "www."), which is not found in canonical data because it's not a wiki. There are also a handful of odd domains with a tiny bit of data; they seem to be a mix of real domains that we don't want and invalid domains that slipped through validation. None of this really needs to be worried about (although it wouldn't hurt to just delete that data), but adding the ".org" would also do the most reasonable thing in any future cases like that. Here's the full list:

| domain | all-time total of monthly unique estimates | latest month present | notes |
| --- | --- | --- | --- |
| uz.wikimedia.org | 1225075.0 | 2025-01-01 | T382730 |
| mo.wikipedia.org | 136889.0 | 2017-12-01 | |
| wikipedia.org | 133538.0 | 2020-02-01 | |
| als.wiktionary.org | 9346.0 | 2017-12-01 | |
| za.wikimedia.org | 8988.0 | 2025-01-01 | T382730 |
| als.wikiquote.org | 6126.0 | 2017-12-01 | |
| als.wikibooks.org | 4867.0 | 2017-12-01 | |
| mo.wiktionary.org | 968.0 | 2017-12-01 | |
| slo.wikimedia.org | 359.0 | 2025-08-01 | |
| wikimedia.org | 115.0 | 2025-08-01 | |
| yue.wikibooks.org | 98.0 | 2021-06-01 | |
| noc.wikimedia.org | < 25 | 2025-07-01 | |
| download.mediawiki.org | < 25 | 2025-07-01 | |
| zh.wikidata.org | < 25 | 2022-12-01 | |
| yue.wikiquote.org | < 25 | 2021-04-01 | |
| login.wikipedia.org | < 25 | 2020-11-01 | |
| en.wikidata.org | < 25 | 2025-03-01 | |
| zh.wikifunctions.org | < 25 | 2025-08-01 | |
| pk.wikimedia.org | < 25 | 2023-08-01 | |
| ru.wikidata.org | < 25 | 2017-08-01 | |
| meta.wikidata.org | < 25 | 2022-07-01 | |
| fr.wikidata.org | < 25 | 2022-06-01 | |
| d1j57rhr48eenml3mfb0.wikidata.org | < 25 | 2025-07-01 | |

Structured lists of the domains:

('uz.wikimedia.org', 'mo.wikipedia.org', 'wikipedia.org', 'als.wiktionary.org', 'za.wikimedia.org', 'als.wikiquote.org', 'als.wikibooks.org', 'mo.wiktionary.org', 'slo.wikimedia.org', 'wikimedia.org', 'yue.wikibooks.org', 'noc.wikimedia.org', 'download.mediawiki.org', 'zh.wikidata.org', 'yue.wikiquote.org', 'login.wikipedia.org', 'en.wikidata.org', 'zh.wikifunctions.org', 'pk.wikimedia.org', 'ru.wikidata.org', 'meta.wikidata.org', 'fr.wikidata.org', 'd1j57rhr48eenml3mfb0.wikidata.org')
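
The fallback suggested in this comment (use canonical data when the project is found, otherwise append ".org") might be sketched as follows. The function name `resolve_domain` and the one-row lookup are hypothetical, and the sketch assumes the project code for wikimediafoundation.org is simply `wikimediafoundation`.

```python
from typing import Mapping


def resolve_domain(project: str, canonical: Mapping[str, str]) -> str:
    """Use the canonical domain when the project is known, else append ".org"."""
    return canonical.get(project, project + ".org")


# Hypothetical extract of canonical data (project code -> canonical domain).
canonical = {"wikidata": "www.wikidata.org"}

# Found in canonical data: the "www." form is used.
known = resolve_domain("wikidata", canonical)

# Not a wiki, so absent from canonical data: falls back to adding ".org".
unknown = resolve_domain("wikimediafoundation", canonical)
```

Under these assumptions, wikimediafoundation.org keeps its address-bar form, and the odd domains listed above would also pass through with ".org" appended rather than being dropped or failing the lookup.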

Change #1194885 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Fix unique-devices per domain domains

https://gerrit.wikimedia.org/r/1194885

Change #1194885 merged by Joal:

[analytics/refinery@master] Fix unique-devices per domain domains

https://gerrit.wikimedia.org/r/1194885

Mentioned in SAL (#wikimedia-analytics) [2025-10-15T16:05:56Z] <mforns> Finished deploying Refinery at 94efa6e8221602a331c19c39ea909eeaa90d98b4 for T405533 unique devices domains

Patch has been deployed, last 2 days of backfill ongoing, calling this done :)