Page MenuHomePhabricator

Update unique_devices tables to add the `access_method` field
Closed, ResolvedPublic

Description

The .m subdomain will be removed soon, but the per-domain unique-devices metric heavily relies on it as the .m subdomain is a line in the metric.
While we should have used the access-method from the beginning, better late then never.
Plan:

  • Verify that the split using access_method gives similar summed-up values and per-domain values for per-domain unique-devices table
  • Implement the change (unique-devices per-domain only)
    • Add access_method column to unique_devices_per_domain iceberg tables
    • update unique-devices-per-domain computation accordingly
    • remove old non-iceberg tables for both per-domain and per-rpject-family
    • Update druid loading job for per-domain data (addition of new column)
    • Update cassandra loading job for per-domain data (use new column instead of domain-parsing)
  • Backfill iceberg tables new column in old data using the .m subdomain parsing proxy for mobile web (we won't have the mobile app values backfilled
  • Backfill druid new column using the iceberg table
  • Backfill AQS cassandra tables as the change includes a bug-fix for T299559

Event Timeline

The decision of hacking the metric-definition has been challenged (thanks @Ottomata :)
By looking into details, we wouldn't have to change nor AQS nor Wikistats UI in we change the unique-devices per-domain to be split by a new field as the notion of access-site is already abstracted in AQs (see the loading job: https://github.com/wikimedia/analytics-refinery/blob/master/hql/cassandra/monthly/load_cassandra_unique_devices_monthly.hql#L41).
So we could change the unique-devices per-domain hive and druid tables to, similarly to what AQS does, abstract the notion of mobile vs desktop access in a separate field.
@Milimetric , I could use your input on this proposal :)

If emulating the m-dot domain in older schemas for compatibility, note that adding the m. is significantly harder than detecting or removing it!

I would broadly suggest one of these two approaches:

  1. Explicitly and openly make no attempt to match real domain names and e.g. blindly prepend m. or append .m when x_analytics[ismobile]=1, so you get m.en.wikipedia.org and m.www.wikidata.org instead of en.m.wikipedia.org or m.wikidata.org. None of these will exist in DNS.
  2. Carefully translate the Varnish implementation with a bi-directional code comment pointing to the other one to keep it in sync. Source: puppet/varnish/text-frontend.vcl#mobile_redirect(). This means:
    • keep a list of top-level domains (mediawiki.org, wikidata.org, etc.) where an optional "www" is replaced by an "m".
    • keep a list of top-level domains (wikipedia.org, wiktionary.org, etc.) where a mandatory subdomain is followed by "m".
      • keep a short-list of exempt subdomains of these that do not have a mobile version (e.g. nostalgia.wikipedia.org, thankyou.wikipedia.org, etc)
    • treat wikimedia.org by default as not having a mobile version, and keep a short-list of wikimedia.org subdomains that do have have a mobile version (meta.wikimedia.org, commons.wikimedia.org, etc).
    • treat any other domain as not having a mobile domain.
    • update this ~1x a year when new wikis are created with mobile versions under wikimedia.org or top-level domains, or when a new wiki under a different domain is created without a mobile version.
    • wikis created after 2025 will never have used a m-dot domain and thus may lack a DNS entry or compat-redirect. Long-term this dataset may thus refer to domains that do not actually exist, similar to approach 1. This is TBD. We could enforce in policy that we specifically want to create m-dot on new wikias as well, specifically to allow post-2025 data to be represented in this way with a domain that works if turned back into a URL.

After talking to @Milimetric and @mforns , we are in agreement that it would be better to not emulate a fake .m subdomain, but rather change the underlying unique-devices per-domain table to add the access_method field.This change can also be applied to unique-devices per-project-family but is not mandatory.
As we need to touch the unique-devices jobs, we'd like to take advantage to bundle other changes: remove the non-iceberg hive table to only keep the iceberg one, and remove the druid datasets as the data is small enough in hive to be queried using presto.
Before going in this direction we would like to have approval from @OSefu-WMF 's team.

Hi @JAllemandou, we also think it would be a good idea to add the access_method field to the domain tables.

Ideally, we would like to also add the access_method to the per project family tables as well for future analysis, but we'd like to clarify that it won't lead to any kind of over-counting when we report the total number of unique devices. As we understand the proposed changes, the total unique device count per project family would stay the same except it would be split between mobile, mobile app, and desktop categorizations (i.e., if currently for July it says 1 billion unique devices then after the change we should expect to see 500 million mobile, 300 million desktop and 200 million mobile app)?

I agree that adding an access_method field would be preferable to faking the mobile subdomain as long as the old data is rewritten accordingly (so that mobile domains become the main domain with an access method of "mobile web"). Otherwise, it will become a huge pain to query across two different ways of representing the access method.

But I do wonder: how precisely is this going to work once the mobile web and desktop interfaces set their last-access cookies on the same domain? Previously, if one device visited both sites, it had to be treated as two separate devices because there was no shared domain to set a cookie on. Now, we can either make a special effort to preserve that logic (so that a user who switches interfaces gets counted once in each bucket) or we can take advantage of the now-shared domain to avoid that double counting (so that a user who switches gets counted in just one bucket, based on the interface they're using when they show us with an old last access cookie).

I'm inclined to do the second one. Manual switching is probably very rare and is dwarfed by the automatic switching that happens from desktop users encountering m-dot links. So, even if we preserve the old logic, there will be a step-change in the data because the automatic switching will stop happening. So we might as well switch to the new, simpler and more accurate logic.

Edited on Aug 26 to remove wrong citation at the end.

Hi @JAllemandou, we also think it would be a good idea to add the access_method field to the domain tables.

Ideally, we would like to also add the access_method to the per project family tables as well for future analysis, but we'd like to clarify that it won't lead to any kind of over-counting when we report the total number of unique devices. As we understand the proposed changes, the total unique device count per project family would stay the same except it would be split between mobile, mobile app, and desktop categorizations (i.e., if currently for July it says 1 billion unique devices then after the change we should expect to see 500 million mobile, 300 million desktop and 200 million mobile app)?

You're right: the objective is to have the access_method field be a partition of the previously bundled data.
With the new column

  • for the per-domain metric we will have mobile-web reported instead of having through the .m subdomain, and we will split the rest between desktop and mobile-app (almost no mobile-app in comparison to desktop). As Neil pointed out in his comment, we would gain accuracy by not double-counting devices accessing both .m and not-.m subdomains of the same project.
  • for the per-project-family, the sum across the access_method should match the numbers we find currently.

Now that I have your acceptance, I'm going to generate some test data on both metrics (per-domain and per-project-family) daily (I don't have enough to do it monthly). Hopefully it'll work as expected :)

I have a last demand to both of you @nshahquinn-wmf and @Hghani: Are you ok with us taking advantage of this breaking change to remove both the druid datasource and the hive datasource, keeping only the iceberg one? The idea is to simplify by removing datasources that are the same :)
Thank you for your answers!

I have generated one day of per-domain data in this table:

select *
from joal.unique_devices_per_domain_daily_access_method
where day = '2025-08-18'
limit 10

I have also extracted per-project and access_method numbers in this spreadsheet: https://docs.google.com/spreadsheets/d/17rbYqsgco7te2NVwrl6a9T3POxRzRdMc2X9JKRH7hxY/edit?gid=1178950788#gid=1178950788
The numbers match almost exactly, with at most 0.1% difference, on only a few cases.

May I let you have look @Hghani and @nshahquinn-wmf ?

I the meantime I'll also generate data for the per-project-family table.

Are you ok with us taking advantage of this breaking change to remove both the druid datasource and the hive datasource, keeping only the iceberg one? The idea is to simplify by removing datasources that are the same :)

@JAllemandou , will this work in Turnilo if the druid source is removed?

Are you ok with us taking advantage of this breaking change to remove both the druid datasource and the hive datasource, keeping only the iceberg one? The idea is to simplify by removing datasources that are the same :)

@JAllemandou , will this work in Turnilo if the druid source is removed?

No, if we remove the druid datasource, turnilo will not show it anymore. The idea behind removing druid is to remove the number of tools used to access data, and have people use Superset.
While I write the above, I also completely acknowledge how good turnilo is for easy slice-and-dice vizualization!
But the ultimate goal I pursue is to try to coalesce toward less tools.

hey @JAllemandou , unless there is a bigger Foundation wide initiative to get rid of Turnilo I don't think we should remove this dataset from druid.
As much as I agree on the efficiency use case we are aware of many users (and who are part of Leadership team) that use Turnilo regularly. We would want this data to be available in Turnilo and Druid.

JAllemandou renamed this task from Hack unique_devices_per_domain recreating `.m` subdomain use `x_analytics` `ismobile` value to Update unique_devices tables to add the `access_method` field.Aug 29 2025, 10:05 AM
JAllemandou updated the task description. (Show Details)

I have generated data in joal.unique_devices_per_project_family_daily_access_method table. I have also copied aggregated data in the spreadsheet in a new sheet.
What I observe is that we overcount fresh-sessions. My assumption is that what we define as one actor is sometimes flag as ismobile, sometimes not. But that's just an assumption.

@JAllemandou Thanks for generating the test data. The domain data looks okay to me. I agree with @Mayakp.wiki that we don't want to lose the turnilo data for the reasons she mentioned. I don't think Superset can be treated as a substitute for the easy-to-use turnilo charts yet.

Regarding the project family data, the inflated numbers don't look trivial. Since we don't have any specific need for this breakdown yet, my opinion is that we should keep the project family as-is to avoid the risk of introducing new noise into the data.

@JAllemandou You probably already have them in mind, but just in case, here are a couple implementation thoughts:

  • Please include normalize the trailing dots in the domain names (T395963)
  • As Krinkle already mentioned, mapping the mobile domain name to the main domain name is not always as simple as just removing the m. wherever it appears. The canonical wiki dataset has the correct mappings for all existing wikis. That might not cover all the mobile domain names in the dataset as wikis that have been deleted (which is not the same and much less common than being closed) are removed from the canonical dataset. There are some domain names that never corresponded to a real wiki, but I think none of those had mobile versions.

@JAllemandou Let me know if/when the rollout may continue (from the POV of the unique devices dataset, that is). It looks like you have a draft patch ready, and snapshot with the change applied is being reviewed, is that right?

Depending on this, we might be able to rollout to a few more wikis later this week (T403510). Afaik this task is a blocker, because we don't want any gaps in the data for non-test wikis.

@JAllemandou Thanks for generating the test data. The domain data looks okay to me. I agree with @Mayakp.wiki that we don't want to lose the turnilo data for the reasons she mentioned. I don't think Superset can be treated as a substitute for the easy-to-use turnilo charts yet.

Regarding the project family data, the inflated numbers don't look trivial. Since we don't have any specific need for this breakdown yet, my opinion is that we should keep the project family as-is to avoid the risk of introducing new noise into the data.

Ok, I'll provide patches to apply this changes:

  • Removal of hive non-iceberg data
  • Change to Unique-devices-per-domain
    • addition of access_method field
    • backfilling of the mobile web value using the .m subdomain in existing data
  • Change to druid unique-devices-per-domain dataset to reflect the addition of the access_method field
  • Change to cassandra loading to account for the addition of the new column

I will keep the druid datasources.

@JAllemandou You probably already have them in mind, but just in case, here are a couple implementation thoughts:

  • Please include normalize the trailing dots in the domain names (T395963)

This will be part of the change :)

  • As Krinkle already mentioned, mapping the mobile domain name to the main domain name is not always as simple as just removing the m. wherever it appears. The canonical wiki dataset has the correct mappings for all existing wikis. That might not cover all the mobile domain names in the dataset as wikis that have been deleted (which is not the same and much less common than being closed) are removed from the canonical dataset. There are some domain names that never corresponded to a real wiki, but I think none of those had mobile versions.

We won't generate fake domains so we don't have to worry about those.

@JAllemandou Let me know if/when the rollout may continue (from the POV of the unique devices dataset, that is). It looks like you have a draft patch ready, and snapshot with the change applied is being reviewed, is that right?

Depending on this, we might be able to rollout to a few more wikis later this week (T403510). Afaik this task is a blocker, because we don't want any gaps in the data for non-test wikis.

Now that we have a +1 from analysts, you can rollout as you see fit. I'm interested to know when you start doing it if possible, to backfill the metrics from this date onward.
I expect to be done with the patch and the migration on our end before Friday this week.

Ah I see. I wasn't sure how frequently the dataset is published/used. To confirm- you're saying rolling out further tomorrow, for example, would not harm the Unique Devices dataset?

Change #1186087 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Add access_method to unique-devices per-domain

https://gerrit.wikimedia.org/r/1186087

  • As Krinkle already mentioned, mapping the mobile domain name to the main domain name is not always as simple as just removing the m. wherever it appears. The canonical wiki dataset has the correct mappings for all existing wikis. That might not cover all the mobile domain names in the dataset as wikis that have been deleted (which is not the same and much less common than being closed) are removed from the canonical dataset. There are some domain names that never corresponded to a real wiki, but I think none of those had mobile versions.

We won't generate fake domains so we don't have to worry about those.

I was thinking about reshaping the existing data: you'll have to translate any mobile domains you encounter back to the appropriate main domain, which isn't as straightforward as you'd think. But I'm sure you're way ahead of me!

Ah I see. I wasn't sure how frequently the dataset is published/used. To confirm- you're saying rolling out further tomorrow, for example, would not harm the Unique Devices dataset?

If you start rolling out today for instance, the unique-devices for projects you roll out will be impacted, but I will backfill the data once the code has changed.

  • As Krinkle already mentioned, mapping the mobile domain name to the main domain name is not always as simple as just removing the m. wherever it appears. The canonical wiki dataset has the correct mappings for all existing wikis. That might not cover all the mobile domain names in the dataset as wikis that have been deleted (which is not the same and much less common than being closed) are removed from the canonical dataset. There are some domain names that never corresponded to a real wiki, but I think none of those had mobile versions.

We won't generate fake domains so we don't have to worry about those.

I was thinking about reshaping the existing data: you'll have to translate any mobile domains you encounter back to the appropriate main domain, which isn't as straightforward as you'd think. But I'm sure you're way ahead of me!

Ah! I'm definitely not ahead :) I'll reuse the same pattern we used for loading into cassandra: domain RLIKE '(^(m)\\.)|\\.m\\.'

Change #1186087 merged by Joal:

[analytics/refinery@master] Add access_method to unique-devices per-domain

https://gerrit.wikimedia.org/r/1186087

Change #1188401 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Fix unique-devices jobs

https://gerrit.wikimedia.org/r/1188401

Change #1188404 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update turnilo conf for unique-devices per domain

https://gerrit.wikimedia.org/r/1188404

Change #1188404 merged by CDanis:

[operations/puppet@production] Update turnilo conf for unique-devices per domain

https://gerrit.wikimedia.org/r/1188404

Change #1188401 merged by Joal:

[analytics/refinery@master] Fix unique-devices jobs

https://gerrit.wikimedia.org/r/1188401

Change #1189854 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Fix bug in cassandra unique-devices loading

https://gerrit.wikimedia.org/r/1189854

Change #1189854 merged by Joal:

[analytics/refinery@master] Fix bug in cassandra unique-devices loading

https://gerrit.wikimedia.org/r/1189854

@JAllemandou there's actually a bug in the new data: the domains names for Wikidata, Wikifunctions, and MediaWiki.org are missing the the leading "www." (e.g. the data has "wikidata.org" rather than "www.wikidata.org"). The versions with the "www." are canonical: they are what appear in the address bar, in the SiteMatrix, in canonical data, and so on.

This isn't a critical issue, but it would be great to have it corrected before too long as it adds a bit of unnecessary friction to using the data.

While I understand the not-canonical concern, it happens that we've been using the non-www domain since the beginning of unique devices for those domains, at least in cassandra.
The fix is not cheap as not only does it make us change and backfill unique-devices again, but also we'd need to change wikistats and reload cassandra, deleting the old rows (not easy).
I suggest we keep it as is, even if not canonical. Let me know if it feels ok for you :)