
Analyze impact for webrequest and unique devices pipelines to derive access_method without m-dot domain
Closed, Resolved (Public)

Description

Per T214998: RFC: Serve mobile and desktop variants through the same URL (unified mobile routing)

Check with Data Engineering how pageview data computes access_method="mobile web". This may need to be updated to check the mobile header or X-Analytics too. Depending on how this works today, Domain Unification may improve accuracy by no longer misreporting useformat=mobile as "desktop web".

Event Timeline

Heya!

I've put together a spreadsheet with all the datasets that we in Data Engineering maintain, with information about whether they would be affected by the domain name unification.
https://docs.google.com/spreadsheets/d/17v-xJJHxlOB_gSR0s4v8EoIJStEQv0J5HP8SBQASnkI/edit?usp=sharing

The main takeaways are:

  • Webrequest is affected (as mentioned in the task description), and so are many datasets that depend on webrequest. The good news is that we only need to fix the access_method in the refine_webrequest job, and all downstream datasets will be fine. AFAICS, no current header in webrequest carries the information that we need. We could implement a new x-analytics key that tells us whether the page is served via the desktop or mobile frontend. @phuedx mentioned that this modification would take at least 3 days to design and implement. Maybe a bit more for testing and deployment? Once we have that, we'd need to adjust the refine_webrequest job, which would probably take a couple of days as well.
  • The Unique Devices pipeline is affected as well. The bad news is that it's not using access_method, but rather reparsing the dot-m modifier from the domain at each step (data lake, druid, cassandra and wikistats2). So this looks like a bigger refactor. We'd have to modify each step of the pipeline, and the results would not be backwards compatible. If we went for this, I guess it would take a couple of weeks, and it would stir up more controversy around the Unique Devices metrics. We could do a quick hack instead, which would be to synthesize the m-dot uri host notation from the access_method in the first step of the pipeline. It would be super hacky, but quick (a couple of days) and painless, and it would keep the metrics as they are today and backwards compatible.
  • Analyses on top of externally-owned event tables could be affected depending on what queries are hitting them. If there are queries parsing the fields containing the uri host for m-dot modifiers to determine whether the event comes from the mobile or desktop frontend, then those queries will need refactoring.
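To make the "quick hack" for Unique Devices more concrete, here is a rough sketch of what synthesizing the m-dot host from access_method could look like. Everything in it is an assumption for illustration: the function name is hypothetical, the "mobile web" value is taken from the task description, and the rule for where the "m" label goes (replace a leading "www", otherwise insert it before the registrable domain) is a guess at the convention, not the actual pipeline code.

```python
def synthesize_mdot_host(uri_host: str, access_method: str) -> str:
    """Rewrite uri_host to m-dot notation when access_method is "mobile web".

    Hypothetical helper: the placement rule for the "m" label is an assumption.
    """
    if access_method != "mobile web":
        return uri_host  # desktop (or anything else) keeps its host untouched

    labels = uri_host.lower().split(".")
    if "m" in labels or len(labels) < 2:
        return uri_host  # already m-dot, or not a host we know how to rewrite

    if labels[0] == "www":
        labels[0] = "m"  # www.wikidata.org -> m.wikidata.org
    else:
        # en.wikipedia.org -> en.m.wikipedia.org
        labels.insert(len(labels) - 2, "m")
    return ".".join(labels)


if __name__ == "__main__":
    for host in ("en.wikipedia.org", "www.wikidata.org", "en.m.wikipedia.org"):
        print(host, "->", synthesize_mdot_host(host, "mobile web"))
```

With something like this in the first step, the later steps (druid, cassandra, wikistats2) could keep reparsing the host as they do today.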
Ottomata renamed this task from Assess data platform implications for RFC domain unification to Assess data platform implications for RFC m. domain name unification. Mar 31 2025, 1:33 PM

Regarding estimation of time necessary to complete this effort:

  1. For x-analytics and webrequest modifications I'd say 2 weeks to complete, 1 engineer.
  2. As for Unique Devices, it depends on whether we choose to go with the hack described in the previous comment, or fix the whole pipeline.
    • The hack would take about 2-3 days (once x-analytics/webrequest is done).
    • Fixing the whole Unique Devices pipeline would take about 1 month for 1 engineer, maybe more? (we'd need to fix the hive job, probably backfill since the start of the dataset, the druid job, the cassandra job, the AQS service endpoint, and Wikistats2 specific code, with all the testing and deployments this would take). Maybe 2 engineers could work on it in parallel and reduce it to 2-3 weeks?
  3. Leaving the changes to externally-owned event datasets out of the estimation.
Krinkle renamed this task from Assess data platform implications for RFC m. domain name unification to Update webrequest refinery to support access_method="mobile web" without m-dot domain. Apr 3 2025, 3:32 AM
Krinkle renamed this task from Update webrequest refinery to support access_method="mobile web" without m-dot domain to Update webrequest and unique devices pipelines to derive access_method without m-dot domain. Apr 3 2025, 6:51 AM

It seems there have been misunderstandings about what the existing access_method captures, and about how, even when it is combined with skin, the experience can differ subtly for the same client. It would be nice to address this before the change.

For example access_method="mobile web" and skin=vector-2022 is a very different experience from access_method="mobile web" and skin=vector and access_method="mobile web" and skin=minerva yet existing analysis seems to consider them all as the same.

The most extreme example can be demonstrated by using an incognito window to compare https://en.m.wikipedia.org/wiki/Paris and https://en.wikipedia.org/wiki/Paris?useskin=minerva - the UI is very different - but both clients log the same access_method and skin. Wikifunctions doesn't have a mobile domain, so it uses the same app for mobile and desktop devices.

I recommend going forward we add a new field to all schemas that allows the capture of the app (or mediawiki interface) being used to access our content.

@Krinkle's patch at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/NavigationTiming/+/1133590 reminded me that this is the only way to reliably distinguish how a user is using the site. For webrequests I imagine this might need to inspect the headers?

Hi @Jdlrobson!

For example access_method="mobile web" and skin=vector-2022 is a very different experience from access_method="mobile web" and skin=vector and access_method="mobile web" and skin=minerva yet existing analysis seems to consider them all as the same.

Agree. Whether access_method is mobile_web in wmf.webrequest and derived tables (such as wmf.pageview_hourly) is based solely on the URI host pattern; see the GetAccessMethod UDF. It is not ideal! We could maybe take the opportunity to improve it if we switch to using x-analytics.
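To illustrate why a host-only rule falls short, here is a deliberately simplified sketch of host-based classification. It is not the GetAccessMethod UDF itself: the function name, the returned labels ("mobile web" / "desktop") and the rule "any m label in front of the registrable domain" are assumptions for illustration, and it ignores any other signals the real UDF may look at.

```python
def access_method_from_host(uri_host: str) -> str:
    """Classify a request using nothing but its host name (simplified sketch)."""
    labels = uri_host.lower().split(".")
    # Only look at the labels in front of the registrable domain
    # ("wikipedia.org"), so a hypothetical "m.org" would not match.
    return "mobile web" if "m" in labels[:-2] else "desktop"


if __name__ == "__main__":
    # The query string (?useskin=minerva) never reaches this function,
    # so a mobile-looking page on the desktop host is still "desktop".
    print(access_method_from_host("en.m.wikipedia.org"))  # mobile web
    print(access_method_from_host("en.wikipedia.org"))    # desktop
```

Once every variant is served from the same host, a rule of this shape has nothing left to key on, which is why a separate signal such as an x-analytics key would be needed.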

The most extreme example can be demonstrated by using an incognito window to compare https://en.m.wikipedia.org/wiki/Paris and https://en.wikipedia.org/wiki/Paris?useskin=minerva - the UI is very different - but both clients log the same access_method and skin. Wikifunctions doesn't have a mobile domain, so it uses the same app for mobile and desktop devices.

IIUC, those two requests would log different access_method values in wmf.webrequest and derived datasets (mobile_web and desktop respectively), even though the UI looks mobile-like in both cases. Is that what you meant? Maybe, if we go for the x-analytics approach, we can make sure mobile-frontend is logged whenever a mobile-like skin is used.

Ahoelzl renamed this task from Update webrequest and unique devices pipelines to derive access_method without m-dot domain to Analyze impact for webrequest and unique devices pipelines to derive access_method without m-dot domain. Apr 9 2025, 12:03 AM

Maybe, if we go for the x-analytics approach, we can make sure mobile-frontend is logged whenever a mobile-like skin is used.

That does sound better.

For a true definition of "is mobile site" we'd want to make use of MobileContext::$context->shouldDisplayMobileView() somewhere in our instrumentation code.

For example, we add a wgMFMode configuration variable on mobile that is not present on desktop, and it is currently used by some instrumentation.

Change #1152310 had a related patch set uploaded (by Jdlrobson; author: Jdlrobson):

[mediawiki/extensions/EventLogging@master] The client_platform_family field should correspond with whether the skin is responsive or not

https://gerrit.wikimedia.org/r/1152310

The above patch demonstrates how I believe we should solve this going forward based on the sorts of analysis I see teams doing incorrectly.

I think use of wgMFMode is not ideal for the reasons I describe in the commit message.

Happy to talk through this more if it's controversial or we have more questions.

Change #1152310 abandoned by Jdlrobson:

[mediawiki/extensions/EventLogging@master] The client_platform_family field should correspond with whether the skin is responsive or not

Reason:

Clearing out open tickets before I leave.

https://gerrit.wikimedia.org/r/1152310

This is now live.

$ curl -I 'https://en.m.wikipedia.org/wiki/Main_Page' -H 'X-Wikimedia-Debug: 1'

x-analytics: …;https=1;ismobile=1;debug=1;…
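For anyone consuming this downstream, here is a minimal sketch of reading that flag out of an x-analytics value. The helper names are hypothetical, and treating ismobile=1 as "served by the mobile frontend" is an assumption based on the header shown above, not a description of what refine_webrequest actually does.

```python
def parse_x_analytics(header: str) -> dict:
    """Split a semicolon-separated key=value header into a dict."""
    pairs = {}
    for item in header.split(";"):
        key, sep, value = item.partition("=")
        # Skip fragments without "=" (for example elided or empty items).
        if sep and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


def is_mobile_frontend(header: str) -> bool:
    """True when the header carries ismobile=1 (assumed meaning: mobile frontend)."""
    return parse_x_analytics(header).get("ismobile") == "1"


if __name__ == "__main__":
    print(is_mobile_frontend("https=1;ismobile=1;debug=1"))
    print(is_mobile_frontend("https=1;debug=1"))
```

A missing key is treated as "not mobile" here; whether that is the right default for requests that never pass through MediaWiki is an open question.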