
NEW BUG REPORT wmf.interlanguage_navigation missing mobile data
Open, High · Public · BUG REPORT

Description

Data Platform Engineering Bug Report or Data Problem Form.

Please fill out the following
Please ensure you set priority

What kind of problem are you reporting?

  • Access related problem
  • Service related problem
  • Data related problem
For a data related problem:
  • Is this a data quality issue? Yes
  • What datasets and/or dashboards are affected? wmf.interlanguage_navigation
  • What are the observed vs expected results?
    • Expected: counts of all daily webrequests that go from one language version of a project to another language version of the same project.
    • Actual: mobile requests are excluded (i.e., webrequests whose referer host contains ".m" or another qualifier)

See lines 55 and 56 in https://gerrit.wikimedia.org/g/analytics/refinery/+/fee5f29f8f1955f292532e65478bc6eaddea9846/hql/interlanguage/daily/interlanguage_navigation.hql (pasted below)

-- The referer host has no .m, or other qualifiers
AND size(normalize_host(parse_url(referer, 'HOST')).qualifiers) = 0
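
To illustrate why this condition drops mobile traffic, here is a minimal Python sketch of how a hostname might be split into parts. It is only a simplified approximation: `split_host`, `passes_qualifier_filter`, and the field names are hypothetical and do not reproduce the actual `normalize_host` UDF. It assumes hosts of the form language[.qualifier...].project.tld.

```python
from urllib.parse import urlparse


def split_host(host: str) -> dict:
    """Split a host such as 'en.m.wikipedia.org' into labelled parts.

    Hypothetical simplification: the first label is treated as the language,
    the last two as project family and TLD, and anything in between as a
    qualifier (for example 'm' for the mobile site).
    """
    labels = host.lower().split(".")
    if len(labels) < 3:
        raise ValueError(f"expected at least three labels, got {host!r}")
    return {
        "project": labels[0],
        "qualifiers": labels[1:-2],
        "project_family": labels[-2],
        "tld": labels[-1],
    }


def passes_qualifier_filter(referer: str) -> bool:
    """Mimic `size(...qualifiers) = 0`: keep only referers without qualifiers."""
    host = urlparse(referer).hostname
    if host is None:
        return False
    return len(split_host(host)["qualifiers"]) == 0


print(passes_qualifier_filter("https://en.wikipedia.org/wiki/Earth"))    # desktop referer
print(passes_qualifier_filter("https://en.m.wikipedia.org/wiki/Earth"))  # mobile referer
```

Under these assumptions the desktop referer is kept and the mobile referer is filtered out, because "m" counts as a qualifier.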

Solution: Is it possible to fix this to include mobile referrers going forward? And is it possible to retroactively correct the historical data, as well?

Details

Other Assignee: JAllemandou

Event Timeline

Thanks @CMyrick-WMF! Some additional context for the Data Platform/Engineering folks: we were doing some analyses related to language-switching on Wikipedia and came across this wmf.interlanguage_navigation table, which seemed to have the data that we wanted (counts of how often readers switched between language editions). We then realized that it is actually restricted to desktop-only data, which misses a lot of the language switching done by readers on our mobile platform. This isn't technically a bug, in the sense that the desktop-only filter was intentional when this table was put together (details). No current project depends on this data, but it felt important to document this gap somewhere. A few potential options I could see:

  • Consolidate/update the documentation on DataHub (there are a couple of tables with the same name in different databases, and I'm not sure which is actually the "right" one), Wikitech, etc. to try to warn folks of this gap.
  • Remove the table if it's no longer actually used to avoid folks drawing incorrect conclusions -- I don't personally love this because I do think it's a useful facet of reader traffic to track and something that isn't otherwise preserved in our clickstream or pageview_hourly datasets. There's some eventlogging for UniversalLanguageSelector (schema) that maybe has some of the data but I don't know all the details there.
  • Update it to have a column for desktop vs. mobile and remove the mobile filter condition so that the dataset is actually more complete going forward.
Isaac renamed this task from NEW BUG REPORT wmf.nterlanguage_navigation missing mobile data to NEW BUG REPORT wmf.interlanguage_navigation missing mobile data. (Jun 10 2025, 6:59 PM)
Isaac updated the task description.

Another bug we just came across, related to the referer path:

Per https://gerrit.wikimedia.org/g/analytics/refinery/+/fee5f29f8f1955f292532e65478bc6eaddea9846/hql/interlanguage/daily/interlanguage_navigation.hql (lines 57-58)

-- The referer path was something with a /wiki/ beginning, like a normal article path
AND parse_url(referer,'PATH') LIKE '/wiki/%'

This condition excludes many (if not most) interlanguage switches. Browsers rarely send the full referer anymore, so when someone navigates from an enwiki article to a dewiki article, the referer contains only "https://en.m.wikipedia.org/" even though the referring page was an article (e.g. https://en.m.wikipedia.org/wiki/Earth). The path of such a referer is just "/", which does not match '/wiki/%'.
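
A small Python sketch of the effect, assuming the filter behaves like a plain prefix match on the URL path. `passes_wiki_path_filter` is a hypothetical helper written for illustration, not code from the refinery job.

```python
from urllib.parse import urlparse


def passes_wiki_path_filter(referer: str) -> bool:
    """Mimic `parse_url(referer, 'PATH') LIKE '/wiki/%'` as a prefix check."""
    return urlparse(referer).path.startswith("/wiki/")


# A full referer, as browsers used to send it.
full_referer = "https://en.m.wikipedia.org/wiki/Earth"
# An origin-only referer, which is what is typically sent across sites today.
origin_only_referer = "https://en.m.wikipedia.org/"

print(passes_wiki_path_filter(full_referer))         # True
print(passes_wiki_path_filter(origin_only_referer))  # False
```

The origin-only referer fails the check even though the reader did come from an article, so that language switch is not counted.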

Change #1172085 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery@master] WIP - migrate, update, and test interlanguage_navigation data

https://gerrit.wikimedia.org/r/1172085

K, @CMyrick-WMF and @Isaac what I'm thinking here is:

  • add a navigation_type column of the form previous-URL-type_to_current-URL-type, i.e. one of mobile_to_mobile / mobile_to_desktop / desktop_to_mobile / desktop_to_desktop
  • delete the old Hive table at wmf.interlanguage_navigation and keep the Iceberg table at wmf_traffic (making sure that has all the historical data)
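
As a rough illustration of the proposed navigation_type values, here is a Python sketch. The helper names `site_type` and `navigation_type` are hypothetical, and the mobile check (an "m" label between the language and the project domain) is an assumption made for this example rather than the logic in the patch.

```python
from urllib.parse import urlparse


def site_type(url: str) -> str:
    """Return 'mobile' if the host carries an 'm' label, else 'desktop'."""
    host = urlparse(url).hostname or ""
    # Labels between the first one and the last two, e.g. ['m'] for en.m.wikipedia.org.
    middle_labels = host.split(".")[1:-2]
    return "mobile" if "m" in middle_labels else "desktop"


def navigation_type(previous_url: str, current_url: str) -> str:
    """Combine both sides into previous-URL-type_to_current-URL-type."""
    return f"{site_type(previous_url)}_to_{site_type(current_url)}"


print(navigation_type("https://en.m.wikipedia.org/", "https://de.m.wikipedia.org/wiki/Erde"))
print(navigation_type("https://en.wikipedia.org/", "https://de.wikipedia.org/wiki/Erde"))
```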

I sent a patch that does roughly that, and also removes the /wiki/ filter as Caroline suggested. Thoughts?

Thanks @Milimetric! One thought: would it be easier to just record whether the previous URL is mobile or desktop, as opposed to the four values mobile_to_mobile, mobile_to_desktop, desktop_to_desktop, desktop_to_mobile? The thinking is that the language switch presumably always either honors the previous URL or, if it does change, it's not because the user requested it but because, e.g., the page automatically redirected to desktop or something like that. So from an analysis perspective, you probably care most about which type of page the person started on (because that tells you which type of UI they were using when they triggered the switch), and where they ended up doesn't tell you anything additional. Hopefully that simplifies the code a little bit and is easier to query as well. @CMyrick-WMF I welcome your thoughts as well!

Thanks! Since we're dealing with webrequest data, I was thinking of adding an access_method column, populated with desktop and mobile web (just like in the webrequest table), to keep it as similar as possible to the webrequest table. What are y'all's thoughts about that?

Oh yep, that would work for me and is probably even simpler! It captures the access method at the current-URL stage, but that should still generally reflect which UI the person was using when they clicked the link.

k, works for me, updated code in new patch. Just a note that access_method also includes 'mobile_app'. In this case, the current and previous access_methods would always either both BE 'mobile_app' or both NOT BE 'mobile_app'. This is because the getAccessMethod UDF first looks at the user agent to determine whether it's an app request, and the user agent is the same across interlanguage navigation. But I think we should keep those results for consistency with other access_method uses. I've made milimetric.interlanguage_navigation data world-readable, so please take a look at the data and tell me what you think. With the new filters there's a lot more of it.
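
The point about the user agent can be sketched in Python as follows. This is not the getAccessMethod UDF: the function, the "WikipediaApp" user-agent marker, and the host check are assumptions made for illustration, and the returned labels simply follow the values named in this thread.

```python
from urllib.parse import urlparse

# Assumed marker identifying app requests; the real UDF's rule may differ.
APP_MARKER = "WikipediaApp"


def access_method(url: str, user_agent: str) -> str:
    """Classify a request, checking the user agent before the host."""
    if APP_MARKER in user_agent:
        return "mobile_app"
    host = urlparse(url).hostname or ""
    # An 'm' label between language and project domain marks the mobile site.
    if "m" in host.split(".")[1:-2]:
        return "mobile web"
    return "desktop"


app_ua = "WikipediaApp/2.7 (Android)"  # made-up example user agent
browser_ua = "Mozilla/5.0"

# The user agent does not change during a language switch, so the previous
# and current requests are either both app requests or both not.
for ua in (app_ua, browser_ua):
    previous = access_method("https://en.m.wikipedia.org/", ua)
    current = access_method("https://de.wikipedia.org/wiki/Erde", ua)
    print(previous, "->", current)
```

With the browser user agent the two sides can still differ between mobile web and desktop, but neither is ever classified as an app request.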

Thanks @Milimetric! I think day should be changed to date to align with current interlanguage_navigation column names. Do you agree?

Otherwise, looks great.

That date -> day rename happened when they made the Iceberg version of the table. I'm not sure why; reaching out to Thomas.