Add page_id and namespace to X-Analytics header in Mobile App requests (2025 remake)
Closed, Resolved (Public)

Description

For T403660: WE3.3.7 Year in Review and Activity Tab Services - Global Editor Metrics, we are counting pageviews by page_id.

page_id makes its way into the pageview_hourly table via the X-Analytics header.

According to pageview_hourly docs, page_id is not currently set for 'mobile app pageviews':

As of 2017-06-12, page_id is populated on access methods desktop and mobile web requests, but not mobile app. This means that >95% of pageview requests have a page_id so far.

If page_id is not set in X-Analytics for mobile app page views, then we will not count views from mobile apps for T403660 - Global Editor Metrics. This is a bit weird, seeing as the Global Editor Metrics API endpoints are being developed to support features in mobile apps themselves.

(Probably the right thing to do is T371321: [Idea] Collect pageview data using client-side instrumentation, but that is a bit out of scope for this task.)


As of 2017-06-12, page_id is populated [...] not mobile app

In T92875: Add page_id and namespace to X-Analytics header in App / api requests, a 2018 comment by @Krinkle ({T92875#4071587}) indicates that page_id should be set. This looks like it was done via ApiMobileView, which I'm not sure is still in use in 2025. What about Page Content Service? Or perhaps something else after the RESTBase migration?

Is it possible the 2017 pageview_hourly docs are incorrect and we do set page_id for mobile apps?

Done is

  • Mobile app pageviews are counted in pageview_hourly Data Lake table with page_id and namespace_id fields set.

Event Timeline

cscott subscribed.

I'm assuming that we need to add the XAnalytics special sauce to the REST page html endpoints, since that's where apps are getting content from. PCS is also involved as an intermediary, but core is the one with the page id info.

From what I understand there are 2 missing pieces for adding the analytics for apps:

  1. Have the page id available on page view requests to PCS (PCS level)
  2. Add the missing headers on the request for /page/mobile-html/<title> to PCS (client side level) given the page id from the previous step

If XAnalytics can be added at the PCS level, then wouldn't it eliminate the need to add it at the client side?

I think that webrequest data stream includes requests that are cache hits on edge as well. If we implement it on PCS level that would count only the requests that are cache miss on edge.

If we implement it on PCS level that would count only the requests that are cache miss on edge.

Huh, I assumed X-Analytics would be cached along with the other response headers... I just tried to verify that but couldn't, so maybe it isn't. No, it must be cached, otherwise we'd only have X-Analytics for cache misses on regular desktop pageviews too.

Also, I believe X-Analytics is a response header. If so, clients can't set it, right?

Ok, that makes more sense now, thanks for the clarification. I thought that X-Analytics was a request header. So yeah, PCS should send it, and make sure that header persists on edge cache.
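To make the reasoning above concrete, here is a minimal toy model in Python. It assumes, as argued in the thread, that the edge stores the whole response including its headers. The `origin` function and `EdgeCache` class are hypothetical stand-ins for PCS and the edge cache, not real Wikimedia code, and the header value is illustrative.

```python
from typing import Callable, Dict, Tuple

Headers = Dict[str, str]


def origin(path: str) -> Headers:
    """Hypothetical origin (standing in for PCS): sets X-Analytics on the response."""
    return {"x-analytics": "page_id=82518;ns=0;"}


class EdgeCache:
    """Toy edge cache that stores the full response, headers included."""

    def __init__(self, fetch: Callable[[str], Headers]) -> None:
        self.fetch = fetch
        self.store: Dict[str, Headers] = {}

    def get(self, path: str) -> Tuple[str, Headers]:
        # A hit returns the stored headers without contacting the origin.
        if path in self.store:
            return "hit", self.store[path]
        headers = self.fetch(path)
        self.store[path] = headers
        return "miss", headers


cache = EdgeCache(origin)
first = cache.get("/page/mobile-html/Hymen")
second = cache.get("/page/mobile-html/Hymen")
print(first[0], first[1]["x-analytics"])
print(second[0], second[1]["x-analytics"])
```

Under this assumption, both the miss and the later hit carry the same X-Analytics value, so a header set once by the origin would show up in webrequest for cached requests as well.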

HSwan-WMF subscribed.

We are unclear what the Reader Growth team would need to change. If someone can provide details on that, then please get back to us and we will look at it again.

Change #1219601 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] WIP: Add pageid in x-analytics

https://gerrit.wikimedia.org/r/1219601

@Milimetric, @Jgiannelos has a question about ismobile and pageview X-Analytics in Add x-analytics header for mobile page views (1219601). Could you answer? Thank you!

Change #1219601 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Add x-analytics header for mobile page views

https://gerrit.wikimedia.org/r/1219601

This should be live after the last deployment.

Great okay! I think I see some. It looks like we need a change on the Hadoop side to set the webrequest Hive table fields page_id and namespace_id, but I do see them set in x_analytics on some requests.

Q: here are two "mobile app" pageview requests. Why are they set on one but not the other?

select 
  uri_host, 
  uri_path, 
  page_id, 
  namespace_id, 
  x_analytics 
from 
  wmf.webrequest 
where 
  is_pageview=true 
  and access_method = "mobile app" 
  and year=2026 and month=1 and day=28 and hour=12 
limit 100;
| uri_host | uri_path | page_id | namespace_id | x_analytics |
| --- | --- | --- | --- | --- |
| en.wikipedia.org | /api/rest_v1/page/mobile-html/Hymen | NULL | 0 | pageid=82518;ns=0;xxxxxx;pageview=1;https=1;wmfuuid=xxxxx;ismobile=1;client_port=xxxx;wmfuniq_days=8xxx;wmfuniq_weeks=xxx;wmfuniq_freq=xxx;x_is_browser=xxx;ja3n=xxxxxx;ja4h=xxxxx |
| de.wikipedia.org | /api/rest_v1/page/mobile-html/Nachrichtentechnik | NULL | NULL | xxxxx;pageview=1;https=1;wmfuuid=xxxxxx;ismobile=1;client_port=xxxx;sessioncookie=xxx;wmfuniq_days=xxxx;wmfuniq_weeks=xxxx;wmfuniq_freq=xxxx;ja3n=xxxxx;ja4h=xxxxxx |

Maybe the second request was cached? Need to check the details.

Update:
Yeah it is cached:

curl --user-agent "WikipediaApp/pcs-unittest" -v -o /dev/null https://mobileapps.svc.codfw.wmnet:4102/de.wikipedia.org/v1/page/mobile-html/Nachrichtentechnik -H "cache-control: no-cache" 2>&1 | grep analytics
< x-analytics: pageid=3576;ns=0;

We have a 7-day TTL, so in the worst case the cached responses with an empty x-analytics header will be evicted within a week.
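A small sketch of the TTL arithmetic, assuming a cached object is never refreshed before it expires. The 7-day value comes from the comment above; the deployment timestamp and the function name are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

CACHE_TTL = timedelta(days=7)  # TTL quoted in the thread


def may_still_be_stale(
    deployed_at: datetime, now: datetime, ttl: timedelta = CACHE_TTL
) -> bool:
    """True while a response cached just before the deploy could still be served."""
    return now < deployed_at + ttl


deployed = datetime(2026, 1, 26, 12, 0)  # hypothetical deployment time
print(may_still_be_stale(deployed, datetime(2026, 1, 28, 12, 0)))  # 2 days later
print(may_still_be_stale(deployed, datetime(2026, 2, 3, 12, 0)))   # 8 days later
```

Two days after the hypothetical deploy, a pre-deploy response could still be in cache; eight days after, it could not under this model.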

It looks like we add the headers as expected at the PCS level. Should I go ahead and close the ticket, or just untag the Content Transform team and track the rest of the work with this ticket?

NICE! I'll check again (when I have time) and close the task. You can untag yourselves if you like. I'll add it to the DE sprint; we will track it and re-add you if needed.

It looks like the data is properly set in the x_analytics header. Now we just need to fix the logic that hoists it into the top-level page_id and namespace_id fields in the webrequest and pageview_hourly tables.

Oh, @Jgiannelos. It should be page_id in the X-Analytics header, not pageid.

Sorry we missed that in review! Can we fix?
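To illustrate why the key name matters, here is a minimal Python sketch of parsing an X-Analytics value and hoisting two of its keys into top-level fields. It is not the actual Hadoop-side logic; `parse_x_analytics` and `hoist` are hypothetical helpers, and the assumption that hoisting reads the keys `page_id` and `ns` is inferred from this thread.

```python
from typing import Dict, Optional


def parse_x_analytics(value: str) -> Dict[str, str]:
    """Split a 'k=v;k=v;' string into a dict, skipping parts without '='."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        key, sep, val = part.partition("=")
        if sep and key:
            fields[key] = val
    return fields


def hoist(value: str) -> Dict[str, Optional[int]]:
    """Hoist page_id and ns into top-level fields; missing keys become None."""
    fields = parse_x_analytics(value)

    def to_int(key: str) -> Optional[int]:
        try:
            return int(fields.get(key))  # int(None) raises TypeError
        except (TypeError, ValueError):
            return None

    return {"page_id": to_int("page_id"), "namespace_id": to_int("ns")}


# Key spelled "pageid": ns is picked up, page_id is not.
print(hoist("pageid=82518;ns=0;pageview=1;https=1"))
# Key spelled "page_id": both fields are picked up.
print(hoist("page_id=82518;ns=0;pageview=1;https=1"))
```

With the `pageid` spelling, this sketch yields an empty page_id and a namespace_id of 0, which matches the en.wikipedia.org row in the query output earlier in the thread.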

Change #1237947 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] analytics: Fix typo on pageid

https://gerrit.wikimedia.org/r/1237947

I just sent a patch. That said, we might again end up with stale responses until the cache is evicted.

Change #1237947 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] analytics: Fix typo on page_id

https://gerrit.wikimedia.org/r/1237947

@Jgiannelos Just checked a recent hour of data, and I found no instances of page_id in x_analytics. ns is set properly though.

select 
  uri_host, 
  uri_path, 
  page_id, 
  namespace_id, 
  x_analytics 
from 
  wmf.webrequest 
where 
  is_pageview=true 
  and access_method = 'mobile app'
  and year=2026 and month=2 and day=25 and hour=12 
  and x_analytics LIKE '%page_id%'
limit 100;

-- The query returned no data

Does the service need to be deployed? Or should the merge have been enough?

I think it works!

Both of these queries return results with namespace_id and page_id set!

select 
  uri_host, 
  uri_path, 
  page_id, 
  namespace_id,
  x_analytics
from 
  wmf.webrequest 
where 
  is_pageview=true 
  and access_method = 'mobile app'
  and year=2026 and month=3 and day=29 and hour=12
limit 100;




select
  project,
  page_title,
  page_id,
  namespace_id,
  view_count
from
  wmf.pageview_hourly
where
  access_method = 'mobile app'
  and year=2026 and month=3 and day=29 and hour=12
limit 100;

I do see some with NULL values, which I didn't exactly expect, but I'm willing to chalk it up to...caching? The few that I checked had neither ns nor page_id in the X-Analytics header.

I'm going to be bold and call this done!