Page MenuHomePhabricator

Pageview definition relies on X-Analytics to determine special pages
Open, LowPublicBUG REPORT

Description

If you call PageviewDefinition::isPageview with /wiki/Special:Anything and an empty X-Analytics header, it will return true even though Special:Anything is not in the include list. We should have some other way of filtering out special pages so that sensitive information doesn't leak out.

See https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/refs/heads/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#286

Event Timeline

Change 772496 had a related patch set uploaded (by Milimetric; author: Milimetric):

[analytics/refinery/source@master] [WIP] Failing test shows bug with Special:Pages

https://gerrit.wikimedia.org/r/772496

Investigation reveals that special is not set in the x analytics header when the user is logged out. Proof:

presto:wmf>  select count(1)
         ->    from webrequest
         ->   where year=2022 and month=3 and day=23 and hour=10
         ->     and http_status in ('200', '304')
         ->     and normalized_host.project = 'en'
         ->     and uri_path like '%Special:Watchlist%'
         ->     and element_at(x_analytics_map, 'loggedIn') is not null
         ->     and x_analytics_map['loggedIn'] = '1'
         ->     and is_pageview
         -> ;
 _col0 
-------
     0 
(1 row)

presto:wmf> 
presto:wmf>  select count(1)
         ->    from webrequest
         ->   where year=2022 and month=3 and day=23 and hour=10
         ->     and http_status in ('200', '304')
         ->     and normalized_host.project = 'en'
         ->     and uri_path like '%Special:Watchlist%'
         ->     and element_at(x_analytics_map, 'loggedIn') is null
         ->     and not is_pageview
         -> ;
 _col0 
-------
     0 
(1 row)

(Special:Watchlist returns an http status 302 for logged-out users which explains that specific query result. Can you try the same thing but with Special:Contributions? I assume the missing X-Analytics.special bug was noticed for some other page, at least--maybe Special:ConfirmEmail?)

The special property is set in WikimediaEventsHooks::onXAnalyticsSetHeader. (1) Following one fragile thread of that logic, SpecialPageFactory::resolveAlias (2) can potentially return a null page name, failing the guard condition to set special, when the page title doesn't appear as a key in getAliasList (3). At that point I lose the scent but there's something particularly fishy about Special:ConfirmEmail optionally appearing in getPageList (4), depending on which page is causing trouble I would look more here.

But the bigger issue is that without special, the pageview is recorded in analytics incorrectly when this fragile logic fails. It would be easy enough to remove the guard condition entirely. What is this guard trying to do in the first place? It seems to be a duplication of the same idea as the SPECIAL_PAGES_ACCEPTED allow list in PageviewDefinition.java, but with the non-robust twist of failing to reject not-special-enough Special pages. Removing seems like the right fix to this drive-by reviewer. Update: we do want the canonical special page name, plus a way of tagging "unknown" specials.

(Special:Watchlist returns an http status 302 for logged-out users which explains that specific query result. Can you try the same thing but with Special:Contributions? I assume the missing X-Analytics.special bug was noticed for some other page, at least--maybe Special:ConfirmEmail?)

Ah! Very interesting, thank you. Running the query confirms your guess. But indeed, Special:Contributions doesn't have this problem while amusingly ConfirmEmail has one single stray request.

The special property is set in WikimediaEventsHooks::onXAnalyticsSetHeader. (1) Following one fragile thread of that logic, SpecialPageFactory::resolveAlias (2) can potentially return a null page name, failing the guard condition to set special, when the page title doesn't appear as a key in getAliasList (3). At that point I lose the scent but there's something particularly fishy about Special:ConfirmEmail optionally appearing in getPageList (4), depending on which page is causing trouble I would look more here.

I was looking around there too, I asked the other Adam: T138500#7801956

But the bigger issue is that without special, the pageview is recorded in analytics incorrectly when this fragile logic fails. It would be easy enough to remove the guard condition entirely. What is this guard trying to do in the first place? It seems to be a duplication of the same idea as the SPECIAL_PAGES_ACCEPTED allow list in PageviewDefinition.java, but with the non-robust twist of failing to reject not-special-enough Special pages. Removing seems like the right fix to this drive-by reviewer.

yeah, I agree, special=unknown is still useful in at least pointing out this problem downstream. Omitting it entirely is problematic. Ok, so at least we have one pretty solid fix, I'll wait for @Addshore to weigh in as well.

yeah, I agree, special=unknown is still useful in at least pointing out this problem downstream. Omitting it entirely is problematic. Ok, so at least we have one pretty solid fix, I'll wait for @Addshore to weigh in as well.

Ah good point, that the guard condition can't quite be removed, but it becomes a switch between setting special to the canonical English special page title (I see now this is what resolveAlias is for) and setting it to a magic constant like "unknown". In other words,

	if ( $title->isSpecialPage() ) {
		list( $name, /* $subpage */ ) = MediaWikiServices::getInstance()->getSpecialPageFactory()
			->resolveAlias( $title->getDBkey() );

		$headerItems['special'] = $name ?? 'unknown';
	}

Change 773629 had a related patch set uploaded (by Milimetric; author: Milimetric):

[mediawiki/extensions/WikimediaEvents@master] Set special=unknown on X-Analytics

https://gerrit.wikimedia.org/r/773629

Change 773629 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Set special=unknown on X-Analytics

https://gerrit.wikimedia.org/r/773629

I'm trying to figure out if this was deployed and could use a little help. So I see:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/+log/refs/heads/master

Which shows that the patch was merged before wmf/1.39.0-wmf.5 was released on Monday, March 28th. And then I see https://en.wikipedia.org/wiki/Special:Version which says en.wiki is running core at 1.39.0-wmf.6 and WikimediaEvents at 1.2.0 (03a42d6), which is also after this patch.

Based on that, it would seem this is deployed. But I still see plenty of requests where special is not set. So I'm guessing there's some kind of caching, which makes sense because these requests are coming from anon users (for logged-in users, the name of the special page resolves and this is not an issue).

Am I right to think this is cached? And if so, how often are the caches invalidated for special pages?

Are ns and page_id also set? If not, then maybe we're missing the new code because MW_API is true. I'm now getting suspicious about this text from the task description: "an empty X-Analytics header"--the entire header is missing?

Which special page URLs are affected by this bug? I'm asking so that we can look at the cache headers of specific handlers...

Indeed, namespace_id and page_id are not set. The data I'm looking at is like:

uri_path: /wiki/Special:Watchlist
x_analytics_map: {client_port=XXXX, WMF-Last-Access-Global=XXXX, WMF-Last-Access=XXXX, https=1}

In the task description, I just meant that's how I found the bug. As in, if you just pass an empty X-Analytics header, the pageview definition doesn't know it's a special page and so it marks it as a pageview. That's just a limitation of the pageview definition, but if we start looking for pages titles like %Special% or similar, we add a costly regex to every webrequest, so it's a last resort.

I queried to see what other pages show up, and I see Watchlist, MyContributions, MyTalk, MyPage, all without page_id or namespace_id.

So when you say maybe MW_API is true, I see how that would affect the setting of special page, but wouldn't uri_path be like api.php?..? Or is there a way that MW_API can be set to true for a normal request to /wiki/Special:Watchlist?

[…] if we start looking for pages titles like %Special% or similar, we add a costly regex […]

... and only mitigate the issue for English-language wikis!

I think the analytics/refinery code is correct here. If a page title looks like it might be a special page in some language, but MW doesn't respond with the special= X-Analytics header, then it is in all likelihood a page view. E.g. "Speciaal: The Dutch Specialty (fictional opera)" would be an article on any wiki, but on Dutch Wikipedia it would be an unknown special page with the corresponding article instead using a title without the colon for technical reasons.

Or is there a way that MW_API can be set to true for a normal request to /wiki/Special:Watchlist?

No (codesearch).

Indeed, namespace_id and page_id are not set. The data I'm looking at is like:

uri_path: /wiki/Special:Watchlist

(Special:Watchlist returns an http status 302 for logged-out users which explains that specific query result. Can you try […])

Here's a few kafkacat results that may be interesting.

$ kafkacat ... webrequest_text ... | fgrep 'Special:Watchlist' | fgrep 'special=' | head
{"uri_host":"en.wikipedia.org",
 "http_method":"GET", "uri_path":"/wiki/Special:Watchlist",
 "uri_query":"",
 ..
 "http_status":"200",
 "content_type":"text/html; charset=UTF-8",
 "x_analytics":"ns=-1;special=Watchlist;loggedIn=1; ...",
 ..
}

So, it is sometimes (and possibly most times) working as expected.

{"uri_host":"en.wikipedia.org",
 "http_method":"GET","uri_path":"/w/index.php",
 "uri_query":"?hideWikibase=1&limit=1000&title=Special:Watchlist&urlversion=2&..",
 ..
 "http_status":"200",
 "content_type":"text/html; charset=UTF-8",
 "x_analytics":"ns=-1;special=Watchlist;loggedIn=1; ..",
 ..
}

... even when there are query parameters.

So what about the ones that are lacking the presumed-expected X-Analytics key?

$ ... |  fgrep '"/wiki/Special:Watchlist' | fgrep -v 'special=' | head
Watchlist action render chunk (204)
{"uri_host":"en.wikipedia.org",
 "http_method":"GET",,"uri_path":"/wiki/Special:Watchlist",
 "uri_query":"?hideWikibase=1&limit=250&urlversion=2&peek=1&from=20220415150605&action=render",
 ..
 "http_status":"204",
 "response_size":0,
 "content_type":"-",
 "x_analytics":"WMF-Last-Access= ..; https=1; client_port=..,"
 ..
}

This is an interesting one. This is custom response via action=render, which does not produce a web page with skin layout and OutputPage, and thus does not have the XAnalytics extension call its generateHeader() function, and thus none of the other hooks run. From MW's perspective, these are as unrelated to page views as e.g. a request to /w/rest.php or /static/foo.html, except we accomodate them through an unusually special "Special" page.

For Special:Watchlist, the action=render route is used by the real-time RCFilters feature in JavaScript. It produces either an empty HTTP 204 response, or a chunk of HTML under HTTP 200:

Watchlist action render chunk (200)
{"http_status":"200",
 "response_size":6599,
 "http_method":"GET",
 "uri_host":"www.wikidata.org",
 "uri_path":"/wiki/Special:Watchlist","uri_query":"?urlversion=2&action=render&enhanced=0",
 "content_type":"text/html; charset=UTF-8",
 ..,
 "x_analytics":"WMF-Last-Access=..; https=1; ..",
 ..
}

The same feature exists on Special:RecentChanges as well.

This is somewhat analogous to the following:

  • Special:Export/Example, this responds with an XML download. Typically this is not linked directly and instead a form is submitted from Special:Export, but this works over GET with a subpage format as well.
  • /wiki/Example?action=render, this is an API to get the raw parser output without the skin, essentially the same as /w/api.php?action=parse&page=Example but without the JSON capsule.
  • /wiki/Example?action=raw&ctype=text/css, this is an API to get the raw wikitext/CSS/JS/JSON source of a page. It exists mainly for importing of user scripts and styles. Ref T279120 for some related tech debt.

I'm mentioning these as these are likely to similarly confuse the analytics/refinery pipeline.

My suggestion would be to build on setup you already have, and go all-in on expecting XAnalytics to communicate what a given response represents. In particular to treat any response without this payload as irrelevant to pageviews, the same as e.g. any request to /w/rest.php, /static/foo.png.

It crossed my mind to wonder why we the refinery is still even looking for pageview candidates among requests that lack any X-Analytics data describing it as a page (whether wiki page or special page). My guess is:

  1. For historical reasons, because we used to not have the XAnalytics extension and instead relied solely on inspecting the URL for the bulk of the traffic stats, and even if it's only a fallback today, changing this may be scary, and understandly seen as not worth doing until now as there was no benefit (now we have fixing this bug as a benefit).
  2. The XAnalytics extension currently provides the page ID of existant wiki pages, and the canonical name of special pages. There are currently a non-zero webrequests that satisfy the overall criteria in analytics/refinery for pageviews yet don't have X-Analytics response header.

The latter is definitely true today. But what I don't know is whether there are any true positives in there, or whether we know today already that it is by definition not a pageview. For example, are we (intentionally) counting pageviews for requests to non-existant wiki page? (e.g. visits to 404 redlinked articles). If we don't currently filter these out by some other criteria already, and if we want to keep that, then before we could hard-require XAnalytics in the pageview definition, we'd need the extension to also leave a trace on those (e.g. it could set pageId=0, and/or title=..).

The result would be that anything passing through MediaWiki core where the response may route through the /wiki/ path, but isn't a user-facing skinned OutputPage response, would naturally be discarded.

Today, at our grooming meeting, we were trying to evaluate whether this task was a "must do", a "should do", or a "could do".

@JArguello-WMF
After reading all comments carefully, it seems to me that we have found a couple imprecisions of the PageviewDefinition,
and identified a couple potential solutions. What would remain to be done IMO is to try out these solutions by:

  1. Changing the PageviewDefinition to include the proposed improvements.
  2. Executing the new code against an existing period of data and vetting the results to confirm that the fix covers the issue and does not affect the pageviews not related to this issue.
  3. If vetting goes well, deploy the solution.
  4. If the impact on data is significant, we should let stakeholders know.

I executed a quick query to proxy the proportion of pageviews affected by this issue:

presto> select count(1)
     ->   from webrequest
     ->  where year=2022 and month=3 and day=23 and hour=10
     ->    and http_status in ('200', '304')
     ->    and normalized_host.project = 'en'
     ->    and uri_path like '%Special:%'
     ->    and element_at(x_analytics_map, 'loggedIn') is null
     ->    and is_pageview;
 _col0 
-------
 23367 
(1 row)

VS.

presto>  select count(1)
     ->    from webrequest
     ->   where year=2022 and month=3 and day=23 and hour=10
     ->     and http_status in ('200', '304')
     ->     and normalized_host.project = 'en'
     ->     and is_pageview;
  _col0   
----------
 11587655 
(1 row)

This would indicate the affected pageviews are around 0.2% (in English wiki). Not huge, but also not too small.
I lean towards calling this task a "should do", please disagree!

NOTE: This would be a simple change in code, but a big vetting and testing effort: The PageviewDefinition is responsible for generating the wmf.pageview_hourly table, which is the source data for many of our data sets. Changing the PageviewDefinition might affect lots of pipelines and must be tested thoroughly. On the other hand, we have been talking about computing pageviews using the Event Platform for a while. Maybe this issue is a sign that we should go that way, instead of refining the PageviewDefinition?

[…]
I executed a quick query to proxy the proportion of pageviews affected by this issue:

presto> select count(1)
     ->   from webrequest
     ->  where year=2022 and month=3 and day=23 and hour=10
     ->    and http_status in ('200', '304')
     ->    and normalized_host.project = 'en'
     ->    and uri_path like '%Special:%'
     ->    and element_at(x_analytics_map, 'loggedIn') is null
     ->    and is_pageview;
 _col0 
-------
 23367

[…] This would indicate the affected pageviews are around 0.2% (in English wiki). Not huge, but also not too small.

I believe this query might reflect the upperbound of how many views would no longer be counted as such. However, I believe there is one notable addition and one substraction of which I don't know how much they would affect this number. The addition is the various /w/index.php paths involving non-view actions that may satisfy the current pageview conditions (e.g. action=render, which might respond with text/html and have a title, but have no x-analytics header). The substraction is things like SPECIAL_PAGES_ACCEPTED which includes "search", "recentchanges" and "version" which I imagine represent a sizable portion of special-page traffic that I guess we'd continue to include in some form or another.

I admittely do not fully understand how the x_analytics_map value is populated and evaluated, perhaps this is approximating the above in some way. If so, happy to learn :)

Thanks @Krinkle, that is very informative, will help when implementing the fix.

I executed a quick query to proxy the proportion of pageviews affected by this issue:

...

This would indicate the affected pageviews are around 0.2% (in English wiki). Not huge, but also not too small.

Minor suggestion: the affected pageviews should be evaluated as a proportion of Special page visits IMHO, which I assume will make the problem look much more significant.

@Milimetric: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!