Description

If you call PageviewDefinition::isPageview with /wiki/Special:Anything and an empty X-Analytics header, it will return true even though Special:Anything is not in the include list. We should have some other way of filtering out special pages so that sensitive information doesn't leak out.

Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T366004 Add page-title to the x_analytics header
Open | BUG REPORT | None | T304362 Pageview definition relies on X-Analytics to determine special pages
Event Timeline
Change 772496 had a related patch set uploaded (by Milimetric; author: Milimetric):
[analytics/refinery/source@master] [WIP] Failing test shows bug with Special:Pages
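For context, here is a minimal sketch of the failing case that patch is getting at. The isPageview() call shape below is an assumption based on how the call is described in this task; the exact signature and argument order may differ between refinery-source versions.

import org.wikimedia.analytics.refinery.core.PageviewDefinition;

// Sketch only, not the actual patch: the argument list is assumed, not verified.
public class SpecialPageBugSketch {
    public static void main(String[] args) {
        PageviewDefinition def = PageviewDefinition.getInstance();

        boolean result = def.isPageview(
            "en.wikipedia.org",        // uri_host
            "/wiki/Special:Anything",  // uri_path
            "",                        // uri_query
            "200",                     // http_status
            "text/html",               // content_type
            "Mozilla/5.0 (test)",      // user_agent
            ""                         // X-Analytics header, empty
        );

        // Expected: false, since Special:Anything is not an accepted special page.
        // Observed (this bug): true, because with no X-Analytics data the
        // definition cannot tell that the request is for a special page at all.
        System.out.println("isPageview = " + result);
    }
}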
Investigation reveals that special is not set in the X-Analytics header when the user is logged out. Proof:
presto:wmf> select count(1)
         -> from webrequest
         -> where year=2022 and month=3 and day=23 and hour=10
         -> and http_status in ('200', '304')
         -> and normalized_host.project = 'en'
         -> and uri_path like '%Special:Watchlist%'
         -> and element_at(x_analytics_map, 'loggedIn') is not null
         -> and x_analytics_map['loggedIn'] = '1'
         -> and is_pageview
         -> ;
 _col0
-------
     0
(1 row)

presto:wmf> select count(1)
         -> from webrequest
         -> where year=2022 and month=3 and day=23 and hour=10
         -> and http_status in ('200', '304')
         -> and normalized_host.project = 'en'
         -> and uri_path like '%Special:Watchlist%'
         -> and element_at(x_analytics_map, 'loggedIn') is null
         -> and not is_pageview
         -> ;
 _col0
-------
     0
(1 row)
(Special:Watchlist returns an HTTP status 302 for logged-out users, which explains that specific query result. Can you try the same thing but with Special:Contributions? I assume the missing X-Analytics.special bug was noticed for some other page, at least--maybe Special:ConfirmEmail?)
The special property is set in WikimediaEventsHooks::onXAnalyticsSetHeader. (1) Following one fragile thread of that logic, SpecialPageFactory::resolveAlias (2) can potentially return a null page name, failing the guard condition to set special, when the page title doesn't appear as a key in getAliasList (3). At that point I lose the scent, but there's something particularly fishy about Special:ConfirmEmail optionally appearing in getPageList (4); depending on which page is causing trouble, I would look more here.
But the bigger issue is that without special, the pageview is recorded in analytics incorrectly when this fragile logic fails. It would be easy enough to remove the guard condition entirely. What is this guard trying to do in the first place? It seems to be a duplication of the same idea as the SPECIAL_PAGES_ACCEPTED allow list in PageviewDefinition.java, but with the non-robust twist of failing to reject not-special-enough Special pages. Removing it seems like the right fix to this drive-by reviewer. Update: we do want the canonical special page name, plus a way of tagging "unknown" specials.
Ah! Very interesting, thank you. Running the query confirms your guess. But indeed, Special:Contributions doesn't have this problem, while amusingly ConfirmEmail has a single stray request.
> The special property is set in WikimediaEventsHooks::onXAnalyticsSetHeader. (1) Following one fragile thread of that logic, SpecialPageFactory::resolveAlias (2) can potentially return a null page name, failing the guard condition to set special, when the page title doesn't appear as a key in getAliasList (3). At that point I lose the scent, but there's something particularly fishy about Special:ConfirmEmail optionally appearing in getPageList (4); depending on which page is causing trouble, I would look more here.
I was looking around there too; I asked the other Adam: T138500#7801956
> But the bigger issue is that without special, the pageview is recorded in analytics incorrectly when this fragile logic fails. It would be easy enough to remove the guard condition entirely. What is this guard trying to do in the first place? It seems to be a duplication of the same idea as the SPECIAL_PAGES_ACCEPTED allow list in PageviewDefinition.java, but with the non-robust twist of failing to reject not-special-enough Special pages. Removing it seems like the right fix to this drive-by reviewer.
Yeah, I agree: special=unknown is still useful in at least pointing out this problem downstream. Omitting it entirely is problematic. OK, so at least we have one pretty solid fix; I'll wait for @Addshore to weigh in as well.
Ah, good point: the guard condition can't quite be removed, but it becomes a switch between setting special to the canonical English special page title (I see now this is what resolveAlias is for) and setting it to a magic constant like "unknown". In other words:
if ( $title->isSpecialPage() ) {
	list( $name, /* $subpage */ ) = MediaWikiServices::getInstance()->getSpecialPageFactory()
		->resolveAlias( $title->getDBkey() );
	$headerItems['special'] = $name ?? 'unknown';
}
Change 773629 had a related patch set uploaded (by Milimetric; author: Milimetric):
[mediawiki/extensions/WikimediaEvents@master] Set special=unknown on X-Analytics
Change 773629 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Set special=unknown on X-Analytics
I'm trying to figure out if this was deployed and could use a little help. So I see:
Which shows that the patch was merged before wmf/1.39.0-wmf.5 was released on Monday, March 28th. And then I see https://en.wikipedia.org/wiki/Special:Version which says en.wiki is running core at 1.39.0-wmf.6 and WikimediaEvents at 1.2.0 (03a42d6), which is also after this patch.
Based on that, it would seem this is deployed. But I still see plenty of requests where special is not set. So I'm guessing there's some kind of caching, which makes sense because these requests are coming from anon users (for logged-in users, the name of the special page resolves and this is not an issue).
Am I right to think this is cached? And if so, how often are the caches invalidated for special pages?
Are ns and page_id also set? If not, then maybe we're missing the new code because MW_API is true. I'm now getting suspicious about this text from the task description: "an empty X-Analytics header"--the entire header is missing?
Which special page URLs are affected by this bug? I'm asking so that we can look at the cache headers of specific handlers...
Indeed, namespace_id and page_id are not set. The data I'm looking at is like:
uri_path: /wiki/Special:Watchlist
x_analytics_map: {client_port=XXXX, WMF-Last-Access-Global=XXXX, WMF-Last-Access=XXXX, https=1}
In the task description, I just meant that's how I found the bug. As in, if you just pass an empty X-Analytics header, the pageview definition doesn't know it's a special page and so it marks it as a pageview. That's just a limitation of the pageview definition, but if we start looking for page titles like %Special% or similar, we add a costly regex to every webrequest, so it's a last resort.
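To make that cost concrete, a title-pattern fallback of the kind hinted at above might look roughly like this. Purely illustrative: refinery does not do this today, and the alias list shown is a tiny, hypothetical subset.

import java.util.regex.Pattern;

// Hypothetical fallback: guess "special page" from the URL alone. It would have to
// run against every webrequest, and it would need every localized alias of the
// Special: namespace to work on non-English wikis (only a few are shown here).
// It would also misclassify ordinary articles whose titles merely start with such
// a prefix.
public class SpecialTitleGuess {
    private static final Pattern SPECIAL_NS = Pattern.compile(
        "^/wiki/(Special|Speciaal|Spezial|Especial|Sp%C3%A9cial):.+");

    public static boolean looksLikeSpecialPage(String uriPath) {
        return SPECIAL_NS.matcher(uriPath).matches();
    }
}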
I queried to see what other pages show up, and I see Watchlist, MyContributions, MyTalk, MyPage, all without page_id or namespace_id.
So when you say maybe MW_API is true, I see how that would affect the setting of the special property, but wouldn't uri_path be something like api.php?..? Or is there a way that MW_API can be set to true for a normal request to /wiki/Special:Watchlist?
... and only mitigate the issue for English-language wikis!
I think the analytics/refinery code is correct here. If a page title looks like it might be a special page in some language, but MW doesn't respond with the special= X-Analytics header, then it is in all likelihood a page view. E.g. "Speciaal: The Dutch Specialty (fictional opera)" would be an article on any wiki, but on Dutch Wikipedia it would be an unknown special page with the corresponding article instead using a title without the colon for technical reasons.
> Or is there a way that MW_API can be set to true for a normal request to /wiki/Special:Watchlist?
No (codesearch).
Here are a few kafkacat results that may be interesting.
$ kafkacat ... webrequest_text ... | fgrep 'Special:Watchlist' | fgrep 'special=' | head
{"uri_host":"en.wikipedia.org", "http_method":"GET", "uri_path":"/wiki/Special:Watchlist", "uri_query":"", .. "http_status":"200", "content_type":"text/html; charset=UTF-8", "x_analytics":"ns=-1;special=Watchlist;loggedIn=1; ...", .. }
So, it is sometimes (and possibly most times) working as expected.
{"uri_host":"en.wikipedia.org", "http_method":"GET","uri_path":"/w/index.php", "uri_query":"?hideWikibase=1&limit=1000&title=Special:Watchlist&urlversion=2&..", .. "http_status":"200", "content_type":"text/html; charset=UTF-8", "x_analytics":"ns=-1;special=Watchlist;loggedIn=1; ..", .. }
... even when there are query parameters.
So what about the ones that are lacking the presumed-expected X-Analytics key?
$ ... | fgrep '"/wiki/Special:Watchlist' | fgrep -v 'special=' | head
{"uri_host":"en.wikipedia.org", "http_method":"GET",,"uri_path":"/wiki/Special:Watchlist", "uri_query":"?hideWikibase=1&limit=250&urlversion=2&peek=1&from=20220415150605&action=render", .. "http_status":"204", "response_size":0, "content_type":"-", "x_analytics":"WMF-Last-Access= ..; https=1; client_port=..," .. }
This is an interesting one. This is a custom response via action=render, which does not produce a web page with skin layout and OutputPage, and thus does not have the XAnalytics extension call its generateHeader() function, and thus none of the other hooks run. From MW's perspective, these are as unrelated to page views as e.g. a request to /w/rest.php or /static/foo.html, except we accommodate them through an unusually special "Special" page.
For Special:Watchlist, the action=render route is used by the real-time RCFilters feature in JavaScript. It produces either an empty HTTP 204 response, or a chunk of HTML under HTTP 200:
{"http_status":"200", "response_size":6599, "http_method":"GET", "uri_host":"www.wikidata.org", "uri_path":"/wiki/Special:Watchlist","uri_query":"?urlversion=2&action=render&enhanced=0", "content_type":"text/html; charset=UTF-8", .., "x_analytics":"WMF-Last-Access=..; https=1; ..", .. }
The same feature exists on Special:RecentChanges as well.
This is somewhat analogous to the following:
- Special:Export/Example, this responds with an XML download. Typically this is not linked directly and instead a form is submitted from Special:Export, but this works over GET with a subpage format as well.
- /wiki/Example?action=render, this is an API to get the raw parser output without the skin, essentially the same as /w/api.php?action=parse&page=Example but without the JSON capsule.
- /wiki/Example?action=raw&ctype=text/css, this is an API to get the raw wikitext/CSS/JS/JSON source of a page. It exists mainly for importing of user scripts and styles. Ref T279120 for some related tech debt.
I'm mentioning these as these are likely to similarly confuse the analytics/refinery pipeline.
My suggestion would be to build on the setup you already have, and go all-in on expecting XAnalytics to communicate what a given response represents. In particular, to treat any response without this payload as irrelevant to pageviews, the same as e.g. any request to /w/rest.php or /static/foo.png.
It crossed my mind to wonder why the refinery is still even looking for pageview candidates among requests that lack any X-Analytics data describing them as a page (whether wiki page or special page). My guess is:
- For historical reasons, because we used to not have the XAnalytics extension and instead relied solely on inspecting the URL for the bulk of the traffic stats; even if it's only a fallback today, changing this may be scary, and understandably seen as not worth doing until now as there was no benefit (now we have fixing this bug as a benefit).
- The XAnalytics extension currently provides the page ID of existing wiki pages, and the canonical name of special pages. There are currently a non-zero number of webrequests that satisfy the overall criteria in analytics/refinery for pageviews yet don't have an X-Analytics response header.
The latter is definitely true today. But what I don't know is whether there are any true positives in there, or whether we know today already that it is by definition not a pageview. For example, are we (intentionally) counting pageviews for requests to non-existent wiki pages (e.g. visits to 404 redlinked articles)? If we don't currently filter these out by some other criteria already, and if we want to keep that, then before we could hard-require XAnalytics in the pageview definition, we'd need the extension to also leave a trace on those (e.g. it could set pageId=0, and/or title=..).
The result would be that anything passing through MediaWiki core where the response may route through the /wiki/ path, but isn't a user-facing skinned OutputPage response, would naturally be discarded.
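As a rough sketch of that direction: a hypothetical pre-filter, not current refinery code, with key names mirroring what X-Analytics is described as carrying in this thread.

import java.util.Map;

// Hypothetical pre-filter: only requests whose X-Analytics payload identifies the
// response as a page (a wiki page id, or a canonical special-page name) remain
// pageview candidates. action=render responses, rest.php, static assets and other
// non-OutputPage responses carry neither key and are discarded up front.
public class XAnalyticsGate {
    public static boolean isPageviewCandidate(Map<String, String> xAnalyticsMap) {
        if (xAnalyticsMap == null || xAnalyticsMap.isEmpty()) {
            return false;
        }
        // "page_id" covers ordinary wiki pages (possibly page_id=0 for missing
        // pages, if the extension were extended as discussed above); "special"
        // covers special pages, including the proposed special=unknown fallback.
        return xAnalyticsMap.containsKey("page_id")
            || xAnalyticsMap.containsKey("special");
    }
}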
Today, at our grooming meeting, we were trying to evaluate whether this task was a "must do", a "should do", or a "could do".
@JArguello-WMF
After reading all comments carefully, it seems to me that we have found a couple of imprecisions in the PageviewDefinition,
and identified a couple of potential solutions. What would remain to be done IMO is to try out these solutions by:
- Changing the PageviewDefinition to include the proposed improvements.
- Executing the new code against an existing period of data and vetting the results to confirm that the fix covers the issue and does not affect the pageviews not related to this issue.
- If vetting goes well, deploy the solution.
- If the impact on data is significant, we should let stakeholders know.
I executed a quick query to proxy the proportion of pageviews affected by this issue:
presto> select count(1)
     -> from webrequest
     -> where year=2022 and month=3 and day=23 and hour=10
     -> and http_status in ('200', '304')
     -> and normalized_host.project = 'en'
     -> and uri_path like '%Special:%'
     -> and element_at(x_analytics_map, 'loggedIn') is null
     -> and is_pageview;
 _col0
-------
 23367
(1 row)

VS.

presto> select count(1)
     -> from webrequest
     -> where year=2022 and month=3 and day=23 and hour=10
     -> and http_status in ('200', '304')
     -> and normalized_host.project = 'en'
     -> and is_pageview;
  _col0
----------
 11587655
(1 row)
This would indicate the affected pageviews are around 0.2% (in English wiki). Not huge, but also not too small.
I lean towards calling this task a "should do", please disagree!
I believe this query might reflect the upper bound of how many views would no longer be counted as such. However, I believe there is one notable addition and one subtraction, and I don't know how much they would affect this number. The addition is the various /w/index.php paths involving non-view actions that may satisfy the current pageview conditions (e.g. action=render, which might respond with text/html and have a title, but have no x-analytics header). The subtraction is things like SPECIAL_PAGES_ACCEPTED, which includes "search", "recentchanges" and "version", which I imagine represent a sizable portion of special-page traffic that I guess we'd continue to include in some form or another.
I admittedly do not fully understand how the x_analytics_map value is populated and evaluated; perhaps this is approximating the above in some way. If so, happy to learn :)
> ...
> This would indicate the affected pageviews are around 0.2% (in English wiki). Not huge, but also not too small.
Minor suggestion: the affected pageviews should be evaluated as a proportion of Special page visits IMHO, which I assume will make the problem look much more significant.
@Milimetric: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!