Page MenuHomePhabricator

Spike: investigate incorrect page_id values in pageview_hourly
Closed, InvalidPublic

Description

This query gives surprising results:

select count(DISTINCT page_title)
   from pageview_hourly
  where year = 2025 and month = 9 and day = 30 and hour = 23
    and page_id = 12412140

------
1336

In this dataset, the same page_id can have different titles if these titles redirect to this page_id. But in this case looking at the distinct titles shows completely unrelated pages. And this is only true this hour, in other hours this page_id only has a handful of distinct titles, usually redirects.

This value is set here by the WikimediaEvents extension.

The way that hook is run seems to say it should be reliable.

Something's wrong somewhere, the same X-Analytics header is being sent in response to different requests somehow. Can this be cached / accidentally sent somehow?

Slack thread context

Event Timeline

For this hour, I looked at the data and found some low variability in country and access method:

select DISTINCT country, access_method, referer_name
  from pageview_hourly
 where year = 2025 and month = 9 and day = 30 and hour = 23
   and page_id = 12412140

(only two cases - US/desktop/none and NL/mobile web/Google)

In webrequest, this gets even more specific:

select hostname, /* (careful including this) client_ip, */ uri_path, count(*)
  from wmf.webrequest
 where year = 2025 and month = 9 and day = 30 and hour = 23
   and map_key_exists(x_analytics_map, 'page_id')
   and x_analytics_map['page_id'] = '12412140'
 group by hostname, client_ip, uri_path
hostnameclient_ipuri_path_col3
cp2033.codfw.wmnet(one ip)/wiki/Rhacophorus_reinwardtii1
cp3071.esams.wmnet(different single ip)/w/index.php1341

Looking at webrequest in more detail for that specific second IP, all these requests appear to be diffs requested every few seconds during that hour. This actor signature was flagged as automata. The bug still exists, but if someone is looking at the pageview data, excluding 'automata' and 'spider' should yield cleaner data

BPirkle triaged this task as Medium priority.Oct 30 2025, 4:03 PM
BPirkle moved this task from Incoming (Needs Triage) to Backlog on the MW-Interfaces-Team board.
BPirkle subscribed.

I'm unclear on the impact of this bug. If it is more impactful than a "Medium" priority, please bump that up.

For T405039: Global Editor Metrics - Data Pipeline, we are using pageview_hourly to compute editor impact metrics. We wanted to include the page_title in the output dataset, to make the metrics more useable. Since the same page_id is associated with many page_titles, this won't be possible.

But, included page_title in the metric output is not really a requirement, so we can drop it. This is still a bug, but I believe it is not a blocker for T405039 after all. IIUC we can sum view_count over project, page_id and the metric will work.

(Except that it won't count views from mobile apps ;) )

It seems there is more to look into here, but I wrote up the implications for Global Editor Metrics here: at T405039#11329322.

Milimetric closed this task as Invalid.EditedOct 30 2025, 10:34 PM

Useful context from IRC follows below. Basically Timo points out this is not a bug. I had two misunderstandings and there is some unexpected behavior going on here. I'll detail these below and make notes on the pageview definition page about this for when we maybe iterate on it.

  1. Diffs are pageviews, which I vaguely remembered, but it's possible to diff a revision from one page to *another completely different* page! The way these show up makes sense and means there are no "bugs" but unexpected behavior:
    • for example let's look at index.php?diff=1286192464&oldid=1124654206&title=The_Graduate
    • in webrequest, this is parsed as a pageview and we take the title parameter as the page_title, in this case The_Graduate
    • but MW assigns page_id to the page that the revision in the diff= parameter is referring to, in this case this cool frog
  2. redirects are not flagged in our pageview data. We have an is_redirect_to_pageview field, but it misses the most common cases when MW does client-side redirects. These redirects were showing as pageviews to pages with different titles but the same page_id and confusing the queries above.

I am guessing that most of the abuse of these weird cross-page diffs is going to get flagged as automata by our heuristics. As Timo points out below, there are valid use cases. But I personally feel like these shouldn't be pageviews. I've added a note to the pageview definition here.

I also spot checked a lot of the data and didn't find any other kinds of many page_titles to one page_id besides redirects and diff pages. To confirm conclusively, we'd have to join pageview_hourly to something like mediawiki_history to exclude redirects and pageview_actor to pageview_hourly to exclude diffs. Then check if there are any other cases.

         <+Krinkle> milimetric: you mention om task they're diffs, that means they have a uri_query field not shown here, right?
13:06:22 <+Krinkle> The oldid parameter has precedence over thd title parameter
13:07:20 <+Krinkle> w/index.php?title=Banana&oldid=1 will show rev 1, idem for diff
13:07:52 <+Krinkle> The title is most likely what it was at the time but could also be fake
13:08:32 <+Krinkle> If they're diffs, maybe they shouldn't count as pageviews?
13:09:02 <+Krinkle> Or maybe it should emit title from MW instead of the url
13:09:27 <+Krinkle> ie add to x-analytics
13:10:56 <+Krinkle> I don't see a bug here (yet) but the data from different sources may be combined in a confusing/deceptive way
13:45:08 <milimetric> Krinkle - woa, ok, TIL a few things, thank you, will be back with more data
15:15:02 <milimetric> Krinkle: ok, I have returned with more data.  The theory I checked was: "in pageview_hourly, when a single page_id has many distinct page_titles, there are two explanations: diffs to other pages and client-side redirects that we see as 200s in webrequest"
15:15:40 <milimetric> I spot checked lots of data and that claim held.  Most of the "diff" pageviews were done by automata, so it's easy to filter them out
15:16:46 <milimetric> The only way to 100% verify the theory is to join to the page table by title, exclude redirect pages, and see if there's any remaining cases.  But I'm satisfied for now, it looks like there's no bug
15:18:23 <milimetric> well, maybe there's a bug, so here's a question for you: shouldn't MW only allow folks with elevated privileges to do these weird cross-page diffs?  It seems like the use cases for this would mostly coincide with folks who can move pages and have other rights.  Restricting it like that would protect us from garbage requests like we saw here, no?
15:18:57 <milimetric> (I'm following up with research to maybe change the pageview definition to exclude these diffs as well)
15:40:04 <+Krinkle> milimetric: they are not cross page diffs. We try make all query strings with a title for clarity. But on diffs and old revs, the title parameter is usually unused, it serves as fallback if the rev is hidden/deleted.
15:41:19 <+Krinkle> Banana, view history, diff or old: title=Banana&oldid=1. Then rename banana, the old url still works but will show current title. The param is not used, if the rev is valid.
16:04:08 <+Krinkle> It's a bit like slugs in a blog like example.org/123-my-headline/ where the text in the slug isn't strictly used, eg may be cut off, or malformed, or have renamed since, and therefor could be faked as 123-totally-not-a-rickroll
16:05:08 <+Krinkle> Perhaps if we can spare an extra 255 bytes, we can have MW emit page_title alongside page_id in the X header and prefer that when present.
16:05:26 <+Krinkle> But also diffs shouldn't be counted as pageviews maybe?
16:21:38 <+Krinkle> Note that oldids are not diffs without diff parameter, then they are a pageview for a previous revision which is a bit more common and presents the same ambiguity. So we may want to fix it from the MW side regardless.
18:08:15 <milimetric> Krinkle - the diffs I saw were completely different pages, not old names.  Here, example of a pageview from the pageview table from that hour, 2025-09-30T23:
18:08:22 <milimetric> https://en.wikipedia.org/w/index.php?diff=1286192464&oldid=1124654206&title=The_Graduate
18:08:56 <+Krinkle> I see, so the diff rev and the old rev belong to different pages.
18:09:10 <+Krinkle> same as https://en.wikipedia.org/w/index.php?diff=1286192464&oldid=1124654206
18:09:13 <+Krinkle> (without title)
18:09:40 <+Krinkle> I assumed you meant that diff/old were page A and the title page B
18:09:41 <milimetric> yep
18:10:18 <milimetric> yeah what I'm suggesting is a bug here is allowing that without being part of some special permission group
18:10:27 <+Krinkle> Right so that's definitely a cross-page comparison. It's not common but has a few use cases. For example, I might fork a gadget to my user space with an edit, and then you can compare the two. Special:ComparePages allows you to do this via the GUI as well.
18:10:31 <milimetric> because it's an easy vector for DDoSing or just plain being a nuisance
18:10:51 <milimetric> right, I can see the use cases, just maybe good to limit it
18:10:55 <+Krinkle> through history merge and split such urls can also retroactively become that way even if they don't start that way
18:11:04 <+Krinkle> With diffonly=1 there isn't even a page content below the diff
18:11:11 <+Krinkle> so +1 for not treating them as pageviews.
18:11:47 <+Krinkle> although if we emit page_title from X-Analytics, we'll at least consistently attribute them to the (current) title of the right-hand side and (unless diffonly=1) thus the title of the content preview below the diff 
18:12:04 <milimetric> k, that's one good outcome, so then I'll close this.  Do you want me to open up a feature request for limiting cross-page diffs to certain user_groups?
18:12:17 <+Krinkle> idem for non-diff cases like https://en.wikipedia.org/w/index.php?title=The_Graduate&oldid=1124654206 which is actually [[Rhacophorus reinwardtii]]
18:12:32 <+Krinkle> I don't think we want to restrict that.
18:13:01 <milimetric> k, then I'll close the task as invalid.  Thanks for the many TILs :)
18:13:19 <+Krinkle> https://en.wikipedia.org/w/index.php?title=Special:ComparePages
         <+Krinkle> It comes up at times after someone manually moves  a page by copying and pasting, Tools like this can help untangle the mess. Granted, not likely used by anons by proxy of being less experienced, but I don't think we want to restrict that per-se either. Anyway, interesting to think about if it adds up with other downsides.
18:16:32 <+Krinkle> We do at least robots=noindex such diffs. And that GUI is probably not discovrable by anons. But yeah manual URL thinkinering and people hitting it on purpose will still come through. 
18:17:02 <+Krinkle> but those are hard to distinguish from cases where the revs did belong to the same page at some point, or when e.g. someone authorized made the link and then shares it with you

Thank you so much for the very clear summary @Milimetric !
Some notes:

To confirm conclusively, we'd have to join pageview_hourly to something like mediawiki_history to exclude redirects and pageview_actor to pageview_hourly to exclude diffs. Then check if there are any other cases.

  • The join with mediawiki_history can be done on all needed time, but the redirect information is only for page current-state, not historical
  • pageview_hourly being extracted from pageview_actor, there is no need to join here, but rather use pageview_actor instead of pageview_hourly :) Also, pageview_actor contains only 90 days of data, it cannot be used for backfilling.
  • The join with mediawiki_history can be done on all needed time, but the redirect information is only for page current-state, not historical

right. We would either have to accept that we'll get a few of these wrong (is_redirect is more stable than not historically) or join to mediawiki_content_history and actually look at the wikitext. This last idea could also just be done once to populate a historified version of is_redirect and then kept up to date via events.

  • pageview_hourly being extracted from pageview_actor, there is no need to join here, but rather use pageview_actor instead of pageview_hourly :) Also, pageview_actor contains only 90 days of data, it cannot be used for backfilling.

right, of course, I think I was thinking we could look through 90 days of pageview_actor, find all the cross-page diffs, and then see how many are non-automata and see if there's a way we can distinguish them from the limited context we have in pageview_hourly.

For diffs: could we not just modify the pageview algorithm and add an is_diff or pageview_kind=diff field that indicates if it was a diff pageview? We should know pretty easily by the URI path.

pageview_kind (name to be bikeshed) could be useful, if apps start adding page_ids to their pageview requests (and we can find them in webrequests), we could do e.g. pageview_kind=ios_app.

For diffs: could we not just modify the pageview algorithm and add an is_diff or pageview_kind=diff field that indicates if it was a diff pageview? We should know pretty easily by the URI path.

pageview_kind (name to be bikeshed) could be useful, if apps start adding page_ids to their pageview requests (and we can find them in webrequests), we could do e.g. pageview_kind=ios_app.

I think those ideas should be discussed with data-product people too :)