
Many special pages missing from pageview_hourly dataset starting on July 23, 2019
Closed, Resolved (Public)

Description

While reviewing views to special pages for T234559, I noticed that views to a number of commonly viewed special pages such as History, Watchlist, and Contributions are not being recorded in the pageview_hourly data starting around July 23, 2019, leading to a sharp decrease in the daily number of views to special pages at that time. See plot below.

mobile_web_special_page_views.png (1×2 px, 201 KB)

Data via

SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
    access_method,
    SUM(view_count) AS special_page_views
FROM wmf.pageview_hourly
WHERE year = 2019 AND month >= 6
    -- look at special pages only (Special namespace)
    AND namespace_id = -1
    AND agent_type = 'user'
GROUP BY access_method, CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'));

The top viewed special pages on English Wikipedia in July include the commonly viewed special pages, but the list looks very different for August. See the top 15 for each month below.

July 2019 Top 15 Special Page Views

Data via:

SELECT
    CONCAT('https://en.wikipedia.org/wiki/', page_root) AS page_url,
    SUM(view_count) AS views
FROM (
    -- strip any subpage suffix so that e.g. Special:Contributions/<user> rolls up to Special:Contributions
    SELECT
        IF(INSTR(page_title, '/') = 0,
           page_title,
           SUBSTR(page_title, 1, INSTR(page_title, '/') - 1)) AS page_root,
        view_count
    FROM wmf.pageview_hourly
    WHERE year = 2019 AND month = 7
        AND namespace_id = -1
        AND project = 'en.wikipedia'
        AND agent_type = 'user'
) AS pr
GROUP BY page_root
ORDER BY views DESC
LIMIT 50;

August 2019 Top 15 Special Page Views

I reviewed webrequest data as well and confirmed that access requests to all typical special pages are being recorded as expected; however, when you filter by is_pageview there is a similar drop in views on July 23rd and a change in the top viewed pages as seen in the aggregated pageview_hourly data.
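The webrequest check described above could look roughly like the query below. This is only a sketch, not the query that was actually run: the column names (webrequest_source, uri_path, is_pageview, agent_type) and the use of a '/wiki/Special:%' path match are assumptions about the wmf.webrequest table, and the date range is an arbitrary window around July 23.

```sql
-- Sketch under assumptions: compare all requests to special pages with the
-- subset flagged as pageviews, per day, around the suspected change date.
SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
    COUNT(*) AS special_page_requests,
    SUM(IF(is_pageview, 1, 0)) AS special_page_pageviews
FROM wmf.webrequest
WHERE webrequest_source = 'text'
    AND year = 2019 AND month = 7 AND day BETWEEN 20 AND 26
    -- assumed way of picking out special pages from raw requests
    AND uri_path LIKE '/wiki/Special:%'
    AND agent_type = 'user'
GROUP BY CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'));
```

If the request count stays level across July 23 while the pageview count drops, the change is in how is_pageview is assigned rather than in what is being logged.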

Event Timeline

I'm betting the cause is https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/520671/, the change made to fix T226730.

It was deployed late (EU time) on 2019-07-22, so it seems very likely that this explains a problem starting the day after.

MNeisler triaged this task as Medium priority. (Dec 3 2019, 5:09 PM)
MNeisler moved this task from Triage to Tracking on the Product-Analytics board.
MNeisler added a project: Analytics.

This is correct: Special: pages other than Search should not have been included (a long-standing bug in the pageview definition code was fixed that day, and the fix removed those pages). Special:Search should still be recorded and, to our knowledge, still is.

@Nuria I don't understand the logic of whitelisting only those three special pages (Search, RecentChanges, and Version). The definition page on Meta says that "On the other hand, special pages that users purposefully navigate to, like Special:RecentChanges or Special:Version, are included." (I did write that line, back in 2016, but it was intended as a better phrasing of what the page said before, namely that "automatically-called special pages are excluded".)

But those two pages are only examples; there are many other special pages that users purposefully navigate to, like Watchlist, Contributions, and History. And even if we say that those pages don't represent content consumption (an argument which could equally be applied to Talk pages too), surely Special:Book does!

@Neil_P._Quinn_WMF
Indeed, it makes sense to include things such as Special:Book. Can you outline the set of pages that you think denote content consumption? That is a different thing than pages such as Special:Version or "Special: pages that the user purposely navigates to"; to my knowledge it was never the intent for those to be counted, and they were the cause of many spikes in the old pageview definition.

If we decide that we should change the definition in that way, I or someone else on Product Analytics could definitely make a whitelist.

However, about the intent, the part about excluding "automatically-called Special pages" was added to the definition in March 2015 by Os (Oliver) Keyes, and up until now the implementation has always reflected that. What makes you feel that the intent was something different?

Whatever the intent was, I don't see how including non-content Special pages like History is any different from including discussion pages. Neither one is content itself; instead, both are information about the content and how it was created.

Also, it seems like T226730 has relevant context to this task, but I don't think anyone on Product Analytics has access. Is it something you can share?

@Neil_P._Quinn_WMF

Erik's initial definition counted all special pages and did not do much filtering for bots. This created issues around huge spikes driven by programmatic constructs of MediaWiki that would call, for example, Special:HideBanner from a MediaWiki skin (this is just an example, there were others). This is the reason behind Oliver's comment about "automatically-called Special pages". There are many other examples of those "automatically" called pages, and of pages that are simply scripts or actions, like https://office.wikimedia.org/wiki/Special:Export, Special:CreateAccount, or Special:UserLogin, that were meant to be excluded.

The code that was excluding special pages from the definition was incomplete and very outdated, and we changed the exclusion list to an inclusion list when security pointed out to us that pages like Special:confirmEmail and similar were being reported as pageviews. This is one of the changes: https://github.com/wikimedia/analytics-refinery-source/commit/4b5c129b749bc08d22f8477c0cb01506315dd2ea#diff-618a01f5939249ef8a87b40c4d508011

As I mentioned earlier, the inclusion list can probably use some additions.
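To get a feel for what the switch to an inclusion list means in practice, a query along these lines could split pre-change special page views on English Wikipedia by whether the page is on the list. This is a hedged sketch: the three page names are taken from the discussion above (Search, RecentChanges, Version), the case-insensitive match is an assumption, and the query does not reproduce the matching rules of the actual refinery code.

```sql
-- Sketch under assumptions: views from July 1-22, 2019 (before the change),
-- grouped by whether the special page is one of the three named in this thread.
SELECT
    IF(LOWER(page_root) IN ('special:search', 'special:recentchanges', 'special:version'),
       'on inclusion list', 'not on inclusion list') AS list_status,
    SUM(view_count) AS views
FROM (
    SELECT
        IF(INSTR(page_title, '/') = 0,
           page_title,
           SUBSTR(page_title, 1, INSTR(page_title, '/') - 1)) AS page_root,
        view_count
    FROM wmf.pageview_hourly
    WHERE year = 2019 AND month = 7 AND day < 23
        AND namespace_id = -1
        AND project = 'en.wikipedia'
        AND agent_type = 'user'
) AS pr
GROUP BY
    IF(LOWER(page_root) IN ('special:search', 'special:recentchanges', 'special:version'),
       'on inclusion list', 'not on inclusion list');
```

Listing the top page_root values in the "not on inclusion list" bucket would be one way to draw up candidate additions.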

Thanks for this context!

Now that I have access to the task, the motivation does make sense, but I still feel that the change was a substantial change to the pageview definition rather than a simple technical correction. I can't really speak to the original intent, but obviously a clear, consistent rule was never set down. So I think we should take you up on your offer to propose such a clear rule! I've filed T240676 for that; I don't know how my team will prioritize it, but in the meantime we can just leave things as they are.