Better redirect handling for pageview API
Open, Low, Public

Description

The existence of redirects can cause some confusion over the data when it comes to page views, since it appears that the view only gets counted for the page title of the redirect, even though, in practice, a redirect means that a visitor *views* the target page. There are valid reasons in MediaWiki for the redirect page and the target page to be treated separately, but for many metrics-related use cases, you probably actually want to know the aggregate counts for the page you are querying as well as all redirects to it—or at least have the option to include them with a togglable parameter.

  • I have a list of articles compiled from some source, like a category, contributions list, or WhatLinksHere. One of the articles in this list was recently renamed, so even though I have the current name from the source list, most of its views are counted under the former page name, which is just a redirect I don't know about. In this case, if the current name didn't exist at all within the period I am querying, even though it's a longstanding article that did exist under a previous name, my query gives me an unhelpful 404.
  • I am querying for pageviews because I want to measure the popularity of a particular article, in order to compare it to others. My data won't be as accurate as intended when some articles in the set have very canonical names ("Canada," "Mexico"), while others may be known by several common names redirecting to one article title ("USA" and "United States of America" redirect to "United States"). Sometimes, for Manual of Style reasons, Wikipedia even chooses lesser-used but more correct names for titles. In this case, I would have wanted the option to include all redirects' page views as well, since including "USA" and "United States of America" page views gives me a better sense of the total viewership of "United States" compared to other articles.

In fact, for analytics, it probably makes the most sense for redirects to be included by default, especially since page moves in an article's history can distort your data in unexpected ways. There should still be an option to exclude them if you legitimately want to know the page views of a page without its redirects, but that seems rarer.
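To make the request concrete, here is a minimal client-side sketch in Python of what a redirect toggle could compute. It is not an existing API feature: `total_views`, the `include_redirects` parameter, and the injected `fetch_views` and `list_redirects` callables are all hypothetical names chosen for illustration, and the view counts are made up. A title with no data in the queried period (the 404 case above) is counted as zero rather than failing the whole query.

```python
from typing import Callable, Iterable, Optional


def total_views(
    title: str,
    fetch_views: Callable[[str], Optional[int]],
    list_redirects: Callable[[str], Iterable[str]],
    include_redirects: bool = True,
) -> int:
    """Sum views for `title`, optionally adding the views of every redirect to it.

    `fetch_views` is expected to return None when there is no data for a
    title (the 404 case); that is counted as zero views.
    """
    titles = [title]
    if include_redirects:
        titles.extend(list_redirects(title))
    return sum(fetch_views(t) or 0 for t in titles)


# Made-up numbers, for illustration only.
views = {"United_States": 1000, "USA": 250, "United_States_of_America": 50}
redirects = {"United_States": ["USA", "United_States_of_America"]}


def lookup_redirects(title: str) -> list:
    return redirects.get(title, [])


with_redirects = total_views("United_States", views.get, lookup_redirects)
without_redirects = total_views(
    "United_States", views.get, lookup_redirects, include_redirects=False
)
```

Defaulting `include_redirects` to `True` mirrors the suggestion that aggregate counts are the more common need, while the flag keeps the per-title number available.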

Event Timeline

Dominicbm raised the priority of this task from to Needs Triage.
Dominicbm updated the task description.
Dominicbm added a project: Analytics.
Dominicbm added a subscriber: Dominicbm.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Dec 18 2015, 9:34 PM
Milimetric moved this task from Incoming to temporary on the Analytics board. · Jan 12 2016, 7:27 PM
Milimetric moved this task from temporary to Incoming on the Analytics board. · Jan 12 2016, 7:35 PM
Milimetric moved this task from Incoming to temporary on the Analytics board. · Jan 12 2016, 7:38 PM
Milimetric moved this task from temporary to Incoming on the Analytics board. · Jan 12 2016, 7:43 PM
Milimetric triaged this task as Normal priority. · Mar 7 2016, 5:14 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.

I think we can look at this more closely after we get a handle on redirects as part of the wikistats 2.0 data pipeline. Redirects are complicated on mediawiki.

Nuria moved this task from Dashiki to Backlog (Later) on the Analytics board. · May 16 2017, 12:53 PM
Nuria lowered the priority of this task from Normal to Low. · Jun 26 2018, 4:15 PM
Nuria moved this task from Backlog (Later) to Analytics Query Service on the Analytics board.

Hm, @bd808 do you think it would actually be useful to address issues like that one by one or just wait for a better way to group pages with all their redirects? (as in, parsing wikitext historically and building the redirect graph)

That's a good question. I guess I'll start with a clarifying question: is the idea of having a full redirect graph that accounts for #REDIRECT foo redirects as well as special pages like Special:MyLanguage something that is likely to be fully implemented in the near term (say the Foundation's 2018/19 fiscal year) or is it a long term roadmap project with no planned implementation date? I think my answer would be very much colored by this.

I have a hunch that the full redirect solution will be slow in coming precisely because of your previous "Redirects are complicated on mediawiki." comment and what I know about that topic. If there is some low-hanging fruit, like Special:MyLanguage prefix and language suffix stripping, that can be applied in the processing chain, then that gets us more useful data (in my opinion), data that is currently very difficult to compute otherwise.
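As a rough illustration of the kind of low-hanging fruit meant here, the following Python sketch normalizes a title by stripping a `Special:MyLanguage/` prefix and a trailing language-code segment. The function name `normalize_title` and the suffix pattern are assumptions made for this example; nothing here describes the actual processing chain.

```python
import re

PREFIX = "Special:MyLanguage/"

# Assumption: a trailing "/xx" or "/xx-yyyy" segment is a language code.
# A real pipeline would check the segment against the wiki's configured
# language list, because an ordinary subpage such as "Help:Foo/api" would
# also match this pattern.
LANG_SUFFIX = re.compile(r"/[a-z]{2,3}(-[a-z0-9]+)*$")


def normalize_title(title: str) -> str:
    """Map translation-related variants of a title to one base title."""
    if title.startswith(PREFIX):
        title = title[len(PREFIX):]
    return LANG_SUFFIX.sub("", title)
```

With this, `Special:MyLanguage/Help:Contents`, `Help:Contents/de`, and `Help:Contents` would all be counted under `Help:Contents`.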

The ability to auto-include redirects is a commonly requested feature (T163621, T200256, for example). I built Redirect Views for this purpose, but currently you can only look up one page at a time. Tools like Massviews process thousands of pages. For this we'd need pageviews for the target page + redirects to be precomputed and provided by the API, as doing it programmatically would be much too slow and inefficient.

There are some people who do want to see pageviews not including redirects (participants of move discussions, etc.), but this seems to be a minority.

I think the full redirect graph accounting for all redirects becomes possible after we ingest and process wikitext content. It's feasible for the 2018/2019 year, and it's something a bunch of us want to get done, but it won't get done unless we're making good progress on our other goals. That's why I hesitate to take care of this special case, lest it lead me down a rabbit hole and delay the more general fix even more.
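For readers unfamiliar with the idea, here is a minimal Python sketch of building a redirect graph from wikitext and rolling per-title views up to the final target. It rests on several loud assumptions: only the English `#REDIRECT [[Target]]` form is recognized, the graph is a single snapshot rather than a historical one, redirect chains are followed to the end as a design choice, and all names and numbers are made up for illustration. Special pages such as Special:MyLanguage are not covered.

```python
import re
from collections import defaultdict
from typing import Dict, Optional

# Assumption: only the English keyword is handled; localized keywords are not.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext: str) -> Optional[str]:
    """Return the redirect target of a page's wikitext, or None."""
    m = REDIRECT_RE.match(wikitext)
    return m.group(1).strip().replace(" ", "_") if m else None


def build_graph(pages: Dict[str, str]) -> Dict[str, str]:
    """Map each redirect title to its immediate target."""
    graph = {}
    for title, text in pages.items():
        target = redirect_target(text)
        if target is not None:
            graph[title] = target
    return graph


def resolve(title: str, graph: Dict[str, str]) -> str:
    """Follow redirects to the final target, stopping if a cycle is found."""
    seen = {title}
    while title in graph:
        title = graph[title]
        if title in seen:
            break
        seen.add(title)
    return title


def roll_up(views: Dict[str, int], graph: Dict[str, str]) -> Dict[str, int]:
    """Add every title's views to the page it ultimately redirects to."""
    totals = defaultdict(int)
    for title, count in views.items():
        totals[resolve(title, graph)] += count
    return dict(totals)


# Made-up snapshot and counts, for illustration only.
pages = {
    "USA": "#REDIRECT [[United States]]",
    "US_of_A": "#redirect [[USA]]",
    "United_States": "The United States is a country ...",
}
graph = build_graph(pages)
totals = roll_up({"United_States": 1000, "USA": 250, "US_of_A": 5}, graph)
```

A historical version would need one such graph per time window, since a title can be an article in one period and a redirect in another.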