Page MenuHomePhabricator

Pageviews missing for pages with plus signs in title
Closed, ResolvedPublic

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 28 2019, 4:23 AM
fdans added subscribers: awight, fdans.

This seems because of @awight 's change on page titles. We'll work on it this week.

fdans triaged this task as Unbreak Now! priority.Jun 3 2019, 3:47 PM
fdans moved this task from Incoming to Analytics Query Service on the Analytics board.
Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptJun 3 2019, 3:47 PM
awight added a comment.EditedJun 4 2019, 7:35 AM

It seems that the pageviews are still recorded, but with the plus sign stripped (or converted to whitespace then trimmed), for example: https://en.wikipedia.org/wiki/%2B44_(band) now has pageviews as https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-60&pages=44_(band) . Definitely a bug...

Update: the "44 (band)" link has two orders of magnitude fewer pageviews than "+44 (band)", so it's likely that we're rejecting the data when it comes through the normal route. I don't know why there are any pageviews at all, however.

awight added a comment.Jun 4 2019, 7:51 AM

My first guess is that we might be double-unescaping. The valid title regex is,

^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$

Notably, "+" is not a valid character but "%2B" is allowed. If this is the case, and the encoded entity has been unescaped by the point it hits our pageview definition, it will be rejected.

awight added a comment.Jun 4 2019, 8:13 AM

Reading the suspect patch d7e2b6bc1d69, we indeed URL-decode before checking against the valid title regex. But we use a custom PercentDecoder which does *not* turn plus-sign into space.

Ah ha! I misread $wgLegalTitleChars, it included plus-sign at the end of the list of character ranges, and I must have read as a regex. The fix is simply to add plus sign to our legal char regex in PageViewDefinition.java.

Change 514235 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/refinery/source@master] Plus signs are valid title characters

https://gerrit.wikimedia.org/r/514235

Change 514235 merged by Nuria:
[analytics/refinery/source@master] Plus signs are valid title characters

https://gerrit.wikimedia.org/r/514235

Milimetric assigned this task to Nuria.Jun 5 2019, 4:12 PM
mforns moved this task from In Code Review to Done on the Analytics-Kanban board.Jun 6 2019, 2:31 PM
fdans closed this task as Resolved.Jun 10 2019, 3:30 PM