
Pageviews missing for pages with plus signs in title
Closed, Resolved, Public

Event Timeline

fdans added subscribers: awight, fdans.

This seems to be because of @awight's change to page titles. We'll work on it this week.

fdans triaged this task as Unbreak Now! priority. Jun 3 2019, 3:47 PM
fdans moved this task from Incoming to Analytics Query Service on the Analytics board.

It seems that the pageviews are still recorded, but with the plus sign stripped (or converted to whitespace then trimmed), for example: now has pageviews as . Definitely a bug...

Update: the "44 (band)" link has two orders of magnitude fewer pageviews than "+44 (band)", so it's likely that we're rejecting the data when it comes through the normal route. I don't know why there are any pageviews at all, however.

My first guess is that we might be double-unescaping. The valid title regex is,

^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$

Notably, "+" is not a valid character but "%2B" is allowed. If this is the case, and the encoded entity has been unescaped by the point it hits our pageview definition, it will be rejected.
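To illustrate the hypothesis above, here is a minimal Python sketch of a title check with a character class that lacks "+". The pattern is a simplified Python rendering of the regex quoted above (an assumption: the real check lives in Java and its string escaping differs), and the name `VALID_TITLE` is hypothetical.

```python
import re

# Assumption: simplified Python rendering of the quoted title regex.
# Note that "+" is NOT in the character class, while "%", "2" and "B" are.
VALID_TITLE = re.compile(
    r"""^[ %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\u0080-\uffff]+$"""
)

# Still percent-encoded: every character is in the class, so it passes.
encoded_ok = VALID_TITLE.match("%2B44_(band)") is not None

# Already decoded: the literal "+" is outside the class, so it is rejected.
decoded_ok = VALID_TITLE.match("+44_(band)") is not None
```

If the title has been decoded before it reaches a check like this, the "+" form is dropped even though the encoded form would have passed.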

Reading the suspect patch d7e2b6bc1d69, we indeed URL-decode before checking against the valid title regex. But we use a custom PercentDecoder which does *not* turn plus-sign into space.
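For comparison, the difference between a plain percent-decoder and a form-style decoder can be sketched in Python. This is only an analogy: the custom PercentDecoder mentioned above is Java code, and the standard-library functions `urllib.parse.unquote` and `urllib.parse.unquote_plus` are used here as stand-ins, not as its actual implementation.

```python
from urllib.parse import unquote, unquote_plus

# Plain percent-decoding: "%2B" becomes "+", and a literal "+" is left alone.
decoded = unquote("%2B44_(band)")
untouched = unquote("+44_(band)")

# Form-style decoding: a literal "+" is turned into a space.
form_decoded = unquote_plus("+44_(band)")

# Trimming the result would yield the plus-less title seen in the data.
trimmed = form_decoded.strip()

# Decoding twice with a form-style second pass loses the plus sign as well.
double_decoded = unquote_plus(unquote("%2B44_(band)")).strip()
```

A decoder of the first kind keeps the plus sign intact, so the validity check that runs afterwards has to accept "+" for these titles to be counted.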

Ah ha! I misread $wgLegalTitleChars: it includes the plus sign at the end of the list of character ranges, and I must have read it as a regex. The fix is simply to add the plus sign to our legal char regex in

Change 514235 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/refinery/source@master] Plus signs are valid title characters

Change 514235 merged by Nuria:
[analytics/refinery/source@master] Plus signs are valid title characters
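The shape of the fix described in this thread, decoding first and then validating against a character class that includes "+", can be sketched as follows. This is a hedged Python illustration, not the merged Java patch: the pattern is a simplified rendering of the regex quoted earlier, and `LEGAL_TITLE` and `is_valid_title` are hypothetical names.

```python
import re
from urllib.parse import unquote

# Assumption: simplified title regex with "+" added to the character class.
LEGAL_TITLE = re.compile(
    r"""^[ %!"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~+\u0080-\uffff]+$"""
)

def is_valid_title(raw_title: str) -> bool:
    """Percent-decode the title (leaving "+" alone), then validate it."""
    return LEGAL_TITLE.match(unquote(raw_title)) is not None
```

With "+" in the class, both `is_valid_title("%2B44_(band)")` and `is_valid_title("+44_(band)")` pass, while a title containing a character outside the class, such as "<", is still rejected.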