Starting on 2019-04-25, pages with + in the title no longer have pageviews. Examples:
Description
Details
Subject | Repo | Branch | Lines +/- |
---|---|---|---|
Plus signs are valid title characters | analytics/refinery/source | master | +2 -1 |
Related Objects
- Mentioned In
- T224969: 20% time things for 5/29
- Mentioned Here
- rANRSd7e2b6bc1d69: Reject invalid Page titles
Event Timeline
It seems that the pageviews are still recorded, but with the plus sign stripped (or converted to whitespace and then trimmed). For example, https://en.wikipedia.org/wiki/%2B44_(band) now has its pageviews recorded under https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-60&pages=44_(band) . Definitely a bug...
Update: the "44 (band)" link has two orders of magnitude fewer pageviews than "+44 (band)", so it's likely that we're rejecting the data when it comes through the normal route. I don't know why there are any pageviews at all, however.
My first guess is that we might be double-unescaping. The valid title regex is:
^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$
Notably, "+" is not a valid character but "%2B" is allowed. If this is the case, and the encoded entity has been unescaped by the point it hits our pageview definition, it will be rejected.
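To make the hypothesis concrete, here is a minimal Java sketch that compiles the regex quoted above and checks the encoded and decoded forms of the example title. The class and method names (`TitleRegexCheck`, `isValidTitle`) are illustrative assumptions and are not taken from the refinery source.

```java
import java.util.regex.Pattern;

public class TitleRegexCheck {
    // The valid-title regex as quoted in this task (no plus sign in the class).
    static final Pattern VALID_TITLE = Pattern.compile(
        "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$");

    // Hypothetical helper: true if the whole title matches the regex.
    static boolean isValidTitle(String title) {
        return VALID_TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidTitle("%2B44_(band)")); // true: still percent-encoded
        System.out.println(isValidTitle("+44_(band)"));   // false: decoded plus sign
    }
}
```

If the title reaches the check already decoded, only the second case applies, which would match the drop in pageviews.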
Reading the suspect patch d7e2b6bc1d69, we indeed URL-decode before checking against the valid title regex. But we use a custom PercentDecoder which does *not* turn plus-sign into space.
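The difference between the two decoding styles can be sketched as follows. `java.net.URLDecoder` is the standard form-style decoder and maps `+` to a space; `percentDecodeOnly` is a hypothetical stand-in written for this comment, not the refinery's actual PercentDecoder, and only resolves `%XX` escapes.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeComparison {

    // Form-style decoding: java.net.URLDecoder maps '+' to a space.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Hypothetical percent-only decoder: resolves %XX escapes, leaves '+' untouched.
    static String percentDecodeOnly(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit(in[i + 1], 16);
                int lo = Character.digit(in[i + 2], 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // emit the decoded byte
                    i += 2;                    // skip the two hex digits
                    continue;
                }
            }
            out.write(in[i]); // copy everything else verbatim, including '+'
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("[" + percentDecodeOnly("%2B44_(band)") + "]"); // [+44_(band)]
        System.out.println("[" + formDecode("+44_(band)") + "]");          // [ 44_(band)]
    }
}
```

With a percent-only decoder the title arrives at the regex check as `+44_(band)`, so a plus-to-space conversion in this step would not explain the rejection.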
Ah ha! I misread $wgLegalTitleChars: it includes a plus sign at the end of its list of character ranges, and I must have read that as regex syntax rather than as a literal character. The fix is simply to add the plus sign to our legal-character regex in PageViewDefinition.java.
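A rough sketch of what that change amounts to, assuming the plus sign is appended as a literal inside the character class (the merged patch may place it elsewhere, and the names `FixedTitleRegex` and `isValidTitle` are made up for illustration):

```java
import java.util.regex.Pattern;

public class FixedTitleRegex {
    // Same regex as quoted earlier in this task, with a literal '+' added
    // inside the character class, just before the closing bracket.
    static final Pattern VALID_TITLE = Pattern.compile(
        "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff+]+$");

    // Hypothetical helper: true if the whole title matches the regex.
    static boolean isValidTitle(String title) {
        return VALID_TITLE.matcher(title).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidTitle("+44_(band)")); // true: plus is now legal
        System.out.println(isValidTitle("a#b"));        // false: '#' is still rejected
    }
}
```

Inside the brackets the `+` is an ordinary character, while the `+` after the closing bracket remains the "one or more" quantifier.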
Change 514235 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/refinery/source@master] Plus signs are valid title characters
Change 514235 merged by Nuria:
[analytics/refinery/source@master] Plus signs are valid title characters
Super, thanks @awight. Can you document this here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Changes_and_known_problems_since_2015-03-04