Page MenuHomePhabricator

python-mwviews does not handle unicode in titles
Closed, ResolvedPublic

Description

As speculated in a comment in the source code, python-mwviews does not correctly handle unicode input.

For example: the Swagger endpoint correctly handles the input that it cites as an example of the need for good encoding, "Are_You_the_One?". PageViewsClient does not. It also does not handle the URIs that I am interested in: subpages, e.g. "Wikipedia:Wikipedia_Signpost/2016-01-06/News_and_notes".

Here is a notebook summarizing the issue.

Event Timeline

ResMar created this task.Jan 10 2016, 8:05 PM
ResMar raised the priority of this task from to Needs Triage.
ResMar updated the task description. (Show Details)
ResMar added a project: Research.
ResMar added a subscriber: ResMar.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptJan 10 2016, 8:05 PM
ResMar updated the task description. (Show Details)Jan 10 2016, 8:06 PM
ResMar set Security to None.
ResMar added a comment.EditedJan 11 2016, 12:41 AM

I've clone the code and dug through it a bit and figured out the bug. At line 109 within the article_views definition:

outputDays = timestamps_between(startDate, endDate, timedelta(days=1))
output = defaultdict(dict, {
    day : {a : None for a in articles} for day in outputDays
})

This code is redundant and in the case of escaped characters populates the dict with incorrectly formatted entries. For example we want GNU/Linux_naming_controversy but we get instead GNU%2FLinux_naming_controversy.

Later on we get:

defaultdict(<class 'dict'>, {datetime.datetime(2015, 12, 27, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 157}, datetime.datetime(2015, 12, 28, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 175}, datetime.datetime(2015, 12, 25, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 128}, datetime.datetime(2015, 12, 26, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 146}, datetime.datetime(2015, 12, 12, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 169}, datetime.datetime(2015, 12, 15, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 211}, datetime.datetime(2015, 12, 11, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 203}, datetime.datetime(2016, 1, 4, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 310}, datetime.datetime(2016, 1, 3, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 174}, datetime.datetime(2015, 12, 16, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 232}, datetime.datetime(2015, 12, 21, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 187}, datetime.datetime(2015, 12, 31, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 184}, datetime.datetime(2015, 12, 19, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 162}, datetime.datetime(2015, 12, 29, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 176}, datetime.datetime(2015, 12, 18, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 176}, datetime.datetime(2015, 12, 30, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 193}, datetime.datetime(2016, 1, 6, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 416}, datetime.datetime(2016, 1, 8, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 301}, datetime.datetime(2016, 1, 1, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 193}, datetime.datetime(2015, 12, 24, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 142}, datetime.datetime(2015, 12, 20, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 227}, datetime.datetime(2016, 1, 2, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 161}, datetime.datetime(2015, 12, 17, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 196}, datetime.datetime(2015, 12, 14, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 209}, datetime.datetime(2015, 12, 23, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 152}, datetime.datetime(2016, 1, 5, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 560}, datetime.datetime(2016, 1, 7, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 313}, datetime.datetime(2015, 12, 22, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 206}, datetime.datetime(2016, 1, 10, 0, 0): {'GNU%2FLinux_naming_controversy': None}, datetime.datetime(2016, 1, 9, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 223}, datetime.datetime(2015, 12, 13, 0, 0): {'GNU%2FLinux_naming_controversy': None, 'GNU/Linux_naming_controversy': 184}})

I removed the initialization entirely. This is the submitted fix:

output = defaultdict(dict)

@Milimetric Can you push this change? It doesn't look like I can create pull requests against the GitHub codebase.

Hi @ResMar: Please associate at least one project with this task, otherwise nobody can find this task when searching in the corresponding project(s). Thanks.

@ResMar, I'm happy to merge a fix, but removing the initiation misses an important case, which is when the API doesn't return data for every day in the range we query for. This client attempts to normalize the output, so that initialization is important. It seems like the issue is simple though, if that fix makes the problem go away, I'll try to dig into it after I catch up with my now monstrous backlog :)

Milimetric added a project: Analytics.
ResMar added a comment.EditedJan 12 2016, 2:16 AM

@Aklapper Apologies, being unfamiliar with the organization of Phabricator tasks (well, with what place this toolset occupies in it) I was unsure of where to place this bug. Thank you!

@Milimetric Ok, got it. I cloned this repository locally and used my fix without realizing that it was causing an error in my application that was caused by exactly the case that you just described, but which I did not catch until just now. My suggested fix silently broke my own code! Not good.

The other fix that I had suggested but withdrew in an edit was to to use urllib.parse.unquote():

day : {urllib.parse.unquote(a) : None for a in articles} for day in outputDays

I can verify that this fix really did correct my issue locally. However, I believe that this codebase is meant to be version-agnostic, and unquote was moved into the parse sub-module only in 3. I've never written version-agnostic code before so I don't know what the precise fix in that case is, though.

Still, I hope that is helpful.

Milimetric moved this task from Incoming to temporary on the Analytics board.Jan 12 2016, 7:27 PM
Milimetric moved this task from temporary to Incoming on the Analytics board.Jan 12 2016, 7:36 PM
Milimetric moved this task from Incoming to temporary on the Analytics board.
Milimetric moved this task from temporary to Incoming on the Analytics board.Jan 12 2016, 7:43 PM

Thanks @ResMar, getting to these issues will take a bit, as I have a bunch of high priority work, but I appreciate the code and thoughts.

Milimetric closed this task as Resolved.Feb 29 2016, 5:30 PM
ResMar removed a subscriber: ResMar.Apr 15 2016, 1:44 PM