Page MenuHomePhabricator

EventBus produces non-canonical page urls
Closed, ResolvedPublic

Description

Code at https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/6721d4709db9d39664bfc77b092a18db05c66bb3/EventBus.php#L145 is producing urls like this:

https://en.wikipedia.org/wiki/Template_talk%3ACycling_data_CJP
https://pt.wikipedia.org/wiki/Lista_do_desempenho_de_independentes_em_elei%C3%A7%C3%B5es_de_Portugal
https://www.wikidata.org/wiki/Q28177126
https://en.wikipedia.org/wiki/Talk%3ACaillou_Pettis

<link rel="canonical" href="https://en.wikipedia.org/wiki/Talk:Caillou_Pettis"/>

This should be using one of Title's get*URL methods (such as getCanonicalURL) or wfUrlencode() directly so that colons and various other characters are not escaped.

The EventBus::getArticleURL explains this workaround through something about RESTBase not expecting slashes, but that doesn't seem relevant to a url property that contains the link to where the page can be viewed.

Event Timeline

Krinkle created this task.Jan 10 2017, 10:42 PM
Restricted Application added a project: Analytics. · View Herald TranscriptJan 10 2017, 10:42 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Pchelolo added a project: Services (doing).
Pchelolo added a subscriber: Pchelolo.

The EventBus::getArticleURL explains this workaround through something about RESTBase not expecting slashes, but that doesn't seem relevant to a url property that contains the link to where the page can be viewed.

The idea here is that in RESTBase and services infrastructure we're using the default JS encodeURIComponent and decodeURIComponent which is the same as the php rawurlencode function. If we used the wfUrlencode we'd need to reimplement at least it's decoding part in JS and reencode the titles before sending them further into the services workflow. This is totally possible to do, but since these URIs are not really used right now, we didn't implement it in that way, so I'd say it's tech debt that's not affecting anything right now.

However, if we'd ever wanted to use these events to do something more MW-specific, like purge Varnish, we need to implement it in a proper way. I'd take this task and fix it up since the question was raised.

Removing the Analytics tag since it's not related to analytics.

Restricted Application added a project: Analytics. · View Herald TranscriptJan 17 2017, 11:56 PM

Change 332873 had a related patch set uploaded (by Ppchelko):
Encode titles in event URIs complient to MW rules.

https://gerrit.wikimedia.org/r/332873

I've created a set of interdependent patches to solve this:

After all of them are merged/deployed this could be resolved.

Change 332873 merged by jenkins-bot:
Encode titles in event URIs complient to MW rules.

https://gerrit.wikimedia.org/r/332873

Nuria moved this task from Incoming to Radar on the Analytics board.Jan 23 2017, 5:08 PM

Moving to done pending deployment.

I've created a set of interdependent patches to solve this:

After all of them are merged/deployed this could be resolved.

Thanks! By the way, I did not expect this change in the EventBus extension to depend on the commits in ChangeProp. This issue related only to the url field, not any title field. I guess that ChangeProp uses url to unique identify a page object? (as opposed to a more raw tuple of wiki domain + unescaped page title)

EDIT: Aha, it uses '/https:\/\/[^\/]+\/wiki\/(?<title>.+)/' to extract title from the url. (src)

Pchelolo closed this task as Resolved.Jan 27 2017, 10:01 PM

The MW change's been deployed - resolving.