Page MenuHomePhabricator

EventBus produces non-canonical page urls
Closed, ResolvedPublic

Description

Code at https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/6721d4709db9d39664bfc77b092a18db05c66bb3/EventBus.php#L145 is producing urls like this:

https://en.wikipedia.org/wiki/Template_talk%3ACycling_data_CJP
https://pt.wikipedia.org/wiki/Lista_do_desempenho_de_independentes_em_elei%C3%A7%C3%B5es_de_Portugal
https://www.wikidata.org/wiki/Q28177126
https://en.wikipedia.org/wiki/Talk%3ACaillou_Pettis

<link rel="canonical" href="https://en.wikipedia.org/wiki/Talk:Caillou_Pettis"/>

This should be using one of Title's get*URL methods (such as getCanonicalURL) or wfUrlencode() directly so that colons and various other characters are not escaped.

The EventBus::getArticleURL explains this workaround through something about RESTBase not expecting slashes, but that doesn't seem relevant to a url property that contains the link to where the page can be viewed.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Pchelolo added a project: Services (doing).
Pchelolo subscribed.

The EventBus::getArticleURL explains this workaround through something about RESTBase not expecting slashes, but that doesn't seem relevant to a url property that contains the link to where the page can be viewed.

The idea here is that in RESTBase and services infrastructure we're using the default JS encodeURIComponent and decodeURIComponent which is the same as the php rawurlencode function. If we used the wfUrlencode we'd need to reimplement at least it's decoding part in JS and reencode the titles before sending them further into the services workflow. This is totally possible to do, but since these URIs are not really used right now, we didn't implement it in that way, so I'd say it's tech debt that's not affecting anything right now.

However, if we'd ever wanted to use these events to do something more MW-specific, like purge Varnish, we need to implement it in a proper way. I'd take this task and fix it up since the question was raised.

Removing the Analytics tag since it's not related to analytics.

Change 332873 had a related patch set uploaded (by Ppchelko):
Encode titles in event URIs complient to MW rules.

https://gerrit.wikimedia.org/r/332873

I've created a set of interdependent patches to solve this:

After all of them are merged/deployed this could be resolved.

Change 332873 merged by jenkins-bot:
Encode titles in event URIs complient to MW rules.

https://gerrit.wikimedia.org/r/332873

I've created a set of interdependent patches to solve this:

After all of them are merged/deployed this could be resolved.

Thanks! By the way, I did not expect this change in the EventBus extension to depend on the commits in ChangeProp. This issue related only to the url field, not any title field. I guess that ChangeProp uses url to unique identify a page object? (as opposed to a more raw tuple of wiki domain + unescaped page title)

EDIT: Aha, it uses '/https:\/\/[^\/]+\/wiki\/(?<title>.+)/' to extract title from the url. (src)

The MW change's been deployed - resolving.