RESTBase doesn't check title normalization for wiktionary
Closed, Resolved · Public

Description

In MediaWiki there is a canonical way of representing a page title in the URI. For example, http://en.wikipedia.org/wiki/cat will immediately redirect you to https://en.wikipedia.org/wiki/Cat (note the capital letter).

We cache responses by URI with a fairly high TTL and purge them when the content changes, so we require clients to request content by the canonical title, for several reasons:

  • To prevent cache fragmentation and increase the hit ratio.
  • To avoid storing content in Cassandra under non-canonical titles, which creates 'dead' content in storage that will probably never be served again or deleted.
  • Most importantly, to know definitively which exact URL to purge when the content has changed.

The exact rules for converting an arbitrary article title to its canonical representation are very complex, but the mediawiki-title library takes care of that.

In RESTBase we have a filter that verifies that the title is correctly formatted and, if it is not, sends the client a 301 redirect to the proper location.
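A minimal sketch of what such a check could look like, not the actual RESTBase code. The `normalizeTitle` function is a toy stand-in for the mediawiki-title library (it only handles spaces and the first letter), and the request shape (`route`, `params`) and the `checkTitle` name are assumptions made for illustration.

```javascript
// Toy stand-in for the mediawiki-title library: NOT the real rules.
// It only replaces spaces with underscores and upper-cases the first letter.
function normalizeTitle(title) {
    const dbKey = title.replace(/ /g, '_');
    return dbKey.charAt(0).toUpperCase() + dbKey.slice(1);
}

// Hypothetical filter: returns a 301 response for a non-canonical title,
// or null to let the request continue to the real handler.
function checkTitle(req) {
    const title = req.params.title;
    if (title === undefined) {
        return null; // no {title} in the path, so the filter is a no-op
    }
    const canonical = normalizeTitle(title);
    if (canonical === title) {
        return null; // already canonical, nothing to do
    }
    // Fill the canonical title back into the (assumed) route template.
    const location = req.route.replace('{title}', () => encodeURIComponent(canonical));
    return { status: 301, headers: { location } };
}
```

With this sketch, a request for `/page/html/{title}` with `title` set to `cat` yields a 301 pointing at `/page/html/Cat`, while a request for `Cat` passes through untouched.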

This filter is enabled for all routes, but it is basically a no-op if there is no {title} parameter in the path. However, we have a definitions endpoint where the {term} parameter is essentially the page title, so we need to change the filter to support the term parameter as well. We could simply fall back to it, but it would be better to add a title_param_name parameter to the filter options (defaulting to title), use it in the filter, and register the filter a second time for Wiktionary.
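The proposed option could be sketched roughly as follows. Only the `title_param_name` option name and its `title` default come from the description above; the factory name `makeTitleFilter`, the request shape, and the route strings are hypothetical, and `normalizeTitle` is a toy stand-in for mediawiki-title that only converts spaces to underscores.

```javascript
// Toy stand-in for mediawiki-title: only converts spaces to underscores.
function normalizeTitle(title) {
    return title.replace(/ /g, '_');
}

// Hypothetical filter factory: the path parameter to validate is configurable.
function makeTitleFilter(options = {}) {
    const paramName = options.title_param_name || 'title';
    return function titleFilter(req) {
        const value = req.params[paramName];
        if (value === undefined) {
            return null; // parameter not present in this route: no-op
        }
        const canonical = normalizeTitle(value);
        if (canonical === value) {
            return null;
        }
        const location = req.route.replace(
            `{${paramName}}`,
            () => encodeURIComponent(canonical)
        );
        return { status: 301, headers: { location } };
    };
}

// Registered once with the default and once more for the definitions endpoint.
const defaultFilter = makeTitleFilter();
const definitionFilter = makeTitleFilter({ title_param_name: 'term' });
```

Registering a second instance keeps the default behaviour for every other route unchanged, which is the advantage over silently falling back from `title` to `term`.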

It is also important to note that we have never actually used the title validation library for Wiktionary. I believe it will work fine, but we at least need to add some automated tests for various Wiktionary titles to the validation library. At a minimum, we need to check that it does not capitalize or de-capitalize any letters, because Wiktionary titles are case-sensitive. Wiktionary also has some unusual namespaces that do not exist on Wikipedia, so we need to add a test using some of these namespaces.
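The kind of test cases described here might look like the sketch below. It does not call the real mediawiki-title API: `normalizeTitle` is a toy written for this example, and the `siteInfo` objects are hand-made assumptions that mimic the `case` setting MediaWiki reports in its site info (`first-letter` or `case-sensitive`). The namespace handling assumes namespace prefixes are matched case-insensitively and rewritten to their canonical names.

```javascript
// Toy normalizer for illustration only; the real rules live in mediawiki-title.
function normalizeTitle(title, siteInfo) {
    let rest = title.replace(/ /g, '_');
    let prefix = '';
    const colon = rest.indexOf(':');
    if (colon > 0) {
        // Namespace prefixes are matched case-insensitively (assumption).
        const candidate = rest.slice(0, colon).toLowerCase();
        const ns = siteInfo.namespaces.find((name) => name.toLowerCase() === candidate);
        if (ns !== undefined) {
            prefix = ns + ':';
            rest = rest.slice(colon + 1);
        }
    }
    // Only 'first-letter' wikis capitalize the first letter of the page name.
    if (siteInfo.case === 'first-letter') {
        rest = rest.charAt(0).toUpperCase() + rest.slice(1);
    }
    return prefix + rest;
}

// Hand-made site descriptions (assumptions, not fetched from any API).
const enwiki = { case: 'first-letter', namespaces: ['Talk', 'User'] };
const enwiktionary = { case: 'case-sensitive', namespaces: ['Appendix', 'Rhymes'] };

// [input, site, expected canonical form]
const cases = [
    ['cat', enwiki, 'Cat'],
    ['cat', enwiktionary, 'cat'],
    ['Cat', enwiktionary, 'Cat'],
    ['hot dog', enwiktionary, 'hot_dog'],
    ['appendix:English nouns', enwiktionary, 'Appendix:English_nouns'],
    ['Appendix:english nouns', enwiktionary, 'Appendix:english_nouns'],
];

// Collect every case whose normalized form differs from the expectation.
const failures = cases.filter(
    ([input, site, expected]) => normalizeTitle(input, site) !== expected
);
```

The two `cat` rows are the important ones: the same input must be capitalized on Wikipedia and left alone on Wiktionary.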

Restricted Application added a subscriber: Aklapper. · Oct 24 2018, 10:10 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-05T20:58:22Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@5b8ad3c]: Update deps, removed sections table, T207904 T206048 T207324

Mentioned in SAL (#wikimedia-operations) [2018-11-05T21:10:37Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@5b8ad3c]: Update deps, removed sections table, T207904 T206048 T207324 (duration: 12m 15s)

Mentioned in SAL (#wikimedia-operations) [2018-11-05T21:14:12Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@5b8ad3c]: Update deps, removed sections table, T207904 T206048 T207324 take 2

Mentioned in SAL (#wikimedia-operations) [2018-11-05T21:23:30Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@5b8ad3c]: Update deps, removed sections table, T207904 T206048 T207324 take 2 (duration: 09m 18s)

Pchelolo closed this task as Resolved. · Nov 5 2018, 10:30 PM

The fix has been deployed.