
Non-Normalised titles in mobile apps
Closed, Resolved · Public

Description

To avoid duplicated storage and cache fragmentation, we've recently enabled title normalisation in RESTBase. The normalisation replicates the process that happens in MediaWiki. Incoming requests with non-normalised titles are redirected to the normalised versions, which adds latency for clients.

A normalised title uses the canonical localised namespace name and the title in dbKey format, with underscores instead of spaces. More details are in T127144 and https://github.com/wikimedia/mediawiki-title
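To make the two rules above concrete, here is a minimal sketch. It is not the mediawiki-title implementation: the `NAMESPACE_ALIASES` table and the `normalise_title` helper are hypothetical, the pt.wikipedia.org namespace names are used purely as an illustration, and real normalisation covers more cases (for example first-letter capitalisation, which depends on wiki configuration).

```python
# Hypothetical alias table for a single wiki (pt.wikipedia.org).
# A real implementation would load namespace names and aliases from the wiki.
NAMESPACE_ALIASES = {
    "wikipedia": "Wikipédia",
    "wikipédia": "Wikipédia",
}


def normalise_title(title: str) -> str:
    """Return `title` with a canonical namespace and dbKey-style underscores."""
    namespace, sep, page = title.partition(":")
    canonical = NAMESPACE_ALIASES.get(namespace.strip().lower())
    if sep and canonical is not None:
        # Known namespace prefix: swap in the canonical localised name.
        title = f"{canonical}:{page.strip()}"
    # dbKey format: spaces become underscores.
    return title.strip().replace(" ", "_")
```

With this table, `normalise_title("Wikipedia:Página principal")` yields `Wikipédia:Página_principal`, while titles with an unknown prefix only get the space-to-underscore treatment.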

However, looking at the logs, the Android app and the mobile content service sometimes use non-normalised titles. Here's the list of what I could find:

  • Use of a non-localised namespace. Example: Wikipedia:Página_principal is normalised to Wikipédia:Página_principal (note the accent on the é). Seen on pt.wikipedia.org, user agent: WikipediaApp/2.1.142-beta-2016-03-07 (Android 4.4.4; Phone)
  • Use of fragments in the URI. Example: List_of_minor_planets:_257001–258000#201 (note the #201 part). Although this is completely legal, the fragment makes no difference to RESTBase: the returned content is the same either way. So it's better to strip it right away. User agent: WMF Mobile Content Service
  • Untrimmed whitespace. Example: WSXGA_ -> WSXGA. User agent: WMF Mobile Content Service
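For the fragment and whitespace cases, a client-side cleanup could look roughly like the sketch below. The `clean_title` helper is hypothetical and is not taken from any of the services mentioned here. It assumes that a trailing underscore is just whitespace in dbKey form, and it does not attempt full normalisation (for instance, it does not collapse runs of whitespace inside the title).

```python
def clean_title(raw_title: str) -> str:
    """Strip a URI fragment and surrounding whitespace from a page title."""
    # Everything from the first '#' onwards is a fragment and does not
    # change which page is requested.
    title, _, _fragment = raw_title.partition("#")
    # In dbKey form spaces are underscores, so trim both at the edges.
    return title.replace(" ", "_").strip("_")
```

Applied to the examples above, `clean_title("List_of_minor_planets:_257001–258000#201")` drops the `#201` part and `clean_title("WSXGA_")` returns `WSXGA`.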

I will monitor the logs for some time and may append to this list. However, this is not a major problem: the rate of incorrect requests is fairly low, and even an erroneous request is not a big deal, as it just results in a redirect.

Issues #2 and #3 will be resolved automatically when we switch on actual redirecting in RESTBase, as backend services will then only receive pre-normalised titles. However, for #1 it is still not clear where the Android app gets this title.

Event Timeline

Change 276549 had a related patch set uploaded (by Mholloway):
Trim URI fragments/whitespace from titles before making requests

https://gerrit.wikimedia.org/r/276549

Is there actually a need to handle any of this in the mobileapps service? Won't RESTBase take care of title normalizations already?

My thinking was that a little cleanup up front could only help, but I'm happy to abandon the patch if it's not useful.

Change 276549 abandoned by Mholloway:
Trim URI fragments/whitespace from titles before making requests

Reason:
per IRC discussion

https://gerrit.wikimedia.org/r/276549

@Pchelolo is #1 appearing for other languages as well? If so, it might be worth changing our main page title generation script to pull from siteinfo rather than allmessages.

@Mholloway Actually, I don't know. I've only seen this one case in the logs, but that doesn't prove anything.

@Mholloway: Using siteinfo should ensure consistency, as the information exposed there is actually what is considered the default namespace / main page by MediaWiki, in normalized form.
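As a rough illustration of the siteinfo approach (a sketch only, not the code from the actual patch): the MediaWiki Action API exposes the main page title under `query.general.mainpage` when called with `meta=siteinfo&siprop=general`. The helper names below are hypothetical, and the sample response is hand-written for illustration rather than captured from a live wiki.

```python
from urllib.parse import urlencode


def siteinfo_url(domain: str) -> str:
    """Build the Action API URL that returns general site information."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"


def main_page_title(siteinfo: dict) -> str:
    """Extract the main page title from a siteinfo response, in dbKey form."""
    return siteinfo["query"]["general"]["mainpage"].replace(" ", "_")


# Hand-written sample shaped like a siteinfo response (illustrative only).
sample = {"query": {"general": {"mainpage": "Wikipédia:Página principal"}}}
print(siteinfo_url("pt.wikipedia.org"))
print(main_page_title(sample))
```

Because the title comes from siteinfo, it already carries the localised namespace, so only the space-to-underscore step is needed on the client.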

Change 276629 had a related patch set uploaded (by Mholloway):
Use siteinfo rather than allmessages API call to get wiki Main Page names

https://gerrit.wikimedia.org/r/276629

Change 276629 merged by jenkins-bot:
Use siteinfo rather than allmessages API call to get wiki Main Page names

https://gerrit.wikimedia.org/r/276629

@Pchelolo, I think this is resolved with the latest patch. OK to close the task?

Pchelolo claimed this task.

@Mholloway I suppose so. We will keep monitoring the logs and will open another issue if a new problem is discovered.