Decode entities when following redirects in the mobile content service
Closed, Resolved (Public)

Description

We are using HTML dumps based on https://github.com/wikimedia/htmldumper to test RESTBase changes in staging. While testing this, we encountered several invalid titles that contained the substring '&amp;':

  • Super8_&amp;_Tab
  • Lilo_&amp;_Stitch:_The_Series

Those titles were returned by the allpages API end point. This could either be a broken encoding pass in the Action API, or a corruption in the page table.

Similar titles are apparently requested from the RESTBaseUpdateJob, which might point towards this being a bug in a common title escape method.

To find examples, search for the phrase "title-invalid-characters" in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase.

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 31 2016, 10:22 PM
GWicke updated the task description. Mar 31 2016, 10:23 PM
Anomie added a subscriber: Anomie. Apr 1 2016, 6:28 AM

Those titles were returned by the allpages API end point. This could either be a broken encoding pass in the Action API, or a corruption in the page table.

Actual URLs for that would be helpful. I'm unable to find action API queries that return titles containing "&amp;"; this and this don't do it, for example.

I don't see any information about action API requests in your Kibana board.

On further investigation, it turns out that these encoded titles were actually requested from the mobile app content service, and *not* from the HTML dump script or the job queue. This means that the original suspicion that there could be an encoding issue in the API or page table was unfounded. @Anomie, my apologies for jumping to conclusions.

The issue is that the mobile app service manually follows redirects, but neglects to decode HTML entities. This then results in invalid titles being requested whenever the redirect href contains entities.

@Mholloway, @bearND: It would be good if you could make sure that entities are decoded before following the href, for example using the entities package.
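As a rough illustration of the fix requested above, here is a minimal sketch in Python. It is not the service's actual code: the function name `decode_redirect_target` is hypothetical, and Python's standard `html.unescape` stands in for the entities package mentioned above. It assumes the redirect href arrives as a raw attribute value in which '&' is still written as '&amp;'.

```python
import html


def decode_redirect_target(raw_href: str) -> str:
    """Decode HTML entities in a raw redirect href before using it as a title.

    Hypothetical helper: the name and the assumption that the href is a raw,
    entity-encoded attribute value are illustrative only.
    """
    # html.unescape turns named and numeric character references
    # (for example "&amp;") back into the characters they stand for.
    return html.unescape(raw_href)


if __name__ == "__main__":
    # The two titles reported in this task, as they would look undecoded.
    for raw in ("Super8_&amp;_Tab", "Lilo_&amp;_Stitch:_The_Series"):
        print(raw, "->", decode_redirect_target(raw))
```

Decoding has to happen exactly once, before the title is requested. Skipping it produces the invalid titles reported here, while decoding twice could corrupt a title that legitimately contains text that looks like an entity.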

GWicke renamed this task from "Investigate source of invalid titles containing &amp; returned by allpages" to "Decode entities when following redirects in the mobile content service". Apr 1 2016, 7:02 PM
GWicke edited projects, added Mobile-Content-Service; removed MediaWiki-API.
bearND claimed this task. Apr 4 2016, 4:05 PM
bearND triaged this task as High priority.
bearND moved this task from Incoming to Doing on the Mobile-Content-Service board.
bearND moved this task from To Do to Doing on the Mobile-App-Android-Sprint-79-Gold board.

Change 281588 had a related patch set uploaded (by BearND):
Decode entities when following redirects

https://gerrit.wikimedia.org/r/281588

bearND moved this task from Doing to Code Review on the Mobile-Content-Service board.

Change 281588 merged by Ppchelko:
Decode entities when following redirects

https://gerrit.wikimedia.org/r/281588

Dbrant closed this task as Resolved. Apr 11 2016, 3:35 PM