
Media: Handle HTML entities in plaintext descriptions
Closed, Declined (Public)

Description

It looks like we'll need to decode HTML entities before the striptags pass when generating the plaintext file descriptions.


Report (https://github.com/wikimedia/wikipedia-ios/pull/2310#issuecomment-391008942):

@mdholloway I ran across this description while testing:

&lt;a href=\"<a rel=\"nofollow\" class=\"external free\" href=\"http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998\">http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998</a>\" rel=\"nofollow\"&gt;Queensland State Archives Digital Image ID 7998&lt;/a&gt;

via https://en.wikipedia.org/api/rest_v1/page/media/Diana,_Princess_of_Wales

Then I tried:

> striptags("&lt;a href=\"<a rel=\"nofollow\" class=\"external free\" href=\"http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998\">http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998</a>\" rel=\"nofollow\"&gt;Queensland State Archives Digital Image ID 7998&lt;/a&gt;")

and got

&lt;a href="http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998" rel="nofollow"&gt;Queensland State Archives Digital Image ID 7998&lt;/a&gt;

I don't think we can do much about editors explicitly encoding HTML tags into the plaintext description using entities (multiple passes of striptags?), but it might make sense to at least expand the HTML entities to get <a href="http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998" rel="nofollow">Queensland State Archives Digital Image ID 7998</a>
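
For illustration only, here is a rough sketch of the "multiple passes" idea mentioned above: strip tags, expand entities, then strip again. It is not the mobileapps implementation. `stripTagsNaive`, `decodeEntities`, and `toPlainText` are hypothetical helpers written for this example; the regex stripper is a simplified stand-in for the real `striptags` package, and the decoder only knows a handful of entities.

```javascript
// Hypothetical stand-in for the striptags package: removes anything that
// looks like <...>. Real HTML needs a proper parser; this is only a sketch.
function stripTagsNaive(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Hypothetical decoder covering only the entities relevant to this report.
// A single replace pass means "&amp;lt;" becomes "&lt;", not "<".
const ENTITIES = { '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'", '&amp;': '&' };
function decodeEntities(text) {
  return text.replace(/&(?:lt|gt|quot|#39|amp);/g, (match) => ENTITIES[match]);
}

// Strip real tags, expand entity-encoded tags, then strip those as well.
function toPlainText(description) {
  return stripTagsNaive(decodeEntities(stripTagsNaive(description)));
}

// The description from the report, with the JSON escaping removed.
const url = 'http://www.archivessearch.qld.gov.au/Image/DigitalImageDetails.aspx?ImageId=7998';
const sample =
  `&lt;a href="<a rel="nofollow" class="external free" href="${url}">${url}</a>" ` +
  'rel="nofollow"&gt;Queensland State Archives Digital Image ID 7998&lt;/a&gt;';

// One pass leaves the entity-encoded tag in place, as in the report.
console.log(stripTagsNaive(sample));

// Three steps reduce it to the link text.
console.log(toPlainText(sample));
// → Queensland State Archives Digital Image ID 7998
```

Note that any approach like this would also strip text that an editor deliberately wrote with entities (for example, a description that discusses the literal string `&lt;b&gt;`), which is one reason to treat the input above as an editor error rather than something to handle in code.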

Event Timeline

Change 436281 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Media: Decode HTML entities in commons metadata before stripping tags

https://gerrit.wikimedia.org/r/436281

We discussed and decided this looks like an editor error, and we shouldn't implement anything for it.

Change 436281 abandoned by Mholloway:
Media: Decode HTML entities in commons metadata before stripping tags

Reason:
We discussed and decided this looks like an editor error, and we shouldn't implement anything for it.

https://gerrit.wikimedia.org/r/436281

Vvjjkkii renamed this task from Media: Handle HTML entities in plaintext descriptions to 62baaaaaaa. Jul 1 2018, 1:07 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Mholloway as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from 62baaaaaaa to Media: Handle HTML entities in plaintext descriptions. Jul 2 2018, 3:26 PM
CommunityTechBot closed this task as Declined.
CommunityTechBot assigned this task to Mholloway.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: gerritbot, Aklapper.