ParsoidBatchAPI doesn't normalize file titles
Closed, Resolved (Public)

Description

ParsoidBatchAPI fails when an input file title is not normalized: it reports the file as missing. This happens, for example, when there is a space between the namespace and the dbkey, as reported by subbu:

Rerunning visual diffing identified output diffs on [[Image: Flag of the British Army.svg|centre|180px]] between runs with and without the batching API.

[subbu@earth api] echo '[[Image: Flag of the British Army.svg|centre|180px]]' | node parse --useBatchAPI
...
<img resource="./File:_Flag_of_the_British_Army.svg" src="./Special:FilePath/_Flag_of_the_British_Army.svg" height="180" width="180" data-parsoid='{"a":{"resource":"./File:_Flag_of_the_British_Army.svg","height":"180","width":"180"},"sa":{"resource":"Image: Flag of the British Army.svg"}}'/> 
...
[subbu@earth api] echo '[[Image: Flag of the British Army.svg|centre|180px]]' | node parse 
...
<img resource="./File:_Flag_of_the_British_Army.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/27/Flag_of_the_British_Army.svg/180px-Flag_of_the_British_Army.svg.png" data-file-width="675" data-file-height="450" data-file-type="drawing" height="120" width="180" data-parsoid='{"a":{"resource":"./File:_Flag_of_the_British_Army.svg","height":"120","width":"180"},"sa":{"resource":"Image: Flag of the British Army.svg"}} 
...

The rendering from non-batching API output matches the rendering from the output of the PHP parser on https://en.wikipedia.org/wiki/Royal_Engineers

Output diffs also show up for '[[File:Panama National Anthem.ogg]]' on enwiki:Panama and for '[[File: Arizona Fleming Elem FBISD.JPG|thumb|left|upright=1.7|alt=Photo of a school building with the lettering "Arizona Fleming Elementary".|Arizona Fleming Elementary School in Fort Bend ISD]]' on enwiki:Arizona_Fleming.
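
For illustration, here is a rough sketch of the kind of normalization the failing titles would need before a lookup. It is not the ParsoidBatchAPI or MediaWiki code: the function name `normalize_file_title` and the alias table `FILE_NAMESPACE_ALIASES` are hypothetical, and the rules (trim whitespace around the namespace separator, map the `Image` alias to `File`, turn spaces into underscores, capitalize the first letter) are a simplified assumption about what title normalization does on enwiki.

```python
import re

# Hypothetical alias table for this sketch; real wikis configure
# namespace names and aliases per site.
FILE_NAMESPACE_ALIASES = {"file", "image"}


def normalize_file_title(raw: str) -> str:
    """Return a simplified 'File:Db_key' form of a file title."""
    namespace, sep, rest = raw.partition(":")
    if sep and namespace.strip().replace("_", " ").lower() in FILE_NAMESPACE_ALIASES:
        name = rest
    else:
        # No recognized namespace prefix: treat the whole string as the name.
        name = raw
    # Collapse runs of spaces/underscores into one underscore, then trim
    # the ends. This removes the stray separator left by "Image: Flag ...".
    dbkey = re.sub(r"[ _]+", "_", name).strip("_")
    # Capitalize the first character (assumes first-letter capitalization).
    dbkey = dbkey[:1].upper() + dbkey[1:]
    return "File:" + dbkey


print(normalize_file_title("Image: Flag of the British Army.svg"))
# prints File:Flag_of_the_British_Army.svg
```

Under these assumptions, both the British Army and the Arizona Fleming titles lose the leading underscore that shows up in the `Special:FilePath/_Flag_of_the_British_Army.svg` fallback above.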

Event Timeline

tstarling raised the priority of this task to Needs Triage.
tstarling updated the task description.
tstarling added subscribers: tstarling, ssastry.

Change 240017 had a related patch set uploaded (by Tim Starling):
Normalize filenames in imageinfo

https://gerrit.wikimedia.org/r/240017

Change 240017 merged by jenkins-bot:
Normalize filenames in imageinfo

https://gerrit.wikimedia.org/r/240017

[[File:Panama National Anthem.ogg]] is an example of T112045 being fixed. Previously the audio file was linked in the img src attribute, which is not right.

The Arizona test case is another poorly normalized file title.

ssastry assigned this task to tstarling.
ssastry triaged this task as Medium priority.
ssastry removed a project: Patch-For-Review.
ssastry removed a subscriber: gerritbot.