
Media links with underscores in the URL are dirty-diffed (with no underscores)
Open, Medium, Public

Event Timeline

ssastry triaged this task as High priority. Nov 25 2019, 9:46 PM
ssastry created this task.
Restricted Application added a subscriber: Aklapper. Nov 25 2019, 9:46 PM
cscott added a subscriber: cscott. Nov 25 2019, 9:57 PM

officewiki is running 1.35.0-wmf.5 (a473ba0) and VisualEditor 0.1.1 (26ebdd0) 14:26, 5 November 2019.

The revert for T237040 was cherry picked to branch wmf/1.35.0-wmf.2 as commit 87ef3e53e533c9565d226e6a48ed70e673d636d1: https://gerrit.wikimedia.org/r/543956

So this issue shouldn't be caused by T237040, AFAICT.

ssastry lowered the priority of this task from High to Medium. Nov 25 2019, 10:04 PM
ssastry removed a project: Parsoid-PHP.
ssastry renamed this task from Media links dirty diffed on officewiki to Media links with underscores in the URL are dirty-diffed (with no underscoes). Nov 25 2019, 10:04 PM
ssastry renamed this task from Media links with underscores in the URL are dirty-diffed (with no underscoes) to Media links with underscores in the URL are dirty-diffed (with no underscores).

Seems to be present on both enwiki and officewiki, so not a Parsoid/PHP issue.

Cf: https://en.wikipedia.org/w/index.php?title=User:Cscott/T237040&type=revision&diff=927959666&oldid=927959637&diffmode=source

But it doesn't occur in straight wt2wt:

$ echo '[[Media:CBQ_RPO_1938.jpg|caption]]' | bin/parse.js --wt2wt
[[Media:CBQ_RPO_1938.jpg|caption]]

VE sends HTML like this back to Parsoid:

<body id=\"mwAA\" class=\"mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output\" dir=\"ltr\" lang=\"en\"><p id=\"mwAg\"><a href=\"./Media:CBQ_RPO_1938.jpg\" rel=\"mw:WikiLink\" resource=\"./Media:CBQ_RPO_1938.jpg\" title=\"CBQ RPO 1938.jpg\" id=\"mwAw\">This is a caption</a></p>
<p id=\"mwBA\">xyz</p></body>

while Parsoid's wt2wt has HTML like this at the midpoint:

<p data-parsoid='{"dsr":[0,34,0,0]}'><a rel="mw:MediaLink" href="//upload.wikimedia.org/wikipedia/en/f/fb/CBQ_RPO_1938.jpg" resource="./Media:CBQ_RPO_1938.jpg" title="CBQ RPO 1938.jpg" data-parsoid='{"a":{"resource":"./Media:CBQ_RPO_1938.jpg"},"sa":{"resource":"Media:CBQ_RPO_1938.jpg"},"dsr":[0,34,null,null]}'>caption</a></p>

It's not clear where the spaces in the dirty-diffed title are coming from; they aren't present in the href, resource or data-parsoid. The only place we have spaces is the title attribute.
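
For anyone who wants to double-check that observation mechanically, here is a minimal sketch in Python. It is not Parsoid or VE code; the helper names (AnchorAttrs, attrs_with_spaced_name) are made up for illustration, and the input is the `<a>` element from the VE payload above with the JSON escaping removed. It lists which attributes of the link carry the file name with spaces in place of underscores.

```python
from html.parser import HTMLParser


class AnchorAttrs(HTMLParser):
    """Collect the attributes of every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


def attrs_with_spaced_name(html, underscored_name):
    """Return names of <a> attributes holding the file name with spaces
    instead of underscores (hypothetical helper)."""
    spaced = underscored_name.replace("_", " ")
    parser = AnchorAttrs()
    parser.feed(html)
    hits = []
    for anchor in parser.anchors:
        for name, value in anchor.items():
            if value and spaced in value:
                hits.append(name)
    return hits


# The link from the VE payload quoted above, unescaped.
ve_html = (
    '<p id="mwAg"><a href="./Media:CBQ_RPO_1938.jpg" rel="mw:WikiLink" '
    'resource="./Media:CBQ_RPO_1938.jpg" title="CBQ RPO 1938.jpg" '
    'id="mwAw">This is a caption</a></p>'
)

print(attrs_with_spaced_name(ve_html, "CBQ_RPO_1938.jpg"))  # ['title']
```

This only confirms where the spaced form appears in the HTML; it says nothing about which step of the serializer actually picks it up.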

Daimona added a subscriber: Daimona. Dec 5 2019, 6:32 PM

I don't know whether it makes any difference, but I'd like to point out that this also happens for media links inside <gallery> tags, see example.

They generally do share serialization code but the dirtying there is coming from T214649, since the gallery presumably wasn't edited in that case. There's also T211895 / T151367 to deal with other normalizations in galleries.