Quote marks in "alt" text break media attribute parsing
Open, NormalPublic

Description

The following wikitext:

Before
[[File:Translating the wiki way.webm|thumb|400px|thumbtime=16:00|Niklas Laxström explains the features that allowed [[translatewiki.net]] to provide MediaWiki with more than 300 locales.|alt=Niklas Laxström, ''Translating the wiki way: Simple, fast, fun'', Wikimania 2012]]

Renders visually as:

That is to say, the alt text is not the caption. Rather, the bit that doesn't start with any with equals sign, is the caption. When editing the above in VisualEditor, it looks fine at first:

But, when without changing anything, or when changing something unrelated elsewhere on the page, it will be serialised incorrectly. This caused an unsuspecting user to save a dirty diff (revision 851249986) that also unexpectedly inserted use of <nowiki>.

After
[[File:Translating the wiki way.webm|thumb|400px|thumbtime=16:00|Niklas Laxström explains the features that allowed [[translatewiki.net]] to provide MediaWiki with more than 300 locales.|alt=Niklas Laxström, <nowiki>''</nowiki>Translating the wiki way: Simple, fast, fun<nowiki>''</nowiki>, Wikimania 2012]]

In addition to the dirty diff, it also does not roundtrip. After the above edit, re-opening it in VisualEditor now mistakes the alt text (including the syntax alt= itself) as the caption:

After this second save, the page remains consistently damaged with the caption removed.

Krinkle created this task.Oct 14 2018, 2:32 AM
Restricted Application added subscribers: jeblad, Aklapper. · View Herald TranscriptOct 14 2018, 2:32 AM
cscott added a subscriber: cscott.Oct 15 2018, 1:02 PM

There's a misfeature in the PHP parser where "anything which doesn't properly parse as an option" is assumed to be a caption. Parsoid mimics this behavior, for better or for worse. (Predictably, see a new syntax proposal here.)

Anyway, that's just to say that the problem is most likely "alt option fails to parse"; the "interpreted as a caption" part is just the "expected" side-effect of any option failing to parse.

cscott added a comment.EditedOct 15 2018, 1:07 PM

And in particular it seems to be the newline between the double-quotes in the caption which is causing the alt attribute to fail.

$ (echo "[[File:Foo.jpg|caption|alt=''alt'']]") | bin/parse.js --normalize=parsoid

<p><figure-inline class="mw-default-size" typeof="mw:Image" data-mw='{"caption":"caption"}'><a href="./File:Foo.jpg"><img alt="alt" resource="./File:Foo.jpg" src="//upload.wikimedia.org/wikipedia/commons/0/06/Foo.jpg" data-file-width="300" data-file-height="197" data-file-type="bitmap" height="197" width="300"/></a></figure-inline></p>

vs

$ (echo "[[File:Foo.jpg|caption|alt=''"; echo "alt'']]") | bin/parse.js --normalize=parsoid

<p><figure-inline class="mw-default-size" typeof="mw:Image" data-mw='{"caption":"alt=&lt;i>\nalt&lt;/i>"}'><a href="./File:Foo.jpg"><img resource="./File:Foo.jpg" src="//upload.wikimedia.org/wikipedia/commons/0/06/Foo.jpg" data-file-width="300" data-file-height="197" data-file-type="bitmap" height="197" width="300"/></a></figure-inline></p>
cscott renamed this task from Media "alt" text wrongly interpreted as caption to Newlines in "alt" text between double-quotes breaks media attribute parsing.Oct 15 2018, 1:11 PM
cscott added a comment.EditedOct 15 2018, 1:19 PM

Hey, look at this:

$ (echo "[[File:Foo.jpg|caption|alt=''"; echo "alt'']]") | php maintenance/parse.php 
<p><a href="/~cananian/mediawiki/index.php/File:Foo.jpg" class="image" title="alt= alt"><img alt="alt= alt" src="/~cananian/mediawiki/images/0/06/Foo.jpg" width="400" height="267" /></a>
</p>

PHP doesn't handle this either. That is, it *is* an invalid alt, and *should* be treated as a caption AFAICT. However, it shouldn't break round-tripping...

(It appears correctly in https://en.wikipedia.org/w/index.php?title=MediaWiki&oldid=851249986 -- but this seems to be TimedMediaHandler's doing, not the core parser.)

cscott added a comment.EditedOct 15 2018, 1:29 PM

Hm, line breaks appear to be a red herring, see: https://en.wikipedia.org/w/index.php?title=User:Cscott/T206940&oldid=864158059

New diagnosis:

  1. html2wt bug fails to selser this
  2. html2wt bug tries to nowiki escape invalid markup in alt tag (this isn't going to work!)
  3. wt2html bug assumes attribute is caption if it contains <nowiki> markup, instead of just stripping it.

i'm working on a patch

Krinkle renamed this task from Newlines in "alt" text between double-quotes breaks media attribute parsing to Quote marks in "alt" text break media attribute parsing.Oct 15 2018, 5:22 PM

@cscott Hm.. so it's not the new lines. I assume that means its the quote marks that trigger it?

Well, a combination of things. It's triggered by the quote marks, but then we're not handling the <nowiki> in the serialized version either. [[File:Foo.jpg|alt=<nowiki>''alt''</nowiki>]] ought to be the "correct" way to get embedded single quotes into the alt value. It looks like we might have a similar issue with the link option as well. I've got a working patch for alt, working on understanding the link issue.

Turns out there's a bug in how core PHP parses [[File:Foo.jpg|link=Foo''s bar''s]] (which is a valid title) or [[File:Foo.jpg|link=''Main Page'']] (where the italics apparently should be stripped). So now I've got a patch for core as well as one for Parsoid...

Change 467529 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/core@master] Handle <nowiki> and other markup consistently in image link/alt options

https://gerrit.wikimedia.org/r/467529

Change 467531 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] WIP: Image alt and link options can contain arbitrary wikitext

https://gerrit.wikimedia.org/r/467531

This turned into a little rathole, but I've come out the other end fixing (a) how Parsoid parses alt/link options (wikitext markup including <nowiki> is allowed), (b) how Parsoid renders alt/link options (consistent stripping), (c) how core renders link options (<nowiki> expansion and stripping consistent with alt) , (d) how core handles ampersands in alt/link options (bug in remex), and (e) how Parsoid handles ampersands in alt/link options. Now we've just got to get those three patches merged, starting with the remex bug (T207088: Remex double-decodes HTML entities on PHP (not HHVM)) because the newly-added test cases won't pass on jenkins until remex is fixed and the fix is packaged and released so composer can get to it.

Change 467529 merged by jenkins-bot:
[mediawiki/core@master] Handle <nowiki> and other markup consistently in image link/alt options

https://gerrit.wikimedia.org/r/467529

Change 467531 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Image alt and link options can contain arbitrary wikitext which is stripped

https://gerrit.wikimedia.org/r/467531

Arlolra triaged this task as Normal priority.
Arlolra assigned this task to cscott.

Mentioned in SAL (#wikimedia-operations) [2018-11-07T22:02:07Z] <arlolra> Updated Parsoid to 970751a (T206940)

@cscott Although this is much improved and the "after" (from above) of the edit now renders as expected,

[[File:Translating the wiki way.webm|thumb|400px|thumbtime=16:00|Niklas Laxström explains the features that allowed [[translatewiki.net]] to provide MediaWiki with more than 300 locales.|alt=Niklas Laxström, <nowiki>''</nowiki>Translating the wiki way: Simple, fast, fun<nowiki>''</nowiki>, Wikimania 2012]]

with the alt attribute in the data-mw,

<figure typeof="mw:Video/Thumb" data-parsoid='{"optList":[{"ck":"thumbnail","ak":"thumb"},{"ck":"width","ak":"400px"},{"ck":"thumbtime","ak":"thumbtime=16:00"},{"ck":"caption","ak":"Niklas Laxström explains the features that allowed [[translatewiki.net]] to provide MediaWiki with more than 300 locales."},{"ck":"alt","ak":"alt=Niklas Laxström, &lt;nowiki>&apos;&apos;&lt;/nowiki>Translating the wiki way: Simple, fast, fun&lt;nowiki>&apos;&apos;&lt;/nowiki>, Wikimania 2012"}],"dsr":[0,307,2,2]}' data-mw='{"attribs":[["alt",{"html":"alt=Niklas Laxström, &lt;nowiki>&apos;&apos;&lt;/nowiki>Translating the wiki way: Simple, fast, fun&lt;nowiki>&apos;&apos;&lt;/nowiki>, Wikimania 2012"}]],"thumbtime":"16:00"}'><span data-parsoid='{"dsr":[2,null,null,null]}'><video poster="//upload.wikimedia.org/wikipedia/commons/thumb/3/37/Translating_the_wiki_way.webm/400px-seek%3D960-Translating_the_wiki_way.webm.jpg" controls="" preload="none" height="225" width="400" resource="./File:Translating_the_wiki_way.webm" data-parsoid='{"a":{"height":"225","width":"400","resource":"./File:Translating_the_wiki_way.webm"},"sa":{"resource":"File:Translating the wiki way.webm"}}'>
...
</video></span><figcaption data-parsoid='{"dsr":[null,305,null,null]}'>Niklas Laxström explains the features that allowed <a rel="mw:WikiLink" href="./Translatewiki.net" title="Translatewiki.net" data-parsoid='{"stx":"simple","a":{"href":"./Translatewiki.net"},"sa":{"href":"translatewiki.net"},"dsr":[116,137,2,2]}'>translatewiki.net</a> to provide MediaWiki with more than 300 locales.</figcaption></figure>

It unfortunately roundtrips as follows,

[[File:Translating the wiki way.webm|thumb|400px|thumbtime=16:00|Niklas Laxström explains the features that allowed [[translatewiki.net]] to provide MediaWiki with more than 300 locales.|alt=Niklas Laxström, &lt;nowiki&gt;<nowiki>''</nowiki>&lt;/nowiki&gt;Translating the wiki way: Simple, fast, fun&lt;nowiki&gt;<nowiki>''</nowiki>&lt;/nowiki&gt;, Wikimania 2012]]

The test cases you added were for images but the above is a video, which handles alt/link differently with silentOptions().

Did you check the other issues from https://phabricator.wikimedia.org/T206940#4670526 ? Sounds like you're saying that wt2html is fixed (both PHP and Parsoid agree and do something sensible) but that html2wt for video specifically is still broken since it treats the (invisible) embedded alt attribute as plaintext rather than HTML. Or maybe html2wt is alright but we shouldn't be embedding the invisible alt as HTML but should be doing the same tag-stripping that we would do for a "real" alt attribute. (I think I'm a little partial to this latter.)

Did you check the other issues from https://phabricator.wikimedia.org/T206940#4670526 ?

Tangentially, I was mainly verifying that we fixed the original issue that was filed.

Sounds like you're saying that wt2html is fixed (both PHP and Parsoid agree and do something sensible) but that html2wt for video specifically is still broken since it treats the (invisible) embedded alt attribute as plaintext rather than HTML.

The attribute that we're stuffing in data-mw claims to be html but it looks to be unparsed (at least the nowikis aren't rendered as spans),

["alt",{"html":"alt=Niklas Laxström, &lt;nowiki>&apos;&apos;&lt;/nowiki>Translating the wiki way: Simple, fast, fun&lt;nowiki>&apos;&apos;&lt;/nowiki>, Wikimania 2012"}]

Or maybe html2wt is alright but we shouldn't be embedding the invisible alt as HTML but should be doing the same tag-stripping that we would do for a "real" alt attribute. (I think I'm a little partial to this latter.)

Seems reasonable.