Page MenuHomePhabricator

Linter fails to detect multiple "upright" parameters as a Bogus file option
Closed, ResolvedPublic

Description

Linter fails to detect duplicate upright parameters like this:

[[File:Temppeliaukio_Church_Interior.JPG|thumb|upright|right|upright=1.15|The [[Temppeliaukio Church]] in Finland is carved out of a single rock.]]

It should see that there are two "upright" parameters and flag them as a Bogus file option.

Event Timeline

Possibly related: <code>|upright =1.5|</code> (with a space between "upright" and the equals sign) is ignored (the image renders as upright=1.0), but is not detected as a bogus option.

A variant on this bug:

[[File:Temppeliaukio_Church_Interior.JPG|thumb|right|upright=2|upright|The [[Temppeliaukio Church]] in Finland is carved out of a single rock.]]

In this example, the second "upright" is marked as a Linter bogus file option, but it is actually the "upright=2" that is being ignored. The Linter error is correct, but the options are being processed incorrectly (the first valid option should be processed in order to be consistent with handling of the other named parameters like left/right and thumb/frame).

I also documented a bunch of edge cases, some of which behave differently based on which parameter is being processed, at https://www.mediawiki.org/wiki/Help:Images. It seems like someone might need to go back to the spec for the File display options and make sure that the code is written to the spec. (...or write a spec...) I keep finding odd little quirks.

ssastry triaged this task as Medium priority.Mar 11 2019, 5:08 PM
ssastry moved this task from Backlog to Parsoid on the MediaWiki-extensions-Linter board.

One of the notes I added to Help:Images: "If there is a space character between alt and the equals sign, the alt statement will be treated as a caption." This behavior is contrary to how editors are used to working with template parameters. Yes, I know that images are not templates, but it is difficult to manage this inconsistent behavior, and there does not appear to be a good design reason for treating these spaces as syntactically meaningful while (for example) stripping out spaces after the equals sign.

You can see this alt= versus alt = behavior here: https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox3&oldid=935129500

Here is another variant on this bug:

https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox3&oldid=939466925

The following image options are not detected as bogus by the Linter engine:

upright=2S
upright 0.75S

In all cases, a caption is present, so the invalid upright option without an equals sign should either be flagged as excess text or as incorrect syntax for the upright option.

Another variant on this bug: the following wikicode renders on en.WP with the "link" section as the caption, and Linter does not flag the real caption as extra text. That looks like two bugs in one image call.

[[File:NYC Bus with "masks required" sign.jpg|thumb|New York City transit bus. 177th Street near Devoe Ave, Bronx, NY. Route sign reads, "Masks Required"|link=File:NYC_Bus_with_%22masks_required%22_sign.jpg|alt=]]

This works fine:

[[File:NYC Bus with "masks required" sign.jpg|thumb|New York City transit bus. 177th Street near Devoe Ave, Bronx, NY. Route sign reads, "Masks Required"|link=File:NYC_Bus_with_"masks_required"_sign.jpg|alt=]]

Here's another upright option that is not detected as bogus by Linter:

[[File:Liloy National High School ESF Building.jpg|thumb|upright=1.5"|right|The Liloy National High School 2-story ESF Building.]]

Note the quote mark after the upright value.

It appears that the first error (duplicate upright) is detected now after some Linter bug patches, but the second error (a quote mark following the number) is still not detected. See this test case:

https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox&oldid=1010141113

And another variant found in the wild:
https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox&oldid=1011570863

[[Image:gray437.png|thumb|upright=0.7px|Human lower leg anatomy]]

Change 777474 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Lint invalid upright options

https://gerrit.wikimedia.org/r/777474

Change 777474 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Lint invalid upright options

https://gerrit.wikimedia.org/r/777474

One of the notes I added to Help:Images: "If there is a space character between alt and the equals sign, the alt statement will be treated as a caption." This behavior is contrary to how editors are used to working with template parameters. Yes, I know that images are not templates, but it is difficult to manage this inconsistent behavior, and there does not appear to be a good design reason for treating these spaces as syntactically meaningful while (for example) stripping out spaces after the equals sign.

You can see this alt= versus alt = behavior here: https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox3&oldid=935129500

This comes from pattern matching,

{
                "name": "img_alt",
                "aliases": [
                    "alt=$1"
                ],
                "case-sensitive": ""
            },

https://github.com/wikimedia/parsoid/blob/master/baseconfig/enwiki.json#L1158

Another variant on this bug: the following wikicode renders on en.WP with the "link" section as the caption, and Linter does not flag the real caption as extra text. That looks like two bugs in one image call.

[[File:NYC Bus with "masks required" sign.jpg|thumb|New York City transit bus. 177th Street near Devoe Ave, Bronx, NY. Route sign reads, "Masks Required"|link=File:NYC_Bus_with_%22masks_required%22_sign.jpg|alt=]]

This works fine:

[[File:NYC Bus with "masks required" sign.jpg|thumb|New York City transit bus. 177th Street near Devoe Ave, Bronx, NY. Route sign reads, "Masks Required"|link=File:NYC_Bus_with_"masks_required"_sign.jpg|alt=]]

Parsoid doesn't detect this because it renders them identically. % is an invalid title char and I suppose Parsoid is (incorrectly?) doing percent decoding of the string before trying to construct a Title object with it.

It appears that the first error (duplicate upright) is detected now after some Linter bug patches, but the second error (a quote mark following the number) is still not detected. See this test case:

https://en.wikipedia.org/w/index.php?title=User:Jonesey95/sandbox&oldid=1010141113

Should be fixed with T216003#7834043

Another variant on this bug: the following wikicode renders on en.WP with the "link" section as the caption, and Linter does not flag the real caption as extra text. That looks like two bugs in one image call.

[[File:NYC Bus with "masks required" sign.jpg|thumb|New York City transit bus. 177th Street near Devoe Ave, Bronx, NY. Route sign reads, "Masks Required"|link=File:NYC_Bus_with_%22masks_required%22_sign.jpg|alt=]]

Parsoid doesn't detect this because it renders them identically. % is an invalid title char and I suppose Parsoid is (incorrectly?) doing percent decoding of the string before trying to construct a Title object with it.

If that is the case, a % or other invalid character in link= should be reported as a Linter error. It should not just fail silently.

If that is the case, a % or other invalid character in link= should be reported as a Linter error. It should not just fail silently.

Maybe. It could be that the legacy parser is being inconsistent and it some cases it does percent decoding and, in this, it isn't. I need to investigate it further and see if it's better to just fix that. There is a test I know of with link=< that both parsers handle the same way. So, it isn't all invalid characters.

For example, [[:File:NYC_Bus_with_%22masks_required%22_sign.jpg]] seems to work

Change 777870 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/core@master] Percent decode titles in media link option

https://gerrit.wikimedia.org/r/777870

A variant on this bug:

[[File:Temppeliaukio_Church_Interior.JPG|thumb|right|upright=2|upright|The [[Temppeliaukio Church]] in Finland is carved out of a single rock.]]

In this example, the second "upright" is marked as a Linter bogus file option, but it is actually the "upright=2" that is being ignored. The Linter error is correct, but the options are being processed incorrectly (the first valid option should be processed in order to be consistent with handling of the other named parameters like left/right and thumb/frame).

I also documented a bunch of edge cases, some of which behave differently based on which parameter is being processed, at https://www.mediawiki.org/wiki/Help:Images. It seems like someone might need to go back to the spec for the File display options and make sure that the code is written to the spec. (...or write a spec...) I keep finding odd little quirks.

This is a bit of a mess. So, the format (framed, thumbnail, etc) is first one wins. Dimensions (width, height) are last one wins. Horizontal and vertical alignment (left, right, etc) are first one wins. Caption is last one wins. The legacy parser and Parsoid agree on that. However, for any other media option, Parsoid is first one wins and the legacy parser is last one wins. This is where the discrepancy with upright originates.

https://github.com/wikimedia/mediawiki/blob/master/includes/parser/Parser.php#L5353-L5442

If both |upright| and |upright=number| are present, the first |upright| will be ignored. See task T216003 for details.

This could be clearer in the documentation. It could be confused that somehow specifying the =number will take precedent, when it's really just that the last upright definition is the one that's used.

If there is a space character between "upright" and the equals sign (e.g. |upright =number|), or if the "upright" value contains non-numeric characters, the upright value will be ignored. See task T216003 for details.

This is the same pattern matching confusion with |alt= in T216003#7836205. I agree that it's unfortunate that this doesn't have the same leniency as in template arguments.

{
    "name": "img_upright",
    "aliases": [
        "upright",
        "upright=$1",
        "upright $1"
    ],
    "case-sensitive": ""
},

Caption is last one wins, but also remember that "anything not recognized as a valid option is a caption". That behavior actually makes the syntax very hard to extend & very hard to deprecate no-longer-acception options without creating a bunch of inadvertent captions.

I think consistency is a virtue here, and it seems like the legacy parser is "most consistent" w/r/t a "last one wins" policy, across a number of different attribute types. So I'd support changing Parsoid to match that -- but I'd hope that we're also linting or tracking-categorying when this happens, because I'd *like* to say that behavior is "undefined" if you have duplicate options.

I think consistency is a virtue here, and it seems like the legacy parser is "most consistent" w/r/t a "last one wins" policy

I filed T305628 to make that change

but I'd hope that we're also linting or tracking-categorying when this happens

We've been linting bogus / duplicate media options for quite some time now

Change 779085 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a5

https://gerrit.wikimedia.org/r/779085

Change 779085 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a5

https://gerrit.wikimedia.org/r/779085

Change 777870 merged by jenkins-bot:

[mediawiki/core@master] Percent decode titles in media link option

https://gerrit.wikimedia.org/r/777870

This is the same pattern matching confusion with |alt= in T216003#7836205. I agree that it's unfortunate that this doesn't have the same leniency as in template arguments.

Filed as T306216

Maybe. It could be that the legacy parser is being inconsistent and it some cases it does percent decoding and, in this, it isn't. I need to investigate it further and see if it's better to just fix that. There is a test I know of with link=< that both parsers handle the same way. So, it isn't all invalid characters.

Addressed by T216003#7855217

Arlolra claimed this task.

I think everything here is now fixed or covered by another task.