Page MenuHomePhabricator

Thumbnail generator should reject invalid "lang" parameters
Open, LowPublic

Description

To reduce variations and cache pollution, the lang parameter for SVG thumbnails, should reject invalid language codes.

E.g. https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Example.svg/langfoobar-135px-Example.svg.png

should probably respond with HTTP 4xx.

And to avoid breakage for existing content and if language codes change/removed, the parser should normalise the lang-parameter passed to [[File:]] so that lang=foobar results in the default (no lang override) instead of a 404 error.

Event Timeline

Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle added subscribers: Krinkle, brion.

In the original implementation, it didn't do this as the code that detects what languages are valid in a file fails to recognize some valid languages in a few edge cases (If i remember correctly)

How about redirecting instead? I.e. if you request an invalid language, then it just redirects to the same thumbnail without a lang parameter.

Reading the code, it seems like the explanation is that systemLanguage attributes can appear anywhere in the file, and to avoid OOMs, the SVG metadata parser only parses up to a certain point ($wgSVGMetadataCutoff). It's basically betting that if a language is going to show up in the SVG, it's going to show up early. But there is the distinct possibility that a given language would only appear past the cutoff point.

It seems like from PHP's point of view, we can't assess with certainty that the requested lang parameter isn't one that appears in a systemLanguage attribute somewhere in the SVG file. Solving this issue requires reconsidering the way we parse SVGs to get that information.

I'm a bit surprised by the OOM justification for the cutoff, since XMLReader is a streaming XML reader. It shouldn't matter what the file size is for such a reader.

The history for the cutoff: https://static-bugzilla.wikimedia.org/bug27508.html

It's about execution time, not OOM. Makes a lot more sense.

It seems to me like the real flaw here is that the metadata is extracted a bunch of times (at the very least every time you request a new thumbnail size), instead of once per original and stored somewhere. Fixing that is probably a big undertaking, though. A big effort for a very minimal gain, especially since we don't know the extent of the waste caused by allowing any lang parameter. Maybe it's not that big of a problem in practice.

Normally, metadata is stored in img_metadata as a serialized PHP array and it's only updated when a new version is uploaded or the file is purged. Does that work differently for SVG?

Normally, metadata is stored in img_metadata as a serialized PHP array and it's only updated when a new version is uploaded or the file is purged. Does that work differently for SVG?

metadata is only extracted once (Or after cache gets cleared, which happens on some events).

Arguably the issue is that the metadata cache gets initiated the first time its needed, which will either be on an image scalar or during a web request, as opposed to at some time not dependent on the request.

Keep in mind, this is commons, someone will probably upload a 1 gb svg file at some point. I think its important to have some cut off for sanity.