Page MenuHomePhabricator

MediaWiki 1.36/wmf.13 needlessly HTML encodes ASCII characters in DjVu text layer
Closed, ResolvedPublic

Description

Since the deployment of MediaWiki 1.36/wmf.13, text extracted from the text layer of DjVu files is subjected to HTML entity encoding for even some basic ASCII characters. This shows up to users as   (space), ' (single quote), and 
 (newline) in the ProofreadPage page editing interface in the Page: namespace. example page

There should generally be no HTML entity encoding visible here: the displayed text in the WikiEditor should be very nearly the raw content of the file's text layer.

So far as I know the provenance here is that when a file is uploaded its text layer is extracted by MediaWiki-DjVu and stuffed into some field of the image metadata in the database, lightly wrapped in an XML structure (see T192866). When a user opens a (redlink) wikipage in the Page: namespace, ProofreadPage extracts that field, removes the XML, and nukes a few select character codes (CR, LF, VS, PB: control characters in this context, not obviously relevant to this issue. See T230415.), before preloading the text into the WikiEditor text field.

In one of these steps it seems like a change has introduced HTML entity encoding that was not there before.

The obvious suspects here are MediaWiki-DjVu, ProofreadPage, and WikiEditor; with greatest probability being MediaWiki-DjVu. I'm not aware of changes to either of these in this release, and I've found nothing in the release notes. I know some work has been done on using Remex for character entity reference support and validation in MW over that last two-ish years, but I'm not aware of any recent such changes (and, again, nothing in the release notes). No clue who owns remex and its uses. Parser? VE?

(CC to @Soda who has been working on PRP recently and might know off-hand whether there has been any relevant changes to the code there)

Event Timeline

Xover created this task.Thu, Oct 15, 7:37 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptThu, Oct 15, 7:37 AM
Mpaa added a subscriber: Mpaa.Thu, Oct 15, 10:02 AM

Looking at the whole metadata via API, they are fine (search "name": 307) in https://en.wikisource.org/w/api.php?action=query&titles=File:The%20Victoria%20History%20of%20the%20County%20of%20Lincoln%20Volume%202.pdf&prop=imageinfo&iiprop=metadata

Instead, when fetched as a single page (https://en.wikisource.org//w/api.php?titles=Page:The+Victoria+History+of+the+County+of+Lincoln+Volume+2.pdf/308&inprop=preload&prop=info&action=query) and transformed in "contentmodel": "proofread-page" they are corrupted.

Problem is visible also at API level, WikiEditor seems unlikely, I guess.

Hmm. If one API endpoint returns unencoded text, then lower-level components seem unlikely culprits. If one API returns encoded text, then higher-level components seem unlikely. IOW: those results suggest to me that this is happening at the API layer.

There've been some pushes to sanitize and clean up the various API endpoints and their outputs recently (fsvo "recent"). Could this be a side-effect of an attempt to "fix" API output that's also used in a different context?

Change 634105 had a related patch set uploaded (by Tpt; owner: Tpt):
[mediawiki/extensions/ProofreadPage@master] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634105

Tpt added a comment.Thu, Oct 15, 2:37 PM

Probably caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/628753

This seems to be indeed the problem cause. I have submitted a revert for review: https://gerrit.wikimedia.org/r/634105

I'm happy if there is another way to fix this. But note that it's not as easy as the commit message in https://gerrit.wikimedia.org/r/634105 makes it sound like. T263371 was about an actual XSS attack vector. I might be wrong, but I believe this means we should apply an alternative fix first, before reverting the previous one.

Soda added a comment.EditedThu, Oct 15, 4:17 PM

I'm happy if there is another way to fix this. But note that it's not as easy as the commit message in https://gerrit.wikimedia.org/r/634105 makes it sound like. T263371 was about an actual XSS attack vector. I might be wrong, but I believe this means we should apply an alternative fix first, before reverting the previous one.

Would just sanitizing HTML entities work? That way we could pass the text through Sanitizer::removeHTMLtags() which would offer some amount of protection against XSS. wfEscapeWikitext() outright sanitizes all single quotes('), double quotes(") and double newlines (\n\n or \r\n or \r\r) which does make proofreading quite a bit difficult.

Tpt added a comment.Thu, Oct 15, 4:19 PM

@thiemowmde T263371 is indeed an XSS attack vector if you output directly the file content in HTML. But ProofreadPage only uses the file content to prefill a Wikitext content area when a Page: page is created. So, I guess there are no extra threat here compared with just allowing anyone to edit the wiki.

… then I probably still don't get how the scary output shown in T263371 was created. If this can not happen in production, why is there a ticket?

Xover added a comment.Thu, Oct 15, 5:17 PM

Not having access to T263371 it's hard to say anything intelligent about the specific issue, but…

…untrusted input should ideally be sanitized at the point it first hits the system. For PDF/DjVu text layers this would be the point when it is first extracted and stuffed into iiprop (from where PRP subsequently gets it). Otherwise we'll have untrusted data sitting in a metadata field in the database, and every single bit of code, current or future, that hits it will have to worry about untainting it. And data you get from your own DB is very easy to assume is already sanitized (unless you introduce a concept of "tainted storage" like Perl's taint mode); it tends to "leak" into actually sanitized data when you least expect it.

Also, Commons must have dealt with this before for EXIF, so maybe there are useful examples lurking around somewhere to use as a template. Granted Commons can just nerf the data with prejudice, but the approach could still be transferable.

Tpt added a comment.Thu, Oct 15, 6:57 PM

… then I probably still don't get how the scary output shown in T263371 was created. If this can not happen in production, why is there a ticket?

T263371 is about a proprietary extension that directly outputs the PDF/DjVu text layer in HTML. Not about ProofradPage text layer prelod usage.

This problem is particularly annoying in the French Wikisource because the HTML entities are not even valid (an extra space is added before the semi-colon, maybe because of French typography rules), which makes necessary to fix all of these entities by hand in every page (example).

Apparently the HTML entities are fixed automatically in the English Wikisource (when I try in this book). ~~~~

Change 634416 had a related patch set uploaded (by Reedy; owner: Tpt):
[mediawiki/extensions/ProofreadPage@wmf/1.36.0-wmf.13] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634416

Reedy added a subscriber: Reedy.Fri, Oct 16, 10:28 AM

… then I probably still don't get how the scary output shown in T263371 was created. If this can not happen in production, why is there a ticket?

Third party wiki (and extension) using the functionality but not escaping it themselves, seemingly

Change 634105 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634105

Change 634416 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@wmf/1.36.0-wmf.13] Revert "Fix escaping in PageContentBuilder::buildDefaultContentForPageTitle"

https://gerrit.wikimedia.org/r/634416

Mentioned in SAL (#wikimedia-operations) [2020-10-16T10:35:41Z] <reedy@deploy1001> Synchronized php-1.36.0-wmf.13/extensions/ProofreadPage/: Revert excessive escaping T265571 (duration: 01m 12s)

Apparently the HTML entities are fixed automatically in the English Wikisource (when I try in this book). ~~~~

English Wikisource deployed a default Gadget that simply does the reverse of the original patch earlier today (it replaces character entity references with literal characters). But judging by the logs and my quick testing this appears to have now been rolled back so the Gadget is no longer needed.

Seudo added a comment.Fri, Oct 16, 1:47 PM

English Wikisource deployed a default Gadget that simply does the reverse of the original patch earlier today (it replaces character entity references with literal characters). But judging by the logs and my quick testing this appears to have now been rolled back so the Gadget is no longer needed.

Yes, I confim it appears to be fixed in the French Wikisource too.

matmarex closed this task as Resolved.Fri, Oct 16, 8:44 PM