Docx files created using LibreOffice are incorrectly detected as zip files
Closed, Resolved · Public

Description

MediaWiki contains custom code to detect the MIME types of files, in the "doGuessMimeType" method of the MimeAnalyzer class. In particular, it has the "detectZipType" method to distinguish zip-based file types such as OpenDocument and OpenXML. To detect an OpenXML file, this method looks for a zip entry whose name matches the /^\[Content_Types\].xml/ regex.

LibreOffice stores this entry last in the zip file, whereas Microsoft Word stores the entry first in the zip file.

mediawiki_docx.png (496×417 px, 40 KB)

This means that the header check at https://github.com/wikimedia/mediawiki/blob/master/includes/libs/mime/MimeAnalyzer.php#L842 does not detect the LibreOffice docx file as OpenXML. The method fails to identify the specific type of the zip file, so the file is treated as a plain zip and MediaWiki refuses it.
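To make the failure mode concrete, here is a minimal Python sketch (not the MediaWiki code). It assumes the header check looks at the bytes right after the 30-byte fixed part of the first local file header, which is where the first entry's name is stored. The helper names `build_zip` and `header_check` are made up for illustration.

```python
import io
import re
import zipfile

# Simplified stand-in for the regex quoted above; the real PHP code differs in detail.
OPENXML_HEADER_RE = re.compile(rb"^\[Content_Types\].xml")

def build_zip(names):
    """Return the bytes of a zip archive whose entries appear in the given order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.writestr(name, "<x/>")
    return buf.getvalue()

def header_check(data):
    """Inspect only the first local file header, as a prefix-based check would."""
    # A local file header has 30 bytes of fixed fields, followed by the entry name.
    return data[:4] == b"PK\x03\x04" and bool(OPENXML_HEADER_RE.match(data[30:]))

# Same entries, different order: Word-like (entry first) vs LibreOffice-like (entry last).
word_like = build_zip(["[Content_Types].xml", "word/document.xml"])
libreoffice_like = build_zip(["word/document.xml", "[Content_Types].xml"])

print(header_check(word_like))         # entry name sits at offset 30
print(header_check(libreoffice_like))  # first entry is word/document.xml instead
```

Both archives contain the same entries; only the position of `[Content_Types].xml` differs, and that is enough to change the result of a check that never looks past the first header.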

Event Timeline

Restricted Application added a subscriber: Aklapper.

In light of the other task T291752, I'll mention that it's possible this built-in check is too broad, which would prevent an "unknown" response that would lead to the external method being used. Having said that, office files are a tricky case because their resemblance to ZIP files isn't coincidental. They are in fact fully valid ZIP files, so this is likely a narrow case where we'll keep the override in place and fix the bug in question instead. If it were a problem with any other type, we'd likely look at making sure the external method is correctly invoked, but that might not be an option in this case.

@DanielsThomas Which version of MediaWiki is this with? Does it happen on the latest MW 1.35 or 1.36 releases?

We just tested it on MediaWiki 1.36.2 and can confirm the bug is still present.

Chatted with @tstarling, who has fixed similar issues in the past (such as for legacy MS Office files). The MimeAnalyzer currently has quite an outdated and crude implementation for ZIP-like files, based on regular expressions.

We already have a more correct and less vulnerable handler (ZipDirectoryReader) which is currently used during upload checks. We should consider using this in MimeAnalyzer as well as the basis for what detectZipType() currently does.
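As a rough illustration of why a directory-based approach is independent of entry order, here is a Python sketch using the standard `zipfile` module. It is not ZipDirectoryReader's API; `make_zip` and `container_kind` are hypothetical helpers, and treating any top-level `mimetype` entry as ODF is a simplifying assumption.

```python
import io
import zipfile

def make_zip(names):
    """Build an in-memory zip with the given entry names, in order (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "<x/>")
    return buf.getvalue()

def container_kind(data):
    """Return 'opc', 'odf', or 'zip' based only on the entry names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # namelist() is populated from the zip central directory,
        # so the physical position of an entry in the file does not matter.
        names = set(zf.namelist())
    if "[Content_Types].xml" in names:
        return "opc"
    if "mimetype" in names:  # assumption: a top-level mimetype entry means ODF
        return "odf"
    return "zip"

# The marker entry is found whether it was written first or last.
print(container_kind(make_zip(["[Content_Types].xml", "word/document.xml"])))
print(container_kind(make_zip(["word/document.xml", "[Content_Types].xml"])))
```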

This sounds interesting. Is there anything we can do to help this progress further?

With ZipDirectoryReader you can tell whether a file is ODF or OPC. To detect subtypes you really need to look at the contents of the files within the zip package, e.g. with ext-zip. But maybe we can back out from that rabbit hole and rely on the file's extension in that case. For OPC we are apparently already relying on the extension, but for ODF we are apparently relying on the contents of the mimetype file being uncompressed and immediately following the file name, which is a bit dodgy.
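The fragility of the ODF header assumption can be shown with a small Python sketch. It assumes a check that expects the name `mimetype` at offset 30 and the uncompressed payload directly after it at offset 38 (no extra field). `build_odf_like` and `odf_header_check` are hypothetical names, and this is not the MediaWiki implementation.

```python
import io
import zipfile

ODF_PREFIX = b"application/vnd.oasis.opendocument."

def build_odf_like(mimetype_compression):
    """Build a tiny ODF-shaped zip, choosing how the mimetype entry is stored."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", ODF_PREFIX + b"text",
                    compress_type=mimetype_compression)
        zf.writestr("content.xml", "<x/>")
    return buf.getvalue()

def odf_header_check(data):
    """Prefix-style check: entry name at offset 30, raw payload right after it."""
    # 30 bytes of fixed header fields, then the 8-byte name "mimetype",
    # then the payload, provided there is no extra field and no compression.
    end = 38 + len(ODF_PREFIX)
    return data[30:38] == b"mimetype" and data[38:end] == ODF_PREFIX

stored = build_odf_like(zipfile.ZIP_STORED)
deflated = build_odf_like(zipfile.ZIP_DEFLATED)

print(odf_header_check(stored))    # payload bytes are readable at offset 38
print(odf_header_check(deflated))  # payload is a deflate stream, so the match fails
```

Both archives are valid zips with identical logical contents, which is why reading the entry through a zip library is more robust than matching bytes at a fixed offset.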

From a security perspective, maybe it is defensible to look at the start of the file and to require something specific to be there, because there could be malicious magic numbers which will activate unexpected applications.

Change 788451 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Call ZipDirectoryReader from MimeAnalyzer

https://gerrit.wikimedia.org/r/788451

Change 788451 merged by jenkins-bot:

[mediawiki/core@master] Call ZipDirectoryReader from MimeAnalyzer

https://gerrit.wikimedia.org/r/788451

Krinkle assigned this task to tstarling.