Some Microsoft Powerpoint files not detected by MimeAnalyzer.php due to 'ppt/presentation.xml' at 30th byte
Open, Low, Public

Description

MediaWiki version 1.35

The file MimeAnalyzer.php detects Microsoft Office documents by checking for the string [Content_Types].xml at the 30th byte of the file.
However, some of my PowerPoint files (.pptx) have the string ppt/presentation.xml at that location instead, so they are not detected.

Proposed fix on line 825:
$openxmlRegex = "/^\[Content_Types\].xml|^ppt\/presentation.xml/";
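For illustration, here is a minimal sketch of why the string shows up at that position. In a ZIP local file header, the name of the first entry starts at offset 30, after the fixed-size fields. The helpers firstZipEntryName() and fakeLocalHeader() are hypothetical and are not part of MimeAnalyzer.php; the headers are synthetic, not real .pptx files; only the regex is taken from the proposal above.

```php
<?php
// Hypothetical helper (not part of MimeAnalyzer.php): returns the name of the
// first entry of a ZIP archive by parsing its local file header.
function firstZipEntryName( string $head ): ?string {
	// A local file header starts with the signature "PK\x03\x04".
	if ( strlen( $head ) < 30 || substr( $head, 0, 4 ) !== "PK\x03\x04" ) {
		return null;
	}
	// Bytes 26-27 hold the file name length as a little-endian 16-bit integer.
	$nameLength = unpack( 'v', substr( $head, 26, 2 ) )[1];
	// The file name begins at offset 30, right after the fixed-size fields.
	return substr( $head, 30, $nameLength );
}

// Hypothetical helper: builds a synthetic local file header for illustration.
// This is not a complete, valid archive.
function fakeLocalHeader( string $name ): string {
	return "PK\x03\x04"                  // signature (offsets 0-3)
		. str_repeat( "\x00", 22 )       // version, flags, method, time, date, CRC, sizes (4-25)
		. pack( 'v', strlen( $name ) )   // file name length (26-27)
		. pack( 'v', 0 )                 // extra field length (28-29)
		. $name;                         // file name (offset 30 onwards)
}

// The regex from the proposed fix, copied unchanged.
$openxmlRegex = "/^\[Content_Types\].xml|^ppt\/presentation.xml/";

foreach ( [ '[Content_Types].xml', 'ppt/presentation.xml', 'mimetype' ] as $name ) {
	$first = firstZipEntryName( fakeLocalHeader( $name ) );
	echo $first, ' => ', preg_match( $openxmlRegex, $first ) ? 'match' : 'no match', "\n";
}
```

With these synthetic headers, the first two names match the proposed regex and mimetype does not. The sketch only inspects the first entry, so it says nothing about archives where some other file happens to be stored first.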

Event Timeline

Aklapper renamed this task from Microsoft Powerpoint file to Some Microsoft Powerpoint files not detected by MimeAnalyzer.php due to 'ppt/presentation.xml' at 30th byte. Dec 24 2020, 9:11 AM
Aklapper added a project: good first task.
Aklapper updated the task description.

Hi @thaing, welcome to Wikimedia Phabricator, and thanks for taking the time to report this and taking a look at the code!

If you feel like proposing a patch, then you are very welcome to use developer access to submit the proposed code changes as a Git branch directly into Gerrit which makes it easier to review and provide feedback. If you don't want to set up Git/Gerrit, you can also use the Gerrit Patch Uploader. Thanks.

Change 651864 had a related patch set uploaded (by Aklapper; owner: Thai Nguyen):
[mediawiki/core@master] Allow some .pptx files that have the string 'ppt/presentation.xml' at the 30th byte in the file to identify as application/x-poc+zip

https://gerrit.wikimedia.org/r/651864

How can we get a review of this trivial one-liner patch from the code stewards, whoever they are? CC'ing Platform Engineering as this is about MW Core.

Before merging that, I would need to understand what is going on in that function and why the special case is necessary. We also need a small test file. The big question is whether the special case is sufficient or whether it only works for a particular document.

@thaing: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!

Change 651864 abandoned by Reedy:

[mediawiki/core@master] Allow some .pptx files that have the string 'ppt/presentation.xml' at the 30th byte in the file to identify as application/x-poc+zip

Reason:

Iff1611c7adda9c0f0ed31593bad6dfffc9c9a086 means this would need reworking quite a bit

https://gerrit.wikimedia.org/r/651864

The patch for T291750 (Docx files created using LibreOffice are incorrectly detected as zip files) made this change no longer apply. It would need some reworking to carry on: the regex no longer exists, and the check is now a direct equality check that sets a type.