Page MenuHomePhabricator

Image upload fail with error bad-zip due to being misdetected as a broken zip irrespective of mimetype
Open, MediumPublic

Details

Reference
bz31930

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:53 PM
bzimport set Reference to bz31930.
bzimport added a subscriber: Unknown Object (MLST).

Well, it does trip signature checks in other zip tools:

$ unzip -vb A.jpg
Archive: A.jpg
??ڱuXâ??ps?E;*w=V?>??^??=Hڤ濇`F?Y]f?3?ہ?q???}ʢp???["?G?%?????#?:???(3????8???%>x~:?Xދ???-????!?{q????z??S3IH???ҘGh??A?"9?TH?U??g????&???_?x??????K'Q$-<q:?W?I9??R?h?)??b8?i?;?T??©\?OnK?N????l???U<ܬ????
?'ڮ?.ā1?d????
caution: zipfile comment truncated
warning [A.jpg]: zipfile claims to be last disk of a multi-part archive;

attempting to process anyway, assuming all parts have been concatenated
together in order.  Expect "errors" and warnings...true multi-part support
doesn't exist yet (coming soon).

error [A.jpg]: missing 2744938539 bytes in zipfile

(attempting to process anyway)

error [A.jpg]: attempt to seek before beginning of zipfile

(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)

but it's pretty clearly not a real zip file, as you get some pretty broken results trying to unzip it. :)

So these bugs are cause by the 'JAR detection' of bug 24230 / r82783

Retested on https://test.wikipedia.org/ , the attachment causes the following error:
"The file is a corrupt or otherwise unreadable ZIP file. It cannot be properly checked for security."

$ file A.jpg
A.jpg: JPEG image data, JFIF standard 1.02

The same is happening with a GIF on bug 54105

The error code for that error is 'zip-bad', and it is only raised in includes/utils/ZipDirectoryReader.php, and ZipDirectoryReader is only used in includes/upload/UploadBase.php, and only if $wgAllowJavaUploads is disabled, which it is by default.

https://www.mediawiki.org/wiki/Manual:$wgAllowJavaUploads

(That variable should be renamed $wgCheckAllUploadsForJava , and it would be really handy if someone documented the checks performed in a technical manner like 'It looks for the byte sequence "50 4B 05 06"', so the average sysadmin can know that they have found the source of the problem with a particular upload.)

My concern is that if the ZIP file is unreadable, why does it need to be checked for security? My guess is that this check assumes the presence of the jifar bug, and other unidentified bugs in the JVMs ZIP reader, which combined allow an unreadable ZIP to be loaded as a JAR. Paranoid much? ;-) There are quite a few JAR reading bugs across all JVMs, so it doesnt hurt to be a bit paranoid, but maybe the detector can be improved a little to determine that some of these false positives are not actually going to be successfully executed by any JVM.

There are also other image reading bugs in various browsers which allow arbitrary execution of code; are we checking all images for those bugs? (I couldnt see code for that in UploadBase) If we increase the number of file types supported, the number of potential problems increases.

Another approach is to flag these bad-zip uploads (and other uploads which trip bugs in some browsers) as not readily able to be used in an IMG/etc tag, place them in a special maintenance category (and display a warning), and only serve them as a MIME-encoded download from a host that is not *.wikimedia.org. That way the uploads happen, and the problem becomes visible to the community, who can track these uploads and improvements to the Java detector code are more likely to occur. (the flag can be reset on re-upload after the algorithm has been improved)

It would also be nice for this error, and other security checks like script checks, to be a warning only based on a config variable. In wikis without unrestricted uploaders, or an environment where all clients run a fully patched SOE, checks like these are nice warnings, but shouldnt be unsolvable errors (i.e. without turning off the security check).

I just run into this problem when I today tried to upload this JPEG file. Despite being a correct JPEG file it appears to be considered as a corrupted part of a multi-volume ZIP archive by the unzip utility of info-zip.org:

dublin$ unzip -l 20130917-IMG_5344-pc.jpg
Archive:  20130917-IMG_5344-pc.jpg
¥ÔÝ©¤'
caution:  zipfile comment truncated
warning [20130917-IMG_5344-pc.jpg]:  zipfile claims to be last disk of a multi-part archive;
  attempting to process anyway, assuming all parts have been concatenated
  together in order.  Expect "errors" and warnings...true multi-part support
  doesn't exist yet (coming soon).
error [20130917-IMG_5344-pc.jpg]:  missing 7227269008 bytes in zipfile
  (attempting to process anyway)
error [20130917-IMG_5344-pc.jpg]:  attempt to seek before beginning of zipfile
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

My workaround was to recreate the JPEG file from Gimp and tweaking one of the JPEG options. The error messages by unzip where then different:

dublin$ unzip -l 20130917-IMG_5344-pc.jpg
Archive:  20130917-IMG_5344-pc.jpg
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of 20130917-IMG_5344-pc.jpg or
        20130917-IMG_5344-pc.jpg.zip, and cannot find 20130917-IMG_5344-pc.jpg.ZIP, period.
dublin$

This modified file could be uploaded without any errors.

I wondered how it can be that a JPEG file is mistaken as ZIP archive when the magic numbers at the begin of the file allow to distinguish them easily. Interestingly, unzip does not check any initial headers but instead attempts immediately to locate the directory at the end of the archive:

stat64("20130917-IMG_5344-pc.jpg", 0x000495D0)  = 0
open64("20130917-IMG_5344-pc.jpg", O_RDONLY)    = 4
mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1,
 0) = 0xFF380000
ioctl(1, TIOCGWINSZ, 0xFFBFF508)                Err#22 EINVAL
Archive:  20130917-IMG_5344-pc.jpg
write(1, " A r c h i v e :     2 0".., 35)      = 35
llseek(4, 4308992, SEEK_SET)                    = 4308992
read(4, " LD0D2E1 t 7 V05 ^ _8710".., 5829)     = 5829
llseek(4, 4300800, SEEK_SET)                    = 4300800
read(4, "\r ?8E82 h E ` J EC0DB9C".., 8192)     = 8192
read(4, " LD0D2E1 t 7 V05 ^ _8710".., 8192)     = 5829
read(4, 0x00049E68, 8192)                       = 0

In consequence, whenever the end of a file possibly looks like a ZIP directory structure, we have a problem. It is perhaps not very likely but it happens. It would probably be best to check a file for being indeed a ZIP or JAR using the file utility before proceeding to unzip it.

In consequence, whenever the end of a file possibly looks like a ZIP directory structure, we have a problem. It is perhaps not very likely but it happens. It would probably be best to check a file for being indeed a ZIP or JAR using the file utility before proceeding to unzip it.

$ file check for the magic numbers at the start of the file before checking those at the end of the file. A file can be both a JPG and ZIP, as explained on Encyclopedia Dramatica (the site is full of ads though), and for those, $ file find it's a JPG rather than both JPG and ZIP.

Apparently $ file does not see the magic ZIP numbers at all:

$ tail -c +100 20130917-IMG_5344-pc.jpg > 20130917-IMG_5344-pc.tailed
$ file 20130917-IMG_5344-pc.tailed 
20130917-IMG_5344-pc.tailed: data
$ file 20130917-IMG_5344-pc.tailed -i
20130917-IMG_5344-pc.tailed: application/octet-stream; charset=binary
$ ls -l 20130917-IMG_5344-pc.*
-rw-r--r-- 1 zhuyifei1999 wikidev 4314821 Feb 27 09:13 20130917-IMG_5344-pc.jpg
-rw-r--r-- 1 zhuyifei1999 wikidev 4314722 Feb 28 03:40 20130917-IMG_5344-pc.tailed

file looks at the beginning of a file. Example:

dublin$ ls
dublin$ echo Hi >file1
dublin$ echo Hello >file2
dublin$ zip archive file?
  adding: file1 (stored 0%)
  adding: file2 (stored 0%)
dublin$ file archive.zip
archive.zip: Zip archive data, at least v1.0 to extract
dublin$

The corresponding rules are to be found in /usr/share/file/magic/archive:

# ZIP archives (Greg Roelofs, c/o zip-bugs@wkuvx1.wku.edu)
0       string          PK\003\004
>30     ubelong         !0x6d696d65
>>4     byte            0x00            Zip archive data
!:mime  application/zip
>>4     byte            0x09            Zip archive data, at least v0.9 to extract
!:mime  application/zip
>>4     byte            0x0a            Zip archive data, at least v1.0 to extract
!:mime  application/zip
>>4     byte            0x0b            Zip archive data, at least v1.1 to extract
!:mime  application/zip
>>0x161 string          WINZIP          Zip archive data, WinZIP self-extracting
!:mime  application/zip
>>4     byte            0x14            Zip archive data, at least v2.0 to extract
!:mime  application/zip

In other words, ZIP file can be easily detected as such by examining the first four byte "PK\003\004". And a version number of the ZIP format is at byte position 4 and another magic number is to be found at position 30.