Refuse uploading JPEG files with extra junk at the end.
Open, Medium, Public, Feature

Assigned To

None

Authored By

	Rillke
	Apr 5 2013, 10:14 AM

Description

Original title: Refuse uploading files that contain huge data of other file types, especially if this data is encrypted

Version: 1.22.0
Severity: enhancement

Details

Reference: bz46921

Related Objects

		Status	Subtype	Assigned	Task
		Open	Feature	None	T48921 Refuse uploading JPEG files with extra junk at the end.
		Resolved		zhuyifei1999	T12847 Detect RAR concatenation in jpeg images

Event Timeline

• bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:15 AM

• bzimport added a project: MediaWiki-Uploading.

• bzimport set Reference to bz46921.

• bzimport added a subscriber: Unknown Object (MLST).

Rillke created this task.Apr 5 2013, 10:14 AM

Wondering at which exact state this refusal would take place.
This request might be "UploadWizard" (or "Special:Upload") territory instead of "File management".

Probably the same stage we do other file type checks. (On the backend after the upload)

Given that these files have been deleted, could an example be attached to this bug so we can see what the file actually looks like?

I'm given to understand these were valid JPEGs with extra junk in metadata segments? I'm not sure we would be able to strip that without worrying about damaging real metadata.

If the images just had extra data embedded into the image data using steganography, it would be pretty difficult to detect in general.

Created attachment 12039
sample: https://commons.wikimedia.org/w/index.php?title=File:Fresh_Relic_4531.JPG

My computer just crashed (step by step human input devices stopped working) after viewing one of them so I really hope they do not contain evil code.

Attached:

File.Fresh_Relic_4531.JPG (492×489 px, 9 MB)

Created attachment 12040
sample: https://commons.wikimedia.org/w/index.php?title=File:PencilCherryTable.JPG

Attached:

FilePencilCherryTable.JPG (426×640 px, 4 MB)

(In reply to comment #3)

Created attachment 12039 [details]
sample:
https://commons.wikimedia.org/w/index.php?title=File:Fresh_Relic_4531.JPG

My computer just crashed (step by step human input devices stopped working)
after viewing one of them so I really hope they do not contain evil code.

This one seems to contain a password-protected file. Opening it with 7z prompts for a password. The second one (12040) should also contain something, although 7z was unable to detect an archive there. As one can see, both images are displayed quickly (while still downloading), and then the browser keeps downloading data even though the image is already displayed.

As I've read, it's extremely easy to add any file inside a jpeg and yet have an absolutely valid image that displays perfectly. It can be done by just concatenating the contents of a file to an existing jpeg image.

Hmm, if it's just stuff concatenated at the end, it would probably be possible to detect (look for the \xFF\xD9 marker and see if anything comes after it). [From a security-paranoia standpoint, doing this would probably not be a bad idea. GIFAR and all.]

Looking at these files, they are indeed just stuff stuffed at the end.

For attachment 12039:

00011d40 e6 93 34 a7 ad 25 0b 61 85 14 51 4c 0f ff d9 37 |..4..%.a..QL...7|
00011d50 7a bc af 27 1c 00 03 d8 f3 90 3d 40 84 9c 00 00 |z..'......=@....|

Note that ff d9 denotes end of image (EOI). The bytes after it, 37 7a bc af 27 1c, are the magic number of a 7z archive.

For the second image (attachment 12040) we have:

0000dc80 dd cf a1 f5 a6 9e b4 87 a9 a1 6b a8 92 3f ff d9 |..........k..?..|
0000dc90 43 d6 cd 64 8a dc f7 24 57 18 a8 2f e3 dd 38 34 |C..d...$W../..84|

This doesn't have any magic numbers that I could see. However, it definitely doesn't appear to be JPEG data, as later on there are ff sequences that aren't escaped. Maybe it's the second part of some file split up over multiple JPEGs, or maybe it's encrypted, or something else.
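
For reference, the check described above (find the EOI marker, see what follows it, compare against known magic numbers) could be sketched roughly like this. This is only an illustration, not MediaWiki code: the function names and the short signature table are made up for the example, and it walks the marker segments by their length fields so that an ff d9 inside an embedded EXIF thumbnail is not mistaken for the real end of image. Files that never reach an EOI marker are reported as None rather than analysed.

```python
from typing import Optional, Tuple

# Illustrative signature table (hypothetical name); not exhaustive.
ARCHIVE_MAGIC = {
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"Rar!\x1a\x07": "rar",
    b"PK\x03\x04": "zip",
}


def find_eoi_end(data: bytes) -> Optional[int]:
    """Return the offset just past the EOI marker, or None if none is reached."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with SOI
        return None
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            # Inside entropy-coded scan data: jump to the next 0xFF byte.
            nxt = data.find(b"\xff", pos)
            if nxt == -1:
                return None
            pos = nxt
            continue
        marker = data[pos + 1]
        if marker == 0xD9:  # EOI
            return pos + 2
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker in (0x00, 0x01) or 0xD0 <= marker <= 0xD7:
            # Stuffed ff 00, TEM, or restart marker: no length field.
            pos += 2
            continue
        if pos + 4 > len(data):
            return None
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if length < 2:
            return None
        pos += 2 + length  # skip the whole segment, payload included
    return None


def inspect_trailing(data: bytes) -> Optional[Tuple[int, Optional[str]]]:
    """Return (trailing byte count, archive type or None); None if no EOI."""
    end = find_eoi_end(data)
    if end is None:
        return None
    trailer = data[end:]
    for magic, name in ARCHIVE_MAGIC.items():
        if trailer.startswith(magic):
            return len(trailer), name
    return len(trailer), None
```

A non-zero trailing count with no recognised signature (as with the second attachment) would still need a policy decision, since some legitimate JPEGs may carry data after EOI.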

• Gilles added a project: Multimedia.Nov 24 2014, 3:42 PM

Nemo_bis subscribed.Mar 7 2015, 10:03 PM

Nemo_bis mentioned this in T59806: Automatic analysis of Commons images on accessibility for color-blind.Mar 7 2015, 10:11 PM

Rillke mentioned this in T93679: Uploading and Displaying 3D files with Fallback.Mar 25 2015, 11:39 PM

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board.Sep 4 2015, 6:32 PM

Restricted Application added subscribers: Steinsplitter, Matanya.Sep 4 2015, 6:32 PM

Dispenser subscribed.Oct 27 2015, 1:42 AM

FYI, https://commons.wikimedia.org/wiki/Commons:Administrators%27_noticeboard#Influx_of_files_with_embedded_data_.28CSD.23F9.29 seems to indicate a need for more urgent attention.

Ninjastrikers subscribed.Nov 28 2016, 6:26 PM

Srittau subscribed.Nov 28 2016, 6:28 PM

I don't think we can detect this with 100% accuracy (we'd essentially have to write a JPEG decoder, then use it to decode the file, and see if there's anything left over), but we could probably reject the uploads based on some crude heuristic (e.g. if the file is larger than an uncompressed image of the same dimensions would be, something is clearly fishy). But I'm afraid this will end up in an arms race (if we plug this for JPG files, these folks will just switch to another file format that is more difficult to evaluate).
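
To make the crude heuristic concrete, here is a rough sketch, assuming the dimensions are read from the first SOF segment and that "uncompressed size" means width × height × number of components at one byte per sample. The function names are invented for the example, and the parser only handles the common case where the header is a plain run of length-prefixed segments.

```python
import struct
from typing import Optional, Tuple

# Start-of-frame markers carry the frame dimensions.
# 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC) are not frame headers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_dimensions(data: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (width, height, components) from the first SOF segment."""
    if data[:2] != b"\xff\xd8":
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 10 > len(data):
                return None
            # precision (1 byte), height (2), width (2), component count (1)
            _, height, width, components = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, components
        if marker == 0xDA or length < 2:  # scan started without a frame header
            return None
        pos += 2 + length
    return None


def exceeds_uncompressed_size(file_size: int, width: int, height: int,
                              components: int = 3) -> bool:
    """True if the file is bigger than the raw pixel data would be."""
    return file_size > width * height * components
```

For the first sample in this task (492×489 px, 9 MB), the raw pixel data would be about 0.7 MB at three components, so it would be flagged. Files with very large legitimate metadata could in principle trip this as well.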

Dispenser added a subtask: T12847: Detect RAR concatenation in jpeg images.Nov 28 2016, 7:24 PM

Run something like jpegtran -o -copy none on it and discard it if the size reduction is significant? (Although jpegtran is designed to be lossless so probably there are better choices.)

Unless jpegtran is smarter than it should be, that won't help for files without the end-of-image marker (where essentially the extra junk data is part of the image scan data).

Qgil unsubscribed.Nov 28 2016, 9:44 PM

Reedy mentioned this in T151821: Maintenance script to test already uploaded files.Nov 28 2016, 9:46 PM

I would suggest that this is a case where the perfect is the enemy of the good. It would be impossible to defend against a knowledgeable adversary using good steganographic techniques, who is trying to upload an extra ~1% payload. I think it should be relatively easy to prevent ~800MB .iso images and movies, etc. from piggybacking on pretty much any file. On the other hand, it has been pointed out that we can catch a lot of this stuff with AbuseFilters.

Neat, I actually forgot that abusefilter can do this now. For future reference: https://commons.wikimedia.org/wiki/Special:AbuseFilter/160. We'll have to be careful tuning this to avoid blocking legitimate uploads, though. The current rule is a bit more rigorous than I would've recommended, but probably fine. I wonder if this means we can close this task?

On the other hand, it does nothing against someone uploading a suitably big JPG file that has no image data, but a whole movie tacked on at the end… I feel like T12847: Detect RAR concatenation in jpeg images would be a better way to discourage this. Embedding RAR files is popular because they can be extracted by just renaming the file to .rar, without having to edit binary files (or use dedicated software).

@matmarex are there any samples available?

I downloaded the first deleted file from the link in the summary:

$ cat WriterWavePlot.JPG | wc -c
24608472
$ jpegtran -o -copy none WriterWavePlot.JPG | wc -c
30888
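
A rough Python wrapper around the same comparison, in case someone wants to script it. It assumes jpegtran is on the PATH and writes its result to stdout when no output file is given (as in the commands above); the function names and the 50% threshold are placeholders, not a tested recommendation.

```python
import os
import subprocess


def optimized_size(path: str, jpegtran: str = "jpegtran") -> int:
    """Size in bytes of jpegtran's optimized, metadata-free output."""
    result = subprocess.run(
        [jpegtran, "-optimize", "-copy", "none", path],
        check=True,
        capture_output=True,
    )
    return len(result.stdout)


def reduction_ratio(original_size: int, new_size: int) -> float:
    """Fraction of the original size that disappeared (0.0 to 1.0)."""
    return 1.0 - new_size / original_size


def is_suspicious(original_size: int, new_size: int,
                  threshold: float = 0.5) -> bool:
    """Flag the file if the size dropped by more than `threshold`.

    The default threshold is an arbitrary placeholder.
    """
    return reduction_ratio(original_size, new_size) > threshold


def check_file(path: str) -> bool:
    """Run jpegtran on `path` and apply the size comparison."""
    return is_suspicious(os.path.getsize(path), optimized_size(path))
```

With the numbers above (24608472 bytes before, 30888 after), the reduction is well over 99%, so any sane threshold would catch it.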

Or you can just reuse the existing thumbnailing system to generate a thumbnail that's 1px smaller, and see if there is more than say 10% size difference.

In T48921#2829328, @Tgr wrote:

@matmarex are there any samples available?

I don't know if we had any uploaded, but you just need to take any JPG file, remove the last two bytes, and append garbage. I could create and upload an example.

In T48921#2829346, @Tgr wrote:

Or you can just reuse the existing thumbnailing system to generate a thumbnail that's 1px smaller, and see if there is more than say 10% size difference.

Thumbnailing strips all metadata, and legitimate metadata can be pretty large (e.g. https://commons.wikimedia.org/wiki/File:Profilfoto_FB.jpg, some more examples can be found in https://commons.wikimedia.org/wiki/User:Dispenser/Absurd_overhead).

In T48921#2829328, @Tgr wrote:

@matmarex are there any samples available?

See T48921#484603

zhuyifei1999 subscribed.Nov 29 2016, 2:42 AM

Fae subscribed.Dec 2 2016, 4:14 PM

Poyekhali subscribed.Dec 16 2016, 11:33 AM

NahidSultan subscribed.Dec 23 2016, 9:18 AM

FYI: I have coded (or I am coding) a bot to automatically detect such files. However, JPGs are weird in that many JPG files seem to contain useless extra junk, and I have yet to understand what this extra junk actually contains and whether it is legitimate.

zhuyifei1999 closed subtask T12847: Detect RAR concatenation in jpeg images as Resolved.Apr 22 2017, 7:26 AM

MarkTraceur lowered the priority of this task from High to Medium.Jun 5 2017, 3:10 PM

Nemo_bis mentioned this in T167400: Disable serving unpatrolled new files to Wikipedia Zero users.Jun 10 2017, 9:03 PM

• Tbayer subscribed.Jun 30 2017, 12:34 AM

I think we can move on this by adding a warning for files where 2/3 of the file is metadata, possibly only for files above a certain size threshold (500 KB or so may do the trick).
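
A minimal sketch of that rule, assuming "metadata" means the APPn and COM segments that appear before the first scan, and reading "500kb" as 500 × 1024 bytes. Both function names are hypothetical, and this is not the existing MediaWiki metadata code.

```python
def metadata_bytes(data: bytes) -> int:
    """Total size of APPn and COM segments before the first scan."""
    if data[:2] != b"\xff\xd8":
        return 0
    total = 0
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte
            pos += 1
            continue
        if marker == 0xDA:  # start of scan: header is over
            break
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if length < 2:
            break
        if 0xE0 <= marker <= 0xEF or marker == 0xFE:
            total += 2 + length  # marker bytes plus segment body
        pos += 2 + length
    return total


def should_warn(file_size: int, meta_size: int,
                min_size: int = 500 * 1024,
                max_fraction: float = 2 / 3) -> bool:
    """Warn only for files above min_size that are mostly metadata."""
    return file_size > min_size and meta_size > max_fraction * file_size
```

Since a single JPEG segment is capped at 65535 bytes, large metadata is spread over many segments, which is why the sizes are summed.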

Tgr updated the task description.Sep 11 2017, 11:04 PM

Note that "metadata" and "extra junk at the end" are different things. The first is about using fields provided by the file type spec to store arbitrary data; we already have code for detecting most of these (since we want to index metadata and whatnot) so we just need to measure it and decide what's a reasonable size limit. (See also T170251.) The second is about violating the file type spec in ways that are ignored by most tools (e.g. the file is supposed to consist of width then height then width x height bytes of pixel color; adding more bytes at the end will be ignored by the viewer but data put there can be recovered by a custom-made tool, or just splitting the file). Bawolff shared a real example for JPEG in T48921#484617. These are probably going to be harder to detect (but could be easily handled by something like T67383).
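
To illustrate the second case with the toy format from the paragraph above: the field widths here (two-byte big-endian width and height, one byte per pixel) are assumptions made up for the example, as is the function name. A viewer would read exactly the declared number of bytes and stop, so anything past that point goes unnoticed unless someone checks for it.

```python
import struct


def trailing_bytes(data: bytes) -> int:
    """Count bytes beyond what the toy header says the image needs."""
    if len(data) < 4:
        raise ValueError("truncated header")
    width, height = struct.unpack(">HH", data[:4])
    expected = 4 + width * height  # header plus one byte per pixel
    if len(data) < expected:
        raise ValueError("truncated pixel data")
    return len(data) - expected
```

For example, a 3×2 image is 10 bytes long in this format; appending b"Rar!" leaves the picture unchanged for a viewer, but trailing_bytes reports 4.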

• Ramsey-WMF moved this task from Next up to Triaged on the Multimedia board.Mar 8 2019, 2:30 AM

DannyS712 subscribed.Apr 27 2020, 3:26 AM

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 12:24 PM

Aklapper removed subscribers: • Tbayer, • wikibugs-l-list.

Stang subscribed.Apr 11 2022, 9:33 PM
