Detect RAR concatenation in jpeg images
Closed, Resolved (Public)

Assigned To

zhuyifei1999

Authored By

	• bzimport
	Aug 8 2007, 9:21 PM

Description

Author: lilewyn

Description:
HOW TO: Download the linked file (req. admin access on enwiki), rename to .rar, extract.
PROBLEM: Users using Wikipedia as RapidShare replacement by appending compressed files to legitimate graphics uploaded to our servers.
POSSIBLE SOLUTION: Add code to detect RAR compression appended to valid graphics files and fail the upload.

Version: unspecified
Severity: enhancement
URL: http://en.wikipedia.org/w/index.php?title=Special:Undelete&target=Image%3AStar_Wars_Republic_Commando_Triple_Zero.jpg&file=0a52grdaxrtrnm5wuk90bemab69hsnrs.jpg

Details

Reference: bz10847

Related Objects

		Status	Subtype	Assigned	Task
		Open	Feature	None	T48921 Refuse uploading JPEG files with extra junk at the end.
		Resolved		zhuyifei1999	T12847 Detect RAR concatenation in jpeg images

Event Timeline

• bzimport raised the priority of this task to Low. Nov 21 2014, 9:48 PM

• bzimport added a project: MediaWiki-File-management.

• bzimport set Reference to bz10847.

• bzimport added a subscriber: Unknown Object (MLST).

• bzimport created this task. Aug 8 2007, 9:21 PM

Why look for RAR and not five million other archive formats? What about trivially obfuscated files? Encrypted files? etc.

Alkivar wrote:

(In reply to comment #1)

Why look for RAR and not five million other archive formats? What about
trivially obfuscated files? Encrypted files? etc.

It's simple, really: your average JPG viewer stops reading the file after the end tag, and RAR ignores anything before the RAR header, so JPG and RAR make the perfect combination. A few other archive and image formats could potentially work as well. There are tutorials all over the internet, including the EN WP article on RAR, showing how to do the JPG/RAR combination, though.

Convenient. :)

Greg's putting together a list of files with known issues, we'll have a good test set of this and other formats.

lilewyn wrote:

http://commons.wikimedia.org/wiki/User:Gmaxwell/possiblyevilimages

http://en.wikipedia.org/wiki/User:Gmaxwell/possibly_evil_images

Lists not yet filtered.

Note that Commons uploads are being checked (by a third party) for embedded RARs.

• Gilles added a project: Multimedia. Nov 24 2014, 3:40 PM

Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:05 PM

Restricted Application added subscribers: Steinsplitter, Matanya, Aklapper. Sep 4 2015, 6:05 PM

Dispenser added a parent task: T48921: Refuse uploading JPEG files with extra junk at the end. Nov 28 2016, 7:24 PM

Restricted Application added a project: Commons. Nov 28 2016, 7:24 PM

Dispenser mentioned this in T151794: Install unrar on Tool Labs. Nov 28 2016, 7:44 PM

Reedy mentioned this in T151821: Maintenance script to test already uploaded files. Nov 28 2016, 9:46 PM

We could just search the files for the string "Rar!" (file header for RAR archives). But I'm not sure how often this could just randomly occur in the image data.

(Well, assuming random data and 4 MB photos, it's about 1 in 1000 files, which is unacceptably high. But perhaps JPEG data is not so randomly distributed and the chance is much smaller. With T151821, we could see how often it occurs in our existing files.)
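The 1-in-1000 figure can be reproduced with a little arithmetic. The sketch below assumes uniformly random, independent bytes (which the comment itself doubts holds for JPEG data) and treats every offset as an independent trial, so it is only an approximation. The function name `chance_of_signature` is made up for illustration.

```python
import math


def chance_of_signature(file_size_bytes, signature_len):
    """Approximate chance that a fixed signature of signature_len bytes
    appears at least once in file_size_bytes of uniformly random data."""
    # Number of offsets at which the signature could start.
    positions = max(file_size_bytes - signature_len + 1, 0)
    # Chance of a match at one given offset: (1/256) ** signature_len.
    p_single = 256.0 ** -signature_len
    # 1 - (1 - p) ** positions, computed via log1p/expm1 so that very
    # small probabilities are not rounded away.
    return -math.expm1(positions * math.log1p(-p_single))


FOUR_MB = 4 * 1024 * 1024
p_four_bytes = chance_of_signature(FOUR_MB, 4)   # the 4-byte string "Rar!"
p_seven_bytes = chance_of_signature(FOUR_MB, 7)  # a 7-byte signature
```

For a 4 MB file the 4-byte case comes out at roughly 1 in 1024, in line with the estimate above; a longer signature makes an accidental match far less likely.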

Restricted Application added a subscriber: Poyekhali. Nov 28 2016, 10:52 PM

matmarex mentioned this in T48921: Refuse uploading JPEG files with extra junk at the end. Nov 28 2016, 10:53 PM

Fae subscribed. Dec 2 2016, 4:17 PM

zhuyifei1999 subscribed. Jan 24 2017, 8:37 AM

Analyzing a large JPEG I uploaded a while ago with strings + grep, I got a few near matches for the Rar! signature:

$ strings in.jpg | grep 'Rar'
MRar
sYRar5
RarU
Rar	
Rare
Rar6Y
+%hRar
Rar1z
$ strings in.jpg | grep 'ar!'
ar!	JQ
{ar!	
ar!K]&F@
Iar!
ar!W
ar!*L
%ar!

Also, playing with a visual representation of the first few megabytes of the file's binary contents, using $ < in.jpg rawtoppm -rgb 1024 1024 | pnmtopng > out.png, with the JPEG I got:

Elisa_Bonaparte_with_her_daughter_Napoleona_Baciocchi_-_François_Gérard_-_Google_Cultural_Institute.jpg.vis.png (1×1 px, 3 MB)

And with /dev/urandom:

I'd say the JPEG data is quite random.
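One way to put a number on "quite random" is the Shannon entropy of the byte distribution. This is a minimal sketch, not something used in this task: the helper `byte_entropy` is hypothetical, and it only looks at single-byte frequencies, so it says nothing about patterns spanning several bytes.

```python
import math
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of the byte values in data, in bits per byte.

    0.0 means every byte is identical; 8.0 means all 256 values are
    equally frequent, as expected from /dev/urandom.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # iterating bytes yields ints 0..255
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Running it over the compressed portion of a JPEG and over the same number of bytes from /dev/urandom would give two figures that can be compared directly.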

tom29739 subscribed. Apr 10 2017, 1:01 PM

matmarex's method, with a bit more work parsing the image, would work (see for example http://stackoverflow.com/a/4614629/342196). Rather than detecting a specific file format: once you reach the first FFD9 (the real JPEG EOF), if you are not at the end of the file, you have detected a problem.

Look, FFD9 is not a mandatory marker. See comment: "The end-of-file marker in JPEG files is optional, so this doesn't really help. Matma Rex (talk) 19:49, 28 November 2016 (UTC)"

User:Embedded Data Bot currently parses the JPEG with Pillow to find the EOF, but there can be false positives / negatives sometimes.

Then parsing up to FFDA, plus the offset, plus [optionally] FFD9, together with some heuristics on file size, would be more reliable.

The only other 100% secure option would be to losslessly convert the files before making them public to remove unknown blobs (or detect the size).
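The marker-walking idea discussed above can be sketched as follows. This is an illustrative sketch only, not what Embedded Data Bot does (it uses Pillow), and the function name `jpeg_trailing_bytes` is made up. It skips each segment by its declared length, so an FFD9 inside an embedded thumbnail is not mistaken for the end of the image, then scans the entropy-coded data after each FFDA for the next real marker. If no FFD9 is found it returns None, which is exactly the case where the file-size heuristics mentioned above would have to take over.

```python
def jpeg_trailing_bytes(data):
    """Return the number of bytes after the EOI marker (FFD9), or None if
    the data does not start with SOI (FFD8) or no EOI can be located."""
    if data[:2] != b"\xff\xd8":
        return None
    n = len(data)
    pos = 2
    while pos + 1 < n:
        if data[pos] != 0xFF:
            return None  # not positioned on a marker: give up
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0xD9:  # EOI: everything after it is trailing data
            return n - (pos + 2)
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            pos += 2  # standalone markers (TEM, RSTn) carry no length
            continue
        if pos + 4 > n:
            return None  # truncated segment header
        # Big-endian length; it counts the two length bytes themselves.
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if length < 2:
            return None
        pos += 2 + length
        if marker == 0xDA:
            # Entropy-coded data follows the SOS header. Skip it until a
            # real marker: FF followed by anything except 00 (stuffed
            # byte), FF (fill) or D0..D7 (restart markers).
            while pos + 1 < n:
                nxt = data[pos + 1]
                if (data[pos] == 0xFF and nxt not in (0x00, 0xFF)
                        and not 0xD0 <= nxt <= 0xD7):
                    break
                pos += 1
    return None
```

A non-zero result means something follows the image data; it does not say what, so a signature check would still be needed to tell an archive from harmless padding.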

BTW, the RAR signature is 8 bytes: "RAR 5.0 signature consists of 8 bytes: 0x52 0x61 0x72 0x21 0x1A 0x07 0x01 0x00. You need to search for this signature in supposed archive from beginning and up to maximum SFX module size." It is very, very unlikely to occur by accident in the first megabyte (SFX zone), so there is no need for a full RAR parser to detect it. http://www.rarlab.com/technote.htm#arcblocks

(Claiming per bot task)

RAR 5.0 is apparently a completely different format from the previous versions. I think I saw it once, among a dozen or so funky files I examined some time ago.

From the same page: RAR 4.x 7 byte length signature: 0x52 0x61 0x72 0x21 0x1A 0x07 0x00
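A search for the two signatures quoted above needs nothing beyond a byte-string search. The sketch below is a hedged illustration rather than the bot's actual code; `find_rar_signatures` and the constant names are hypothetical, and the byte values are the ones quoted in this thread.

```python
# Signature bytes as quoted in the comments above.
RAR4_SIGNATURE = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00])
RAR5_SIGNATURE = bytes([0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00])


def find_rar_signatures(data):
    """Return a sorted list of (offset, format_name) for every RAR 4.x or
    RAR 5.0 signature found anywhere in data."""
    hits = []
    for name, signature in (("RAR 4.x", RAR4_SIGNATURE),
                            ("RAR 5.0", RAR5_SIGNATURE)):
        pos = data.find(signature)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(signature, pos + 1)
    return sorted(hits)
```

The two signatures differ in their seventh byte, so one archive header never matches both. Restricting the search to the bytes after the end of the image data would cut false positives further.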

Yeah, @valhallasw also found some docs on the structure of the 4.0 format. Thanks!

It is not my intention to tell you how to do this; I was just trying to help do it without having to install proprietary software on Wikimedia servers, since T151794 was rejected.

Closing this as resolved, as this has been done with a bot, which is currently approved and active on Commons only. If anyone wants to implement it in MediaWiki core so that all MediaWiki installs can have automated detection, feel free to reopen.

	F5377236: Elisa_Bonaparte_with_her_daughter_Napoleona_Baciocchi_-_François_Gérard_-_Google_Cultural_Institute.jpg.vis.png
	Jan 27 2017, 6:07 PM
