Related Objects
- Mentioned In
- T12847: Detect RAR concatenation in jpeg images
- Mentioned Here
- T12847: Detect RAR concatenation in jpeg images
Event Timeline
It has been a long time since I last looked at that problem, but… isn't the free unrar enough to detect that it is indeed a RAR file and not a random file containing "Rar!"?
The free unrar can only decompress archives created by RAR versions prior to 2.9 (2002). All hidden archives I've found were using the new format.
But is full decompression needed? Just an unrar -l would be enough to confirm that there's an extra file added, wouldn't it?
$ unrar-free --list Camera_10125.jpg
unknown archive type, only plain RAR 2.0 supported(normal and solid archives), SFX and Volumes are NOT supported!

Pathname/Comment
                  Size   Packed Ratio  Date   Time     Attr      CRC   Meth Ver
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
    0                0        0 -nan%
$ unrar-nonfree L Camera_10125.jpg
Archive: Camera_10125.jpg
Details: RAR 5, SFX

 Attributes      Size     Date    Time   Name
----------- ---------  -------- -----  ----
 -rwxrwxr-x  26812422  28-11-16 14:06  Tutorial.mp4
----------- ---------  -------- -----  ----
             26812422                  1
The source code is available at http://www.rarlab.com/rar_add.htm and I got it to compile on Tool Labs. Due to the increasing complexity of abuse in T129845, I would very much like to use this tool as a sidekick for existing anti-abuse algorithms. The license is certainly more restrictive and not very libre, but I would like clarification on whether running it on Tool Labs is allowed.
As far as I can see, the license of unrar is non-libre, so it would (unfortunately) not be allowed on Labs.
However, for the purpose of detecting RAR files, decompression may not be necessary: just detecting the RAR header (52 61 72 21 1A 07 01 00 for RAR5) might already be enough. Of course, there could be some false positives, but these can be reduced by adding better heuristics (e.g. based on the notes on the file structure at http://www.forensicswiki.org/wiki/RAR).
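As a rough illustration of the header check, here is a minimal Python sketch. The function name find_rar_signatures is made up for this example, and the 7-byte signature of the older RAR 1.5 to 4.x format (52 61 72 21 1A 07 00) is added on the assumption that older archives should be caught as well. It only searches for the magic bytes and does not validate the archive structure.

```python
# Signatures to look for. The RAR5 one is the 8-byte header quoted in this thread;
# the RAR4 one (older format) is an addition for this sketch.
RAR4_MAGIC = b"Rar!\x1a\x07\x00"
RAR5_MAGIC = b"Rar!\x1a\x07\x01\x00"


def find_rar_signatures(data):
    """Return a sorted list of (offset, label) for every RAR signature in data."""
    hits = []
    for label, magic in (("RAR4", RAR4_MAGIC), ("RAR5", RAR5_MAGIC)):
        start = 0
        while True:
            pos = data.find(magic, start)
            if pos == -1:
                break
            hits.append((pos, label))
            start = pos + 1  # keep scanning after this hit
    return sorted(hits)


if __name__ == "__main__":
    import sys

    # Usage: python find_rar.py Camera_10125.jpg
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            for offset, label in find_rar_signatures(fh.read()):
                print(f"{path}: {label} signature at byte offset {offset}")
```

A hit at a non-zero offset in a file that otherwise parses as a JPEG would be a candidate for closer inspection, not proof of a hidden archive.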
I would suggest starting with the RAR header detection; it's 8 bytes long, so in random data false positives should only happen roughly once every 2^64 ≈ 1.8e19 bytes, i.e. about 18 million TB, which sounds not too bad :-)
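The estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes every byte is uniformly random and independent, which real image data is not, so it only gives an order of magnitude.

```python
SIGNATURE_BYTES = 8  # length of the RAR5 signature

# Each byte position matches a fixed 8-byte pattern with probability 2**-64,
# so on average one accidental match is expected per 2**64 bytes scanned.
bytes_per_false_positive = 2 ** (8 * SIGNATURE_BYTES)
terabytes_per_false_positive = bytes_per_false_positive / 1e12

if __name__ == "__main__":
    print(f"{bytes_per_false_positive:.2e} bytes")
    print(f"{terabytes_per_false_positive:.2e} TB")
```

The 7-byte signature of older RAR versions would give an expectation 256 times smaller, which is still far beyond any realistic upload volume.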