Page MenuHomePhabricator

Allow uploading (almost) any type of file by recognizing "nasty" files automatically
Closed, ResolvedPublic

Assigned To
None
Authored By
daniel
Nov 16 2004, 8:37 PM
Referenced Files
F1353: MimeMagic.php
Nov 21 2014, 7:02 PM
F1352: ImagePage.php
Nov 21 2014, 7:02 PM
F1354: image-5.patch
Nov 21 2014, 7:02 PM
F1351: Language.php
Nov 21 2014, 7:02 PM
F1350: patch-img_media_type.sql
Nov 21 2014, 7:02 PM
F1348: mime.info
Nov 21 2014, 7:02 PM
F1349: mime.types
Nov 21 2014, 7:02 PM
F1347: LanguageDa.php.diff
Nov 21 2014, 7:02 PM

Description

As of now, most file types are disallowed for uploads to the wikipedia. As i
understand, this is due to the fact that MSIE can be tricked into executing
potentially harmful javascript-code in any file, regardles of mime-type or file
extension. This is the case because MSIE will interpret all files that "somehow
look like" HTML as such. In the following i would like to propose a solution to
this problem.

It would be good to be again able to upload files in formats like MIDI and SVG -
SVG ist especially tricky, as it allows JavaSCript by specification. This types
of files would be hande so the data can be easily edited and re-used by others,
an especially because the GFDL calls for the "transparent source" of a document,
which is at the moment impossible to provide via Wikipedia.

But it is also sad but true that javascript is in general not very secure, be it
in HTML or in SVG. Thus, i would suggest doing the following:

a) detect and reject all files that "somehow look like" HTML, emulating MSIEs
guess - as i hear there is already experimental code for this in the CVS. I
re-wrote a trivial version of this guess for the code presented at the URL given
below.
b) detect and reject all files containg javascript code. This will probably
produce some false positives, but that is better than rejecting all but a few
formats alltogether. The dection function in the URL below is rather crude, but
should work most of the time.
c) scan for viruses on upload, using an external virus scanner. The wrapper
function i wrote is generic and can deal with any scanner that can be invoked
from the command line. The example uses clamav.

A prototype, proof-of-concept implementation can be viewed and tested here:

http://area23.brightbyte.de/checkfile-test.php

The source is available there, but would need some modifications to be
integrated into mediawiki (i guess - i have never looked into the MW-source, and
i don't plan to). For more information and suggestions, please go to

http://de.wikipedia.org/w/wiki/Benutzer:Duesentrieb/checkfile

I hope you like it and it's not to hard to put it in. It would be extremely
helpful if we could again upload "obscure" things like MIDI and SVG to the
Wikipedia.

PS: Bug #667 relates to the same problem, but is very specific in that it justs
requests a single format to be alowed. It also does not propose a solution.


Version: 1.5.x
Severity: enhancement
URL: http://area23.brightbyte.de/checkfile-test.php

Details

Reference
bz898

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 7:02 PM
bzimport set Reference to bz898.
bzimport added a subscriber: Unknown Object (MLST).

I would like to add another suggestion: Show an extra warning on the
description-page of media in a potentially dangerous format (or rather for any
non-image, non-sound format). The message could read something like this:

"This file may contain executable code that might damage your system.
If you download this file and open it on your computer, potentially
harmfult contents may be run. Please make sure you know what you
are doing. The Wikimedia Foundation does not take any resposibility for
the contents of this file or any harm it may do to your system."

This is not only aimed at macro-viruses hidden in doc-fiels etc (which the virus
scanner will hopefully find on upload), but maily at "trivial trojans" like
batch files containing a "format C:" or some such.

In that context, it meight also be good to block all files with the extensions
.exe, .bat, .cmd, .reg, .js as well as any files starting with a she-bang (#!),
regardles of the guessed mime-type. That is, there should be a list of forbidden
extensions in addition to a list of forbidden mime-types (or does that already
exist?), and an additional check for the she-bang. If you like, i could add that
to the sample code, but it's trivial.

When discussing this on wikitech-l, Rowan Collins suggested the following:

That leads to an interesting thought: presumably, non-image files are
already blocked from being included "inline"; but what about the magic
[[Media:...]] links? I know it's pretty dumb to open a file that you
don't know what it is, but some people genuinely are, and it would be
nice to do everything we can to help them. In other words, perhaps
it would make sense to have *three* whitelists:

  1. Images: can be uploaded, linked to direct, and included inline
  2. Other trusted files (e.g. sounds, videos): can be uploaded, linked

to directly with [[Media:]], but not included inline with [[Image:]]

  1. Potentially harmful files (e.g. .zip, .doc): can be uploaded (if

they pass a virus scan), but not linked to directly; users are always
directed to the description page, where they are presented with a
warning message.

brian wrote:

The Wikimedia Foundation's site (http://wikimediafoundation.org/)
should allow uploading of any file immediately, since only trusted
users can access it.

I can see three problems with virus scanning at this time:

  • we like free software...but can we get a free virus scanner with

decent definition updates?

  • some people may mistakenly think that it's foolproof
  • it will take a lot of processor time (expensive)

The second problem can be resolved with decent warnings, I suppose.
If we are relying on external hosting then we don't want to waste
processor time.

I don't like virus scanning...some are bound to slip through the net.

brian wrote:

Be careful with that warning! Some old MS-DOS manuals contain some
rubbish about users not being able to harm their computers whatever
they did - people may not take that warning seriously, particularly
in places where computers are just being introduced.

(In reply to comment #2)

The Wikimedia Foundation's site (http://wikimediafoundation.org/)
should allow uploading of any file immediately, since only trusted
users can access it.

Maybe, maybe not. Trusted users can have conterminated files on their box too.

  • we like free software...but can we get a free virus scanner with

decent definition updates?

Yes, camav. Also, the code I designed is very felxible, it can easily interface
to any scanner that can be run from the command line.

  • some people may mistakenly think that it's foolproof

Not more so than before - there is no reason to put up a sign "all files are
scanned for viruses". The user will not notice the scanning unless a virus is found.

  • it will take a lot of processor time (expensive)

Not really. We see about one image per minute uploaded to the commons. Scanning
a big jpg (1.8 MB) on my box (1.9 GHz P4) takes about 1 sec of CPU time.

I don't like virus scanning...some are bound to slip through the net.

Yes. But our alternatives are 1) to ave no net at all and b) to limit allowed
formats to a minimum. Note that there are virusses that can infeect JPGs, too
(they use a bug in the windows image rendering libs and effect all browsers).
Those are not detected right now.

I belive may suggestion should be implemented as soon as possible, maybe even
for 1.5; I already supplied code the implements the functionality, it should not
be too hard to integrate it into mediawiki. Maybe I'll try to do it mayself, but
i'm not sure when I get around to it.

The disability to supply sources for graphics is frustrating to many users who
would like to contribute SVG (we even have a renderer for that!), MIDI and
similar formats that can easily be edited by others, unlike bitmap images.

Created attachment 497
proof-of-concept patch implementing my suggestions. Please read the comment

Created a proof of concept patch. This is just for reference - it uses an ugly
bitfield logic, which has the advantage of being compatible iwth the current
1.5 DB schema. The Real Thing should however be based on three separate fields
in the database, for media class (image, drawing, audio, video, etc) and major
and minor mime type, as suggested by brion and tim.

This patch consists of three parts (which should probably go into separate
patches eventually):

  1. Changes to the verification code in SpecialUpload.php:
  2. improved mime detection / verification
  3. detection of javascript
  4. virus scanning
  1. Changes to Image.php that allow to determine if a medium is fit to be shown

inline, to be linked to directly, or is non-trusted formats that could contain
malicious code (i.e. anything else).

  1. Changes to ImagePage.php and Linker.php
  2. put a warning on the description page of non-trusted files
  3. disallow direct [[media:...]] links to non-trusted files

The patch also contains changes to DefaultSettings, Defines and Languages to
support the above.

Please have a look at the patch and comment, so I can implement changes when I
adopt this to the new and beter database schema.

attachment image.patch ignored as obsolete

Created attachment 511
Full patch for this bug. Needs additional files and a database change, files follow.

attachment image-3.patch ignored as obsolete

Created attachment 512
database patch for inserting new columns into the image table (path: maintenance/archives/patch-img_media_type.sql)

attachment patch-img_media_type.sql ignored as obsolete

Created attachment 513
mime type detection module

attachment MimeMagic.php ignored as obsolete

Created attachment 514
standard mime.types file (mapping mime types to file extensions), used by MimeMagic.php (path: includes/mime.types)

attachment ignored as obsolete

Created attachment 515
mime.info file defining mime type aliases and media types, used by MimeMeagic.php (path: includes/mime.info)

attachment mime.info ignored as obsolete

The attachments I just mad ned to be applied together. Note that this patch
changes the columns in the image table (this is done by the
patch-img_media_type.sql database patch): img_type is obsolete and replaced by
img_media_type, img_major_mime and img_minor_mime.

updater.inc was adjusted to apply this database patch, but it has not been tested.

Note that the patch introduces several new setup variables into
DefaultSettings.php. To take full advantage of the new upload behaviour, set the
following in LocalSettings.php:

$wgStrictFileExtensions = false;
$wgVerifyMimeType= true;
$wgAntivirus= "clamav"; #or "f-prot"
$wgMimeDetectorCommand= "file -bi"; #on linux

Note that if $wgMimeDetectorCommand is net set, the new code tries to use
mime_content_type or finfo to detect the mime type, if either is available.
Those extensions need to be compiled into PHP, their use has not been tested yet.

This patch has been installed and tested successfully on Wegge's test wiki
http://playwiki.wegge.dk/, you can try it out there. I hope it can be applied
to the 1.5 branch soon.

wegge wrote:

Danish translations for the new strings introduced by this patch

Besides submitting the LanguageDa.php diff, I can tell that so far the test has
given no warnings or errors from the MediaWiki installation, except for those
introduced by my forgetfullness in the patching phase. If anybody experiences
problems on http://playwiki.wegge.dk I'll be happy to investigate, given an
approximate time.

Attached:

Here's where the new files go (sorry, I should have put that into the
description of each attachment). The patches are obviously applied as patches.
The other files are, in order:

  • database patch: maintenance/archives/patch-img_media_type.sql
  • mime type module: includes/MimeMagic.php
  • mime type file: includes/mime.types
  • mime info file: includes/mime.info

that should be it :)

avarab wrote:

There are some things you change that aren't directly related to this
feature/functionality:

This should probably be enabled at some point, but isn't related to this feature:
""""
-$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg' );
+$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg' );
""""

Changing the Exif read/get functions doesn't have much to do with this:
""""
@@ -1146,7 +1289,10 @@

    • @return array */ function retrieveExifData () {
  • if ( $this->type !== '2' ) return array ();

+ global $wgShowEXIF ;
+ if ( !$wgShowEXIF or !$this->fileExists) return array ();
+
+ if ( $this->getMimeType() !== "image/jpeg" ) return array ();

$exif = exif_read_data( $this->imagePath );
foreach($exif as $k => $v) {

@@ -1162,11 +1308,14 @@

                unset($exif[$k]);
        }
}

+

        return $exif;
}

function getExifData () {
  • global $wgRequest;

+ global $wgRequest, $wgShowEXIF;
+
+ if (!$wgShowEXIF or !isset($this->exif)) return array();

$purge = $wgRequest->getVal( 'action' ) == 'purge';
$ret = unserialize ( $this->metadata );

@@ -1416,7 +1565,7 @@
""""

Thera are also some stray whitespace changes:
""""
@@ -881,6 +1026,7 @@

        }
        wfPurgeSquidServers( $urls );
}

+

}
 
/**

@@ -924,16 +1070,6
""""

""""

);

+

  1. Update the current image row

""""

""""
@@ -1162,11 +1308,14 @@

                unset($exif[$k]);
        }
}

+

        return $exif;
}

""""

""""
@@ -1443,7 +1592,6 @@

return array_key_exists( $name, $titleList );

}

""""

""""

}

+

        chmod( $this->mSavedFile, 0644 );
        return true;
}

@@ -640,153 +677,207 @@
""""

""""
+
}
?>
""""

""""

}

+

readfile( $fname );

}
""""

And regarding this:

""""
@@ -1,17 +1,20 @@

  • Extra image metadata, added for 1.5

+-- NOTE: as by patch-img_media_type.sql, the img_type
+-- column is no longer used and has therefore be removed from this patch
+
ALTER TABLE /*$wgDBprefix*/image ADD (

img_width int(5) NOT NULL default 0,
img_height int(5) NOT NULL default 0,
img_bits int(5) NOT NULL default 0,
  • img_type int(5) NOT NULL default -1

+ -- img_type int(5) NOT NULL default -1
);

ALTER TABLE /*$wgDBprefix*/oldimage ADD (

oi_width int(5) NOT NULL default 0,
oi_height int(5) NOT NULL default 0,
oi_bits int(3) NOT NULL default 0,
  • oi_type int(3) NOT NULL default 0

+ -- oi_type int(3) NOT NULL default 0
);
""""
if img_type and oi_type aren't used anymore please add a DELETE statement to
patch-img_media_type.sql that removes them, and change
maintenance/parserTests.php so that it doesn't create this obsolete table.

Otherwise it looks pretty solid, I haven't really tested it on a wide range of
input files yet, which I might just do;)

avarab wrote:

Some further comments after looking at it a bit more closely:

  • The whole MimeMagic.php file is just a row of functions most of which have to

be parsed on each invocation, it should be a class that's loded on demand.

  • The whole wfInitMimeMaps() function just serves to make arrays like:

Array
(

[application/ogg] => ogg ogg
[application/pdf] => pdf pdf
[application/x-javascript] => js js
[application/x-shockwave-flash] => swf swf
[audio/midi] => mid midi kar mid midi kar
[audio/mpeg] => mpga mpa mp2 mp3 mpga mp2 mp3
<snip>

Array
(

[OFFICE] => Array
    (
        [0] => application/pdf
        [1] => application/pdf
        [2] => application/acrobat
 <snip>

Array
(

  • => text/rtf [application/x-javascript] => text/javascript [audio/mpeg] => audio/mp3 [application/ogg] => audio/ogg <snip>

This could be simplified with something like:

$mime = array(

'ogg' => array (
        'type' => AUDIO,
        'mime' => array(
                'audio/ogg',
                'application/x-ogg',
                'application/ogg',
                'audio/x-ogg'
        )
)

);

Which would then be a native php datastructure, it would load alot faster and be
a bit more understandable, the former will not really be a problem if it's
rewritten as a class and invoced only when needed, but currently the whole thing
is being parsed in full and arrays generated even when they're not needed such
as on pages that don't have images, since Image.php is loaded even then.

  • There's some usage of @ to hide errors, this is generally a bad thing™ and

should indicate that things aren't being done like they should.

(In reply to comment #16)

I'd like to point out a few things about avar's comments. In general, he has
some valid points, but with others I disagree. It would be good to have some
more input on this.

  • The whole MimeMagic.php file is just a row of functions most of which have to

be parsed on each invocation, it should be a class that's loaded on demand.

It should indeed be loaded on demand, but being a class does not help with that
(a class needs to be parsed in full, too, even if it's never instantiated). That
is, the MimeMagic.php file should only be *included* on demand (ther's even a
note about this in Image.php where it's included). This would require the
include statement to be inside a function - I know this is possible, but brings
with it other problems, like global variables defined in MimeMagic to be
declared global explicitely (otherwise they would end up in the scope of the
calling function). This is rather obscure, but could be done. I did not do that
because I expected this to raise objections (as I belive doing includes inside
functions is obscure and should be avoided).

  • The whole wfInitMimeMaps() function just serves to make arrays like:

...

This could be simplified with something like:

$mime = array(

'ogg' => array (
        'type' => AUDIO,
        'mime' => array(
                'audio/ogg',
                'application/x-ogg',
                'application/ogg',
                'audio/x-ogg'
        )
)

);

Which would then be a native php datastructure, it would load alot faster and be
a bit more understandable,

I disagree:

  • Complex nested array structures are less readable and harder to maintain (I

tried it, it was like that initially).

  • the map above would have to be process to build the reverse mappings (mime to

ext, mime to mime aliases, etc). Maintaining the reverse mappings by hand
seperately is redundant and error prone. Building several maps from a file has
the advantag of reducing redundancy and easing maintanance.

  • Also, mime.type files are a standard structure on *nix systems and it should

be possible to use the systems standard mime map (/etc/mime.types) instead of
the on provided with mediawiki.

  • The structure of mime.types and mime.info is much easier to understand and

maintain by non-programmers, and they are also more robust against syntax errors.

It is true however that native php structures would load faster. Even faster
would be serialized php structures, but they are not really maintainable by
hand. I belive this to be a non-issue, because those files are never parsed in
normal "viewing" operation, but only under three conditions:

  • a new file is uploaded
  • an old version of a file is restored
  • the entry in the image table is upgraded from 1.4 format

the former will not really be a problem if it's
rewritten as a class and invoced only when needed, but currently the whole thing
is being parsed in full and arrays generated even when they're not needed such
as on pages that don't have images, since Image.php is loaded even then.

That is not true: the reason the arrays are build inside a function is precisely
to only do that on demand. The MimeMagic.php is included but non of the detector
functions is ever called, the arrays are not initialized. Including them in the
files as native structures would, among other things, cause them to always be
parsed. Also, as I said before, if the function is wrapped in a class or not is
irrelevant to the fact that it is always parsed in full.

  • There's some usage of @ to hide errors, this is generally a bad thing™ and

should indicate that things aren't being done like they should.

The @ to hide errors is used *only* for lookups in arrays when it is not known
in advance if ther's a value for that key. The same could be done with
isdefined, if that is preferred. Supressing errors is necessary in some
situation and done throughout the code.

please comment, so I can fix issues. I'm currently inclined to do the
conditional include in Image.php, but i'll wait for more input for now.

Created attachment 525
new version of MimeMagic.php, code is now wrapped in a class; this version matches the patch that follows. (path: includes/MimeMagic.php)

attachment MimeMagic.php ignored as obsolete

Created attachment 526
new patch, using the new, OO version of MimeMagic.php

New patch that hopefully takes care of avar's concerns. Main changes are

  • MimeMagic code is now wrapped in a class. One one global instance is needed
  • To get an instance of MimeMagic, use wfGetMimeMagic(). This function also

includes the code file if neccessary.

  • I also adjusted ParserTest.php to use the new structure of the image table.

enjoy!

attachment image-p2.patch ignored as obsolete

Created attachment 527
"upgrade" patch in case you already applied the first full patch (from 2005-05-10 20:46 UTC)

attachment image-p1-p2.patch ignored as obsolete

(In reply to comment #15)

This should probably be enabled at some point, but isn't related to this feature:
""""
-$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg' );
+$wgFileExtensions = array( 'png', 'gif', 'jpg', 'jpeg', 'ogg' );

I removed ogg from the default whitelist because many web servers do not have a MIME
type set for the .ogg extension. This makes such files more susceptible to various
attacks: client-side type autodetection bugs are more easily triggered on text/plain
or application/octet-stream, and server-side type detection bugs like interpreting
files as executable scripts is easier to trigger when there's not a default file type
for it to match.

While the various other precautions make these less likely (checking for multiple
extensions and blacklisting known bad ones, checking for known client-side script
triggers, etc), I prefer a 'secure by default' philosophy. Admins can explicitly add
it to the whitelist (as we have) if it's known to be safe in the environment they've
created.

avarab wrote:

New patch that hopefully takes care of avar's concerns. Main changes are

It still changes $wgFileExtensions, the exif handling, whitespace and now has
$wgUploadDescriptionPattern (although unused) in DefaultSettings.php, please
make sure that the patches you send in only change things directly related to
this particular feature, having other changes in there makes it harder to review.

avarab wrote:

*** Bug 667 has been marked as a duplicate of this bug. ***

avarab wrote:

*** Bug 1683 has been marked as a duplicate of this bug. ***

avarab wrote:

*** Bug 1790 has been marked as a duplicate of this bug. ***

Created attachment 542
updated patch, implementing most of brions suggestions. Requires the new version of the other files, see below.

attachment image-4.patch ignored as obsolete

Created attachment 543
MIME detection module, new version (includes/MimeMagic.php)

attachment MimeMagic.php ignored as obsolete

Created attachment 544
updated mime info (includes/mime.info)

Attached:

Created attachment 545
updated mime types (includes/mime.types)

Attached:

Created attachment 546
updated database patch, now also removed the obsolete img_type column. (maintenance/archives/patch-img_media_type.sql)

Attached:

I just uploded a new version of my patch. I hope I have implemented everything
brion and avar suggested.

Major changes are:

  • Databse consistence:
    • automatically rebuild memcached image info
    • give a descriptive error if the database schema was not updated
    • delete obsolete column in the database patch
  • removed the code for disabling [[media:...]] links
  • Cleaned up changes to whitespace and code style. At least I tried.
  • Some changes to ImagePage.php - more info about non-image files, links to

source of pre-rendered images, etc.

  • Tweaked the code to also render TIFF, BMP etc if ImageMagic is available.
  • PHP code is now detected an rejected
  • cleaned up mustRender/canRender/allowInlienDisplay

Ther has been some more cleanup and minor fixed. Please have a look.

avarab wrote:

MIME detection module, new version (includes/MimeMagic.php)

The version by Daniel Kinzler had an undefined variable ($e) in
guessTypesForExtension()

attachment MimeMagic.php ignored as obsolete

avarab wrote:

MIME detection module, new version (includes/MimeMagic.php)

Fixes a typo in guessTypesForExtension() (pregt_replace => preg_replace)

attachment MimeMagic.php ignored as obsolete

avarab wrote:

attachment 542 contains alot of changes to the enotif code that tim cleaned
up/removed recently, please go over the patch manually to make sure it doesn't
revert any recent changes.

And there's a syntax error on line 258 of MimeMagic.php, you need to add slashes
around that regular expression.

avarab wrote:

Changes to Language.php, should be applied to a clean tree (not directly after attachment 542)

replaces the includes/Language.php part of attachment 542, tones down the
warning a little and removes the link to Wikipedia.

Attached:

avarab wrote:

Changes to ImagePage.php, should be applied to a clean tree (not directly after attachment 542)

replaces the includes/ImagePage.php part of attachment 542, moves the links
around so that the direct link to the file is always immitiately after the TOC
(as it was previously).

Attached:

Created attachment 554
MIME detection module (includes/MimeMagic.php), fixing bugs mentioned in #32, #33, #34

Attached:

Created attachment 555
new version of the FULL PATCH, fixing bugs mentioned in #32, #33, #34

Ok, I hope that's all.

FYI: this patch includes things mentioned in #32, #33, #34, but not the changes
by attachment 549 and attachment 550.

Attached:

avarab wrote:

Commited to HEAD, marking this as FIXED.

avarab wrote:

*** Bug 1410 has been marked as a duplicate of this bug. ***

Gilles raised the priority of this task from Medium to Unbreak Now!.Dec 4 2014, 10:29 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to Medium.Dec 4 2014, 11:22 AM