File extensions should be automatically decided by MIME type at upload
Open, Low, Public, Feature

Description

Author: johnnymrninja

Description:
Breaking this off of the related T34660, which was itself broken off of T6421. This would also solve T31284.

As MW detects the MIME type of the file as it is being uploaded, it should not rely on the uploader to provide a file extension. Rather, the file extension should be set automatically by the software. Any extension detected in the name should be automatically removed.

For example, if Cheese.JPEG is uploaded but the detected MIME type is PNG, the file should be named Cheese.png, and not Cheese.JPEG.png. If the MIME type matches the extension (the file really is a JPEG), it should simply be named Cheese.jpg. This should also create a notice for the uploader, so they don't lose track of their uploaded file.
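
A minimal sketch of the renaming rule described above, not MediaWiki code. The table `PREFERRED_EXT`, the set `KNOWN_EXTS`, and the function `propose_name` are hypothetical names made up for illustration, and the detected MIME type is assumed to be supplied by the caller.

```python
import os

# Hypothetical table: detected MIME type -> preferred extension.
PREFERRED_EXT = {
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/gif": "gif",
}

# Hypothetical set of extensions the wiki recognizes (lowercase).
KNOWN_EXTS = {"jpg", "jpeg", "png", "gif"}


def propose_name(filename: str, detected_mime: str) -> str:
    """Return the name the upload would be stored under."""
    preferred = PREFERRED_EXT.get(detected_mime)
    if preferred is None:
        # Unknown MIME type: this sketch leaves the name alone.
        return filename
    stem, ext = os.path.splitext(filename)
    if ext[1:].lower() in KNOWN_EXTS:
        # Drop the extension supplied by the uploader.
        filename = stem
    return f"{filename}.{preferred}"


print(propose_name("Cheese.JPEG", "image/png"))   # Cheese.png
print(propose_name("Cheese.JPEG", "image/jpeg"))  # Cheese.jpg
```

The notice for the uploader is left out here; the sketch only covers how the stored name could be derived.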

Obviously this will not fix existing issues mentioned in the first two bugs, but it will prevent future issues.


Version: unspecified
Severity: enhancement
See Also:
T34660: File extensions for the same file type should not allow variations of a file name (File:X.jpg, File:X.jpeg, File:X.JPG should all refer to the same file)
T31284: Upload form should change file extensions to the canonical form automatically (lowercase, jpeg→jpg etc.)
T213484: Normalize file extensions (capital vs small letters; jpg vs jpeg) for new uploads on Commons
T144593: File extension changes automatically while moving ogg audio file on Commons, caused by a gadget

Details

Reference
bz40479

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 1:08 AM
bzimport set Reference to bz40479.
bzimport added a subscriber: Unknown Object (MLST).

johnnymrninja wrote:

Hopefully this should also prevent files with unknown or unsupported MIME types from being uploaded with a supported extension. So Trojan.XXX shouldn't be uploaded as Trojan.gif. This would mean a list of extensions that unknown MIME type uploads are checked against.
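
A rough sketch of that check, assuming a hypothetical list of accepted MIME types (`ACCEPTED_MIME`) and a hypothetical list of supported extensions (`SUPPORTED_EXTS`). The function name and the example MIME type are illustrative only.

```python
# Hypothetical lists; a real wiki would configure these.
ACCEPTED_MIME = {"image/jpeg", "image/png", "image/gif"}
SUPPORTED_EXTS = {"jpg", "jpeg", "png", "gif"}


def is_disguised_upload(filename: str, detected_mime: str) -> bool:
    """True if a file with a non-accepted MIME type carries a supported extension."""
    if detected_mime in ACCEPTED_MIME:
        return False
    # Only the final extension is checked against the list.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ext in SUPPORTED_EXTS


print(is_disguised_upload("Trojan.gif", "application/octet-stream"))  # True
print(is_disguised_upload("Trojan.XXX", "application/octet-stream"))  # False
```

Whether such an upload is then blocked or merely flagged is a separate policy decision that this sketch does not make.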

svenmanguard wrote:

This is fantastic. I recommend that you use the shortest form in all lowercase as the chosen extension (i.e. ".jpg" instead of ".jpeg" or ".JPG"). This is because .jpg is by far the most common variant for JPEGs, and .tif is the most common variant for TIFFs by a fairly large margin.

My one concern is the handling of .ogg and .ogv. These two can /occasionally/ but not always be used interchangeably, or at the very least, have been. We can't eliminate either, but we might (I lack the technical knowledge to tell for certain) run into problems with this.

Thanks for doing this,
Sven

Say someone uploads a file named: "esp. cute dogs.jpg"

Ignoring the fact that Commons probably doesn't need yet another pic of someone's puppies, the period denotes that "esp" is an abbreviation for "especially". Under this proposal, would you like us to
A) prevent the file from being uploaded
B) Auto-rename it to esp.jpg
C) Magically recognize that ". cute dogs" is not an extension, and let it through.

johnnymrninja wrote:

(In reply to comment #2)

This is fantastic. I recommend that you use the shortest form in all lowercase
as the chosen extension (i.e. ".jpg" instead of ".jpeg" or ".JPG"). This is
because .jpg is by far the most common variant for JPEGs, and .tif is
the most common variant for TIFFs by a fairly large margin.

My one concern is the handling of .ogg and .ogv. These two can /occasionally/
but not always be used interchangeably, or at the very least, have been. We
can't eliminate either, but we might (I lack the technical knowledge to tell
for certain) run into problems with this.

Thanks for doing this,
Sven

.ogg is used generically for the container format, but .ogv is designed solely
for OGG video, and .oga is solely for OGG audio. As they have separate MIME
types, there shouldn't be an issue.

The main source of conflation is that the OGG audio codec is called "OGG Vorbis",
so some people assume that the extension .ogv is for that (I know I did).

Worst case, if there is some issue with OGG, or people are super-attached to
the generic extension, the MIME type can be left alone for now.

The vast majority of uploads are pictures, and I'd rather see only those issues
resolved than none at all.

johnnymrninja wrote:

(In reply to comment #3)

Say someone uploads a file named: "esp. cute dogs.jpg"

Ignoring the fact that Commons probably doesn't need yet another pic of someone's
puppies, the period denotes that "esp" is an abbreviation for "especially". Under
this proposal, would you like us to
A) prevent the file from being uploaded
B) Auto-rename it to esp.jpg
C) Magically recognize that ". cute dogs" is not an extension, and let it
through.

The software already knows which extensions belong to which MIME types; it's not magic. As ". cute dogs" is not an extension, there would be no issue. There is no reason to attack every period, only known extensions.

Even unknown extensions should be safe, as long as their MIME type is equally unknown. If the MIME type is known, its extension is appended. So if a JPEG is uploaded as "esp. cute dogs.dog", it would become "esp. cute dogs.dog.jpg", and the uploader is asked if they wish to continue.
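
To illustrate the "only known extensions" point, here is a small sketch. `KNOWN_EXTS` and `split_known_extension` are hypothetical names, and the extension list is an assumption rather than the software's actual list.

```python
# Hypothetical set of extensions the software recognizes (lowercase).
KNOWN_EXTS = {"jpg", "jpeg", "png", "gif", "tif", "tiff"}


def split_known_extension(filename: str):
    """Split off the final extension only if it is a known one.

    Returns (stem, extension), or (filename, None) when the text after
    the last period is not a recognized extension.
    """
    stem, dot, tail = filename.rpartition(".")
    if dot and stem and tail.lower() in KNOWN_EXTS:
        return stem, tail
    return filename, None


print(split_known_extension("esp. cute dogs.jpg"))  # ('esp. cute dogs', 'jpg')
print(split_known_extension("esp. cute dogs.dog"))  # ('esp. cute dogs.dog', None)
print(split_known_extension("esp. cute dogs"))      # ('esp. cute dogs', None)
```

Because only the text after the last period is looked up, the period in "esp." never needs special handling.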

johnnymrninja wrote:

To be absolutely clear, this should only relate to extensions at the end of the
file name. So "exe.gif.png.jpg" would be a fine name for a JPEG, if bizarre.

Two Comments:

  1. A source of extension/MIME type mismatch on Commons is the reupload feature. For example, http://commons.wikimedia.org/wiki/File:Grb-Pozarevac.jpg was uploaded as a jpg and then someone reuploaded a gif over it. I guess reupload should not allow the use of other MIME types, and should instead offer to upload the file under a new name.
  2. See http://commons.wikimedia.org/wiki/User:Dispenser/sandbox for examples of 1,625 other files with extension mismatch found on Commons.

Jarek, Commons currently blocks you from uploading most files with a wrong MIME type.

But it does not block me from uploading (or reuploading) a MIME:JPG file with a .PNG extension, like http://commons.wikimedia.org/wiki/File:TPR2011.png, uploaded this March.

(In reply to comment #9)
I can't reproduce this behavior.

I just tried and I cannot reproduce it either. I tried both a new upload with an extension mismatch and a reupload. I guess someone fixed it since March, when http://commons.wikimedia.org/wiki/File:TPR2011.png was uploaded. Status: Fixed?

(In reply to comment #11)

Status: Fixed?

I think what is fixed is the reupload conflicts, not this bug, which deals with first-time uploads.

johnnymrninja wrote:

Just to summarize (got a bit off-track up there):

1. We would maintain a list of accepted MIME types and their preferred file extension.
2. Files would automatically receive an extension based on their MIME type.
3. Files that are uploaded with known extensions that do not match would be renamed after a prompt ("Renaming to 'Dog.gif'. Do you wish to continue?")
4. File names would not be otherwise modified. If a file is named "dog.gif.png" and it is a JPEG, it would be renamed "dog.gif.jpg". If it was named "dog.gif.cat", it would be uploaded as "dog.gif.cat.jpg".
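
The four points above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed patch: `PREFERRED_EXT`, `KNOWN_EXTS`, `UploadPlan`, and `plan_upload` are hypothetical, and files with a MIME type outside the list are simply passed through here.

```python
from typing import NamedTuple, Optional

# Point 1: hypothetical list of accepted MIME types and preferred extensions.
PREFERRED_EXT = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}
# Hypothetical set of extensions treated as "known" (lowercase).
KNOWN_EXTS = {"jpg", "jpeg", "png", "gif"}


class UploadPlan(NamedTuple):
    name: str               # name the file would be stored under
    prompt: Optional[str]   # confirmation text, or None if no rename happens


def plan_upload(filename: str, detected_mime: str) -> UploadPlan:
    preferred = PREFERRED_EXT.get(detected_mime)
    if preferred is None:
        # Assumption: MIME types outside the list are not handled here.
        return UploadPlan(filename, None)
    # Point 4: only a known extension at the very end is replaced.
    stem, dot, tail = filename.rpartition(".")
    has_known_ext = bool(dot and stem) and tail.lower() in KNOWN_EXTS
    base = stem if has_known_ext else filename
    # Point 2: the extension comes from the MIME type.
    new_name = f"{base}.{preferred}"
    if new_name == filename:
        return UploadPlan(new_name, None)
    # Point 3: any rename goes through a prompt.
    return UploadPlan(new_name, f"Renaming to '{new_name}'. Do you wish to continue?")


print(plan_upload("dog.gif.png", "image/jpeg").name)  # dog.gif.jpg
print(plan_upload("dog.gif.cat", "image/jpeg").name)  # dog.gif.cat.jpg
print(plan_upload("dog.jpg", "image/jpeg").prompt)    # None
```

Returning the prompt text alongside the new name keeps the decision logic separate from however the upload form chooses to display the question.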

For the purposes of this bug, the only things that would have to be modified are the file uploader, and file renaming/moving. This would not change how files are displayed or used, or even the nature of the filename. File redirects could still be manually created at these other extensions. It would just reduce the options at the time of upload, and potentially make other bugs easier to fix in the future.

Is there anyone willing to theorize on how doable this is as a bug?

I think this can be closed as RESOLVED-DUPLICATE of bug 40326

Bug 40326 seems to be a different bug. The summary in comment 13 seems correct, but I would only change the uploader. Renaming is a more manual process, and I am sure there will be cases where there's a desire to override that detection.

My only suggestion here is that, if a filename has a different MIME type to its suggested extension, surely that should be enough for an "are you sure?" prompt first, as it might well be that the uploader is uploading the wrong file.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:14 AM