Upload verification-error possibly triggered by EXIF
Closed, Duplicate · Public

Description

I have run into a problem uploading photographs from the official Flickr stream of Coconino National Forest (https://www.flickr.com/people/coconinonationalforest), which holds over 3,000 public domain photographs.

An example of a repeatedly rejected upload is https://www.flickr.com/photos/coconinonationalforest/34010878752/. The description text includes two links. These would normally be accepted as part of the Commons image page text; however, the same text and links also appear in the EXIF data for the image.

Are these likely to be triggering the filter? Putting the description in the EXIF looks like a consistent practice across the stream. Is there a way of bypassing the filter without changing the EXIF data (which would give us later verification problems), or of making the blacklist filter more intelligent?

The error returned by pywikibot is:

verification-error: This file contains HTML or script code that may be erroneously interpreted by a web browser. [details:[u'uploadscripted']; help:See https://commons.wikimedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.]

''Update''

Though I continue to see upload failures, some have been uploaded. I'm uncertain what the difference is. Example of upload working, with a similar EXIF is:
https://commons.wikimedia.org/wiki/File:Red_Mountain_Trail_No._159_(28807720875).jpg

Event Timeline

Fae created this task. · Apr 21 2017, 7:07 AM
Restricted Application added a subscriber: Aklapper. · Apr 21 2017, 7:07 AM
Fae updated the task description. · Apr 21 2017, 7:07 AM
Fae updated the task description. · Apr 21 2017, 11:12 AM
Fae added a comment. · Apr 21 2017, 12:41 PM

Yes, the inconsistency is worrying. However, I'm also concerned that the recommended "fix" is slightly stupid from the GLAM uploads perspective. I am not going to tamper with perfectly good original EXIF data, which matches the EXIF data in external archives, just because on Commons we invented an arbitrary and non-intelligent filter.

The filter needs to do better than this; we are losing out on great public domain content.

Restricted Application added a project: Multimedia. · Apr 21 2017, 1:19 PM

These weird checks exist because of an IE 5-7 misfeature where it would sometimes ignore the Content-Type indicated by the server, and instead treat things as HTML. In some cases that would lead to XSS vulnerabilities. Documentation of this is sparse because most sites stopped bothering to detect these cases ten years ago or so (thus leaving IE 5-7 users vulnerable to those attacks). Our code for it was introduced in 2004 by Brion Vibber: rSVN5580 (it changed several times afterwards, but this is the first version). We apparently later widened them to handle a similar misfeature in some version of Safari. (And there's apparently an additional, more precise check using the IEContentAnalyzer library developed back in the day by Tim Starling, which probably causes fewer false positives.)

I'm pretty sure that these days, it's already impossible to view Wikipedia in IE 5-7, since we require HTTPS with some modern encryption methods. No idea about the Safari problem (I don't even know which versions that applies to). Perhaps we could disable these checks, but you would probably need to get the Security team to investigate some ancient history and make a decision. Some minor development would be needed too, since right now it looks like they can only be disabled together with similar checks for SVG files (which are still required, because SVG files can actually contain scripts).

Though I continue to see upload failures, some have been uploaded. I'm uncertain what the difference is. Example of upload working, with a similar EXIF is:
https://commons.wikimedia.org/wiki/File:Red_Mountain_Trail_No._159_(28807720875).jpg

We only check the first 1024 bytes of the file. (This matches the behavior of the IE misfeature.) This example has a longer description, so the <a href only appears at byte 1218 and therefore is not a problem.
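
The 1024-byte behaviour described above can be reproduced locally before attempting an upload. The following is a minimal sketch only, not MediaWiki's actual check: the function names and the marker list are assumptions made for illustration (the thread only confirms that `<a href` is one of the triggers and that only the first 1024 bytes are inspected).

```python
# Hypothetical pre-upload check; NOT MediaWiki's real implementation.
CHUNK_SIZE = 1024  # leading bytes inspected, as stated in the comment above

# Assumed, illustrative markers. Only b"<a href" is confirmed by this thread;
# the real list used by MediaWiki is longer and may differ.
SUSPICIOUS_MARKERS = (b"<a href", b"<script", b"<html", b"<body")


def find_marker_in_head(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (marker, offset) for the earliest marker that lies entirely
    within the first `chunk_size` bytes, or None if there is none."""
    head = data[:chunk_size].lower()  # case-insensitive match on ASCII bytes
    hits = [(head.find(m), m) for m in SUSPICIOUS_MARKERS]
    hits = [(pos, m) for pos, m in hits if pos != -1]
    if not hits:
        return None
    pos, marker = min(hits)  # earliest offset wins
    return marker, pos


def check_file(path: str):
    """Read only the leading chunk of a file on disk and scan it."""
    with open(path, "rb") as fh:
        return find_marker_in_head(fh.read(CHUNK_SIZE))
```

Running `check_file()` over a batch of downloaded JPEGs should, under these assumptions, separate files whose EXIF description places a link early in the file from those where the link falls beyond the inspected chunk, as in the Red Mountain Trail example.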

matmarex removed a subscriber: matmarex. · Apr 24 2017, 5:36 PM
Tgr added a subscriber: Tgr. · Apr 24 2017, 6:16 PM

According to Netrenderer, IE 5-6 is broken but IE 7 works fine with Wikipedia.

Revent added a subscriber: matmarex. · May 8 2017, 9:10 AM

@matmarex That's truly esoteric.

zhuyifei1999 moved this task from Incoming to Uploading on the Commons board. · Jul 25 2017, 9:39 AM
Ramsey-WMF triaged this task as Normal priority. · Nov 28 2017, 8:23 PM
Ramsey-WMF moved this task from Untriaged to Triaged on the Multimedia board.