*.archives.gov in wgCopyUploadsDomains allowlist doesn't work as URLs are HTTP 302 redirects to s3.amazonaws.com
Closed, ResolvedPublic
Actions

Description

Analysis:
The UploadFromUrl class prepends the body content of the redirect response together with the actual file content, which corrupts the file and causes an the MIME type to be detected as text/html (arguably correctly, since a client may indeed interpret the contents as HTML).

In order to fix this, the code responsible for saving the data chunks into a temporary file needs to be aware of the redirect, and reset the file contents before receiving the actual file contents.

Original report:
*.archives.gov is already in the whitelist. However, every file I attempt to upload from this domain gives the following error:

Cannot upload this file because Internet Explorer would detect it as "text/html", which is a disallowed and potentially dangerous file type.

There does not appear to be anything wrong with the files. I am guessing this is caused by the fact that archives.gov media is actually all stored in s3, and this is just an alias domain.

For example, the URL https://catalog.archives.gov/catalogmedia/lz/riverside/rg-075/561617/561617-3-07-01.jpg actually resolves to https://s3.amazonaws.com/NARAprodstorage/lz/riverside/rg-075/561617/561617-3-07-01.jpg

How can we correctly allow these files? Is this a bug in the domain whitelist, or can we resolve it by simply adding the s3 bucket path in some way? It seems like this would be s3.amazonaws.com/NARAprodstorage/, but none of the current whitelist examples are showing a URL path beyond the (sub)domain, so I'm not sure if that works?

Details

	Subject	Repo	Branch	Lines +/-
	Manually redirect in UploadFromUrl	mediawiki/core	master	+90 -22

Customize query in gerrit

Related Objects
Search...

View Standalone Graph

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Status	Assigned	Task
		· · ·
Resolved	None	T60224 Add domains to $wgCopyUploadsDomains (tracking)
Resolved	• holger.knust	T258122 *.archives.gov in wgCopyUploadsDomains allowlist doesn't work as URLs are HTTP 302 redirects to s3.amazonaws.com
		· · ·

Event Timeline

Dominicbm created this task.Jul 16 2020, 1:34 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 16 2020, 1:34 AM

RhinosF1 subscribed.Jul 16 2020, 5:55 AM

Aklapper renamed this task from A fix for *.archives.gov in wgCopyUploadsDomains allowlist to *.archives.gov in wgCopyUploadsDomains allowlist doesn't work as URLs are HTTP 302 redirects to s3.amazonaws.com.Jul 16 2020, 8:21 AM

taavi subscribed.Jul 17 2020, 7:10 PM

I'm afraid this is hard to solve... IIRC it's impossible to whitelist something smaller than a domain, and we're definitely not going to whitelist Amazon - that would be equivalent to whitelisting anything.

Urbanecm changed the task status from Open to Stalled.Jul 19 2020, 5:52 PM

Urbanecm moved this task from Backlog to Analysis / under discussion on the Wikimedia-Site-requests board.

Urbanecm triaged this task as Low priority.Jul 19 2020, 6:25 PM

Urbanecm moved this task from Incoming to Uploading (upload_by_url and server side upload) on the Commons board.Jul 19 2020, 6:39 PM

In T258122#6317543, @Urbanecm wrote:

IIRC it's impossible to whitelist something smaller than a domain, and we're definitely not going to whitelist Amazon

@Urbanecm: In that case it's either not stalled but declined, or there should be a subtask to implement URL fragments support in $wgCopyUploadsDomains?

I thought I created it... Seems I didn't, I'll do that when I return from lunch :).

In T258122#6317543, @Urbanecm wrote:

I'm afraid this is hard to solve... IIRC it's impossible to whitelist something smaller than a domain, and we're definitely not going to whitelist Amazon - that would be equivalent to whitelisting anything.

Yes, definitely understand that consideration. I'm not sure how common it is, but I imagine there are other large image hosts using s3 (or similar products) in this way. Is an alternative approach to make the downloader properly follow any URL redirects before attempting download (i.e., check for 302 and download the target URL instead if so)? Assuming the URL redirect has been put in place by the trusted site in the whitelist, this seems safe, and would not require users to request a new whitelist entry. And also, since it's not their public domain, I am not sure we can count on organizations not to do things like changing names to buckets or having multiple.

@holger.knust can you take a look at this?

• AMooney moved this task from Inbox to Later on the Platform Team Workboards (Clinic Duty Team) board.Aug 25 2020, 1:38 PM

FRomeo_WMF subscribed.Aug 25 2020, 4:02 PM

• holger.knust claimed this task.Aug 25 2020, 7:23 PM

• holger.knust moved this task from Later to Doing(WIP:5) on the Platform Team Workboards (Clinic Duty Team) board.Aug 26 2020, 2:06 PM

The issue is what gets stored here. Since the initial request is answered with a 302 response, the downloader saves the HTML portion of the response and then appends the actual image. When the file is evaluated for its MIME type, it is classified as text/html because of the initial 302 portion.

So, this is essentially a bug in how we handle redirects, which results in incorrect mime type detection? Does it have anything to do with the domain whitelist at all?

No, it does not since it is not checked during redirects.

Change 626351 had a related patch set uploaded (by Holger Knust; owner: Holger Knust):
[mediawiki/core@master] WIP: Manually redirect in UploadFromUrl

https://gerrit.wikimedia.org/r/626351

gerritbot added a project: Patch-For-Review.Sep 10 2020, 11:00 AM

daniel updated the task description. (Show Details)Sep 10 2020, 11:56 AM

daniel moved this task from Doing(WIP:5) to Waiting for Review on the Platform Team Workboards (Clinic Duty Team) board.Sep 10 2020, 7:34 PM

daniel updated the task description. (Show Details)

Kizule mentioned this in T262749: Add s3.amazonaws.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons.Sep 13 2020, 11:34 AM

Kizule merged a task: T262749: Add s3.amazonaws.com to the wgCopyUploadsDomains allowlist of Wikimedia Commons.

Kizule added subscribers: AlbanGeller, Nintendofan885, Kizule.

Change 626351 merged by jenkins-bot:
[mediawiki/core@master] Manually redirect in UploadFromUrl

https://gerrit.wikimedia.org/r/626351

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.10; 2020-09-22).Sep 15 2020, 6:00 PM

Maintenance_bot removed a project: Patch-For-Review.Sep 15 2020, 6:11 PM