Page MenuHomePhabricator

*.archives.gov in wgCopyUploadsDomains allowlist doesn't work as URLs are HTTP 302 redirects to s3.amazonaws.com
Closed, ResolvedPublic

Description

Analysis:
The UploadFromUrl class prepends the body content of the redirect response together with the actual file content, which corrupts the file and causes an the MIME type to be detected as text/html (arguably correctly, since a client may indeed interpret the contents as HTML).

In order to fix this, the code responsible for saving the data chunks into a temporary file needs to be aware of the redirect, and reset the file contents before receiving the actual file contents.

Original report:
*.archives.gov is already in the whitelist. However, every file I attempt to upload from this domain gives the following error:

Cannot upload this file because Internet Explorer would detect it as "text/html", which is a disallowed and potentially dangerous file type.

There does not appear to be anything wrong with the files. I am guessing this is caused by the fact that archives.gov media is actually all stored in s3, and this is just an alias domain.

For example, the URL https://catalog.archives.gov/catalogmedia/lz/riverside/rg-075/561617/561617-3-07-01.jpg actually resolves to https://s3.amazonaws.com/NARAprodstorage/lz/riverside/rg-075/561617/561617-3-07-01.jpg

How can we correctly allow these files? Is this a bug in the domain whitelist, or can we resolve it by simply adding the s3 bucket path in some way? It seems like this would be s3.amazonaws.com/NARAprodstorage/, but none of the current whitelist examples are showing a URL path beyond the (sub)domain, so I'm not sure if that works?

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

Aklapper renamed this task from A fix for *.archives.gov in wgCopyUploadsDomains allowlist to *.archives.gov in wgCopyUploadsDomains allowlist doesn't work as URLs are HTTP 302 redirects to s3.amazonaws.com.Jul 16 2020, 8:21 AM

I'm afraid this is hard to solve... IIRC it's impossible to whitelist something smaller than a domain, and we're definitely not going to whitelist Amazon - that would be equivalent to whitelisting anything.

Urbanecm changed the task status from Open to Stalled.Jul 19 2020, 5:52 PM

IIRC it's impossible to whitelist something smaller than a domain, and we're definitely not going to whitelist Amazon

@Urbanecm: In that case it's either not stalled but declined, or there should be a subtask to implement URL fragments support in $wgCopyUploadsDomains?

I thought I created it... Seems I didn't, I'll do that when I return from lunch :).

I'm afraid this is hard to solve... IIRC it's impossible to whitelist something smaller than a domain, and we're definitely not going to whitelist Amazon - that would be equivalent to whitelisting anything.

Yes, definitely understand that consideration. I'm not sure how common it is, but I imagine there are other large image hosts using s3 (or similar products) in this way. Is an alternative approach to make the downloader properly follow any URL redirects before attempting download (i.e., check for 302 and download the target URL instead if so)? Assuming the URL redirect has been put in place by the trusted site in the whitelist, this seems safe, and would not require users to request a new whitelist entry. And also, since it's not their public domain, I am not sure we can count on organizations not to do things like changing names to buckets or having multiple.

AMooney changed the task status from Stalled to Open.Aug 25 2020, 12:44 PM
AMooney raised the priority of this task from Low to Medium.

@holger.knust can you take a look at this?

The issue is what gets stored here. Since the initial request is answered with a 302 response, the downloader saves the HTML portion of the response and then appends the actual image. When the file is evaluated for its MIME type, it is classified as text/html because of the initial 302 portion.

So, this is essentially a bug in how we handle redirects, which results in incorrect mime type detection? Does it have anything to do with the domain whitelist at all?

No, it does not since it is not checked during redirects.

Change 626351 had a related patch set uploaded (by Holger Knust; owner: Holger Knust):
[mediawiki/core@master] WIP: Manually redirect in UploadFromUrl

https://gerrit.wikimedia.org/r/626351

Change 626351 merged by jenkins-bot:
[mediawiki/core@master] Manually redirect in UploadFromUrl

https://gerrit.wikimedia.org/r/626351