
Click on fullImageLink <a> for PDF on File: page no longer rendering in browser
Closed, Resolved · Public

Description

It appears that clicking the main thumbnail for a PDF spawns a file download dialog box instead of the PDF rendering inline in desktop UAs capable of rendering PDFs.

Something seems amiss with the content-type header. To reproduce, visit https://commons.wikimedia.org/wiki/File:Contributing_to_MediaWiki_Extensions.pdf and click on the thumbnail.

Expected: when I click on the thumbnail the PDF should load inline in the browser if it's a supported file type.

Some headers from curl:

$ curl --head https://upload.wikimedia.org/wikipedia/commons/9/91/Colombian_Air_Force_Sikorsky_UH-60L_Arp%C3%ADa_III_%28S-70A-41%29_Ram%C3%ADrez-1.jpg 2>/dev/null | grep 'content-type'

content-type: image/jpeg

$ curl --head https://upload.wikimedia.org/wikipedia/commons/d/da/Contributing_to_MediaWiki_Extensions.pdf 2>/dev/null | grep 'content-type'

content-type: application/x-www-form-urlencoded

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Aklapper I'm unsure if it's PdfHandler. @matmarex it has a header match, but I'm not sure; when I spot-checked yesterday it seemed to be mainly PDFs, not images, though to be fair it's hard to say. Pinging @MarkTraceur @MaxSem @Tgr @brion: anybody have ideas on what might be going on and how to resolve it?

PdfHandler generates thumbnails; I don't think it interacts with the download of originals in any way.

Broken in swift, or upstream from there:

tgr@terbium:~$ curl -vso /tmp/foo -H 'Host: upload.wikimedia.org' 'http://ms-fe1005.eqiad.wmnet/wikipedia/commons/d/da/Contributing_to_MediaWiki_Extensions.pdf' |& grep -i content-type
< Content-Type: application/x-www-form-urlencoded

Quite possible, @fgiunchedi can check. He already noticed today that some objects on codfw hadn't been cleaned up when their eqiad counterparts had. It's possible that the cleanup job failed silently for some objects.

AFAIUI this issue is related to a wrong c-t header either being sent at upload time or changed afterwards, IOW not related to the x-content-dimensions cleanup.

swift doesn't mangle/change c-t as far as I know, so the header is wrong either at upload time or changed afterwards. application/x-www-form-urlencoded suggests a "form upload" from a browser, so perhaps depending on the uploader's browser the results change?

Initially I thought this could be time-related (i.e. older files get changed by some background process), but that doesn't seem to be the case, e.g.:

And both were uploaded at roughly the same time. All things being equal, what I can think of is either something on the user side, like the browser, or something specific to the file itself.

Ramsey-WMF moved this task from Untriaged to Triaged on the Multimedia board.

Also note that, as discovered in T131012 (SVGs without XML prolog (<?xml ... ?>) served with Content-Type text/html from upload.wikimedia.org), it looks like the file backend does its own MIME detection, so that could play a part too.

Ramsey-WMF subscribed.

Assigning to Cormac for further investigation.

The problem was likely caused by the maintenance script refreshFileHeaders.php being run for T175689, which would have caused the stored Content-Type header in swift to be overwritten with 'application/x-www-form-urlencoded' for all refreshed files.

Here's how:

  • refreshFileHeaders.php selects rows from the image table in the db
  • it loops through them, and calls File::getContentHeaders() on each - with the change for T175689 this would have returned [ 'X-Content-Dimensions' => '' ] for each file
  • because File::getContentHeaders() returns a non-empty array, SwiftFileBackend::doDescribeInternal() is called for each relevant file
  • doDescribeInternal() merges the existing headers for the file (read via a HEAD request to swift) with the new X-Content-Dimensions header, sanitizes them and POSTs to swift
  • but SwiftFileBackend::sanitizeHdrs() removes the Content-Type header from the header array, and because the header is not set, libcurl automatically adds 'Content-Type: application/x-www-form-urlencoded' to the POST request to swift (see the sketch after this list)
  • ... which overwrites the Content-Type header for the file in swift
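
A minimal sketch of that libcurl behaviour (not the actual SwiftFileBackend code; it just POSTs an empty body to httpbin.org, which echoes the request headers back): when no Content-Type header is set explicitly, libcurl fills in 'application/x-www-form-urlencoded', the value that ended up stored in swift.

<?php
// Minimal sketch, assuming network access to https://httpbin.org/post.
$ch = curl_init('https://httpbin.org/post');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, ''); // empty body, like the swift metadata POST
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// No Content-Type header is set here, so libcurl adds its default
// 'Content-Type: application/x-www-form-urlencoded' to the request.
echo curl_exec($ch);
curl_close($ch);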

The solution is to either remove the sanitization of the Content-Type header, or to add a method that sanitizes the other headers but leaves Content-Type alone.
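
To make the second option concrete, here is a rough sketch of what such a helper could look like (the function name and the allowed-header rules are hypothetical, not the real SwiftFileBackend code):

<?php
// Hypothetical sketch: sanitize headers for a metadata POST to swift,
// but keep Content-Type so libcurl never substitutes its own default.
function sanitizeHdrsKeepContentType( array $headers ) {
    $out = [];
    foreach ( $headers as $name => $value ) {
        $name = strtolower( $name );
        if ( $name === 'content-type' || strpos( $name, 'x-object-meta-' ) === 0 ) {
            $out[$name] = $value; // keep Content-Type and object metadata headers
        }
    }
    return $out;
}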

Any thoughts @Gilles @fgiunchedi @aaron ?

Is it possible to avoid libcurl adding that header automatically? Presumably if that header wasn't set in the update call, it wouldn't be touched by Swift.

It won't help with fixing the broken data, but I think it would be the most future-proof thing to do. The existing code has the right intent, i.e. don't mess with existing headers unintentionally.

Is it possible to avoid libcurl adding that header automatically? Presumably if that header wasn't set in the update call, it wouldn't be touched by Swift.

Correct, if the header isn't set it won't be touched by Swift.

It won't help with fixing the broken data, but I think it would be the most future-proof thing to do. The existing code has the right intent, i.e. don't mess with existing headers unintentionally.

+1

Seems to work fine to me:

>>> $ch = curl_init();
=> curl resource #188
>>> curl_setopt($ch, CURLOPT_URL, 'https://httpbin.org/post');
=> true
>>> curl_setopt($ch, CURLOPT_POST, true);
=> true
>>> curl_setopt($ch, CURLOPT_POSTFIELDS, []);
=> true
>>> curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type:']);
=> true
>>> curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
=> true
>>> curl_exec($ch)
=> """
   {\n
     "args": {}, \n
     "data": "", \n
     "files": {}, \n
     "form": {}, \n
     "headers": {\n
       "Accept": "*/*", \n
       "Connection": "close", \n
       "Content-Length": "0", \n
       "Host": "httpbin.org"\n
     }, \n
     "json": null, \n
     "origin": "<redacted>", \n
     "url": "https://httpbin.org/post"\n
   }\n
   """

We should be mindful of the Swift post_as_copy option when set to false. At the moment that does *not* allow changing Content-Type via POST.

The default is true but may change to false in newer releases. I don't see us changing that in puppet, so atm it would work for us if sanitizeHdrs() no longer blacklisted Content-Type (for all callers).

So, the post_as_copy = true case works if SwiftFileBackend no longer blacklists Content-Type for non-PUTs. It would always re-assert the old value if nothing was passed in by the describe() caller.

I think we can test the failure mode of post_as_copy = false with Content-Type on POST by using a VM (maybe reconfigure deployment-prep swift). If the header is simply ignored, then we can change SwiftFileBackend to no longer blacklist it for non-PUTs, just to avoid curl setting a default and breaking the post_as_copy case. That would make both cases work. On the other hand, if the post_as_copy = false case errors out, then this won't work.
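
For reference, that test could look something like the sketch below (assumptions: a local swift proxy on port 8080, a valid auth token, and placeholder account/container/object names):

<?php
// Hypothetical sketch of the test described above: POST a Content-Type
// to an existing object on a swift VM with post_as_copy = false and check
// whether the request is accepted, ignored, or rejected.
$ch = curl_init('http://127.0.0.1:8080/v1/AUTH_test/some-container/some-object.pdf');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'X-Auth-Token: <token>',          // placeholder token
    'Content-Type: application/pdf',  // the value we want swift to keep
]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
// Swift normally answers an object POST with 202 Accepted; anything else
// here would suggest the post_as_copy = false case errors out.
echo curl_getinfo($ch, CURLINFO_HTTP_CODE), "\n";
curl_close($ch);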

@Tgr here's the script I used

<?php
  
$s = curl_init();
curl_setopt($s,CURLOPT_URL,'http://127.0.0.1:8080/');
curl_setopt($s,CURLOPT_RETURNTRANSFER,true);
curl_setopt($s,CURLOPT_POST,true);
curl_setopt($s,CURLOPT_POSTFIELDS,[]);
curl_setopt($s,CURLOPT_HTTPHEADER,['Content-Type:']);
$x = curl_exec($s);
echo $x;
curl_close($s);

The destination url just had

<?php
var_dump(getallheaders());exit;

and the output is

array(5) {
  'Authorization' =>
  string(0) ""
  'Content-Type' =>
  string(0) ""
  'Accept' =>
  string(3) "*/*"
  'Host' =>
  string(14) "127.0.0.1:8080"
  'Content-Length' =>
  string(1) "0"
}

I don't see what we're doing differently :/

Actually, I tried sending 'Content-Type:' and it results in swift not updating the content type (equivalent to not sending the header), but this behaviour isn't documented as far as I can see (https://developer.openstack.org/api-ref/object-store/index.html), so I guess it's best not to rely on it.

Change 392413 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/core@master] Do not strip Content-Type header for POST requests to swift

https://gerrit.wikimedia.org/r/392413

@aaron I tested post_as_copy = false on my local VM (updated it in /etc/swift/proxy-server.conf) and it works ok ... can't be 100% sure I'm doing it correctly though

The patch prevents new bad data being written, but I'm not sure how to go about fixing the existing bad data.

If we had a log of every time refreshFileHeaders.php was run, we could probably deduce which files were written to incorrectly by that script, but that probably won't catch everything. Any thoughts/suggestions?

Hopefully anyone running it would have mentioned it in SAL. @Gilles do you know exactly which files you updated?

All potentially multipage documents on all wikis were processed: PDFs, DJVUs, TIFFs. The migration was run one file type at a time, using the media_type, major_mime and minor_mime parameters in refreshFileHeaders for efficient underlying SQL queries.

Added a change to refreshFileHeaders.php to the patch so that it can be used to repair the Content-Type header in swift (by reconstructing it from img_major_mime and img_minor_mime in the database).
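
The reconstruction itself is straightforward; a minimal sketch of the idea (illustrative only, not the actual patch code; $row stands for a row selected from the image table):

<?php
// Rebuild the Content-Type value from the image table's MIME columns.
$row = [ 'img_major_mime' => 'application', 'img_minor_mime' => 'pdf' ];
$contentType = $row['img_major_mime'] . '/' . $row['img_minor_mime'];
echo $contentType . "\n"; // 'application/pdf', the header to re-assert in swift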

Change 393603 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/core@master] Allow file headers to be refreshed from database

https://gerrit.wikimedia.org/r/393603

Change 393603 merged by jenkins-bot:
[mediawiki/core@master] Allow file headers to be refreshed from database

https://gerrit.wikimedia.org/r/393603

Change 392413 merged by jenkins-bot:
[mediawiki/core@master] Do not strip Content-Type header for POST requests to swift

https://gerrit.wikimedia.org/r/392413

@Gilles @fgiunchedi I think I've done all I can from a dev perspective on this now - someone on the ops side will have to run the refreshFileHeaders.php script on production with the --refreshContentType argument against the appropriate files. Is there something I need to do to make that happen?

@Cparle you should be able to do that yourself, on terbium.eqiad.wmnet. See this guide if you don't have shell access yet.

Once you've SSHed into that machine, create a named screen to hold the long-running script and just call "mwscript maintenance/refreshFileHeaders.php commonswiki" followed by the parameters you need. You might have to write a shell script to run that on all wikis. You can check out /home/gilles/T175689.sh on there, which is what I used when I ran the previous cleanup. The commented out section is what you need to run it over a list of wikis stored in a file (that file is also in my home directory).

Let me know if you get stuck; it'd be good for you to go through the motions, as being able to run maintenance scripts on production yourself is a very useful thing.

You might have to write a shell script to run that on all wikis.

If you don't particularly care about the different runs for the same wiki happening after each other, you can also just run foreachwikiindblist somewikilist.dblist maintenance/refreshFileHeaders.php <parameters> (which just calls mwscript maintenance/refreshFileHeaders.php --wiki=$DB <parameters> for each database ID in the list + adds some extra output to make it easy to tell the wikis apart).

Would it make sense to run this script automatically via a puppetized cronjob? Or does it need manual interaction? I mean it would always be good to reduce the need for manual things and just automate it, and I'd be happy to help get another cron on the maintenance servers.

We only want this to run once. I've never felt the need to automate this sort of job, starting a screen and running the command is very simple.

starting a screen and running the command is very simple.

How long do these screen sessions typically run? I am asking because if they run for more than about a day they will cause alerts in Icinga. (Though we already whitelisted maintenance servers, so never mind :)

It usually takes days, yes. Would leaving a message in the SAL be enough warning?

IIRC @jcrespo asked that long-running scripts using the DB be mentioned on the Deployments page. Or is that only about the ones which do writes?

I think the bottleneck here is Swift being written to, the DB queries being read would be negligible DB load, IMHO.

If they do long-running SELECTs, those are important for us too, as they block DMLs. The same goes if the script keeps a single connection open and does not close it, because that blocks depooling that server. Adding things to Deployments never hurts; it helps us be aware of ongoing jobs.

Ok. It won't take less than an hour though; it'll take days.

Is that ok @jcrespo ? Is a one-liner still ok for a job that will probably take a week (or longer)?

@Cparle you put it under "Week of:" for each week that it will be running. That will give us a heads-up that there may be activity there, and we will double-check before doing incompatible maintenance (like a schema change or something else). We do the same precisely to check/warn about that.

Note there are weeks with no deployments allowed; to extend over those, you would ask Release Engineering for permission, or pause over those weeks.

Ok, it's not very urgent, so I've put it on the page for the week of Jan 8.

Mentioned in SAL (#wikimedia-operations) [2018-01-08T16:28:29Z] <cormacparle> About to run refreshFileHeaders.php on all wikis to fix https://phabricator.wikimedia.org/T178849