
upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404
Open, Low, Public

Description

We've had a surge of 5xx responses lately. While investigating it, I found a bunch of 501s ("method not implemented") for truncated URLs coming from an external referer:

{
  "hostname": "cp4014.ulsfo.wmnet",
  "sequence": 6204195515,
  "dt": "2015-07-22T12:34:33",
  "time_firstbyte": 0.078635931,
  "ip": "10.128.0.114",
  "cache_status": "miss",
  "http_status": "501",
  "response_size": 235,
  "http_method": "GET",
  "uri_host": "upload.wikimedia.org",
  "uri_path": "/wikipedia/th/thumb/3/36/%E0%B9%82%E0%B8%A5%E0%B9%82%E0%B8%81%E0%B9%89%E0%B8%AA%E0%B9%82%E0%B8%A1%E0%B8%AA%E0%B8%A3%E0%B8%9F%E0%B8%B8%E0%B8%95%E0%B8%9A%E0%B8%AD%E0%B8%A5%E0%B8%AD%E0%B8%B2%E0%B8%A3%E0%B9%8C%E0%B8%A1%E0%B8%B5%E0%B9%88_%E0%B8%A2%E0%B8%B9%E0%B9%84%E0%B8%99%E0%B9%80%E0%B8%95%E0%B9%87%E0%B8%94.jpg/250px-%E0%B9%82%E0%B8%A5%E0%B9%82%E0%B8%81%E0%B9%89%E0%B8%AA%E0%B9%82%E0%B8%A1%E0%B8%AA%E0%B8%A3%E0%B8%9F%E0%B8%B8%E0%B8%95%E0%B8%9A%E0%B8%AD%E0%B8%A5%E0%B8%AD%E0%B8%B2%E0%B8%A3%E0%B9%",
  "uri_query": "",
  "content_type": "text/html; charset=UTF-8",
  "referer": "http://www.xn--72c0as5bd1c5b2byj.com/",
  "x_forwarded_for": "171.96.167.13",
  "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36",
  "accept_language": "en-US,en;q=0.8",
  "x_analytics": "-",
  "range": "-",
  "x_cache": "cp1050 miss (0), cp4014 miss (0), cp4014 frontend miss (0)"
}

The correct behaviour would've been to issue a 404 I think.
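
When sifting through log entries like the one above, a truncated path can often be spotted by an incomplete percent-escape (the `uri_path` above ends in a bare `%`). A minimal sketch of such a check follows; `has_truncated_escape` is a hypothetical helper for log triage, not something Varnish or MediaWiki provides, and it only catches truncation that lands inside an escape sequence.

```python
import re

# Matches a '%' that is NOT followed by two hex digits,
# i.e. an incomplete percent-escape such as a trailing "%" or "%E".
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_truncated_escape(uri_path: str) -> bool:
    """Return True if the path contains an incomplete percent-escape."""
    return _BAD_ESCAPE.search(uri_path) is not None
```

A path cut off between two complete escapes (for example in the middle of a multi-byte UTF-8 character) would pass this check, so it is a heuristic rather than a full validity test.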

Event Timeline

Joe created this task. Jul 22 2015, 12:52 PM
Joe raised the priority of this task to Low.
Joe updated the task description.
Joe added projects: acl*sre-team, Varnish.
Joe added a subscriber: Joe.
Restricted Application added subscribers: Matanya, Aklapper. Jul 22 2015, 12:52 PM
jcrespo added a subscriber: jcrespo.

The main issue here, in my opinion, is the log noise this creates, so I am associating the relevant project.

Tgr added a subscriber: Tgr. Aug 25 2015, 2:56 AM

thumb.php does not ever return 501 as far as I can see. I would expect a truncated URL to be caught by the parameter check which returns a 400. Does this error come from Varnish somehow? It does not seem sensible to reply with 501 to a GET request.

Joe added a comment. Aug 25 2015, 7:59 AM

@Tgr yes, this problem is at the Varnish level.

Tgr added a comment. Edited Oct 9 2015, 10:08 PM

(Image in question: link)

Maybe it's from MediaViewer's thumbnail URL guessing? That would only happen, though, if MediaViewer is opened by clicking on a full-size image and then tries to fetch a thumbnail, and I can't really see when that would happen. If you can check the request headers, you can recognize MediaViewer requests because they have an Origin header. Alternatively, can you tell what the referer and user agent are?

Aklapper renamed this task from "upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404" to "upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404". Oct 13 2015, 11:32 AM
Aklapper set Security to None.
Aklapper added subscribers: Steinsplitter, Base, intracer.

Base added a comment. Oct 13 2015, 11:53 AM

@jcrespo could you kindly point to where the correct usage of the API you are talking about is documented? I tried to find something on mww yesterday, when I came across the merged task, but failed.

I would say we should redirect to the correct URL instead of returning a 400.

> @jcrespo could you kindly point to where the correct usage of the API you are talking about is documented? I tried to find something on mww yesterday, when I came across the merged task, but failed.

The imageinfo module of the API. You should ask it for the thumbnail URL of a specific file.
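
As a rough illustration of asking imageinfo for a thumbnail URL, the sketch below builds the query URL and pulls `thumburl` out of a decoded JSON response. The endpoint constant and both function names are assumptions made for the example, and no network request is performed here.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; any MediaWiki api.php URL would do.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def imageinfo_query_url(title, width, endpoint=API_ENDPOINT):
    """Build an imageinfo query asking for a thumbnail of the given width."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,
    }
    return endpoint + "?" + urlencode(params)


def thumb_urls(response):
    """Map page title -> thumburl from a decoded imageinfo JSON response."""
    pages = response.get("query", {}).get("pages", {})
    urls = {}
    for page in pages.values():
        info = page.get("imageinfo")
        # Pages without image info (e.g. missing files) are skipped.
        if info and "thumburl" in info[0]:
            urls[page["title"]] = info[0]["thumburl"]
    return urls
```

Taking the thumbnail URL from the response, rather than assembling it by hand, avoids having to know how long file names are abbreviated.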

Tgr added a comment. Oct 13 2015, 6:46 PM

See FileRepo::nameForThumb() for how the thumbnail file name (the part after the /) is generated. IIRC abbrvThreshold is 200 for Wikimedia sites.

But as Bawolff says you should use the API unless you are super interested in performance (and if you don't use it, you are going to have harder problems than this).
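
For readers who want a feel for the abbreviation rule, here is a rough Python sketch. It is not copied from MediaWiki: the byte-length comparison, the `thumbnail.<ext>` fallback, and the function names are assumptions about what FileRepo::nameForThumb() does, and the `<width>px-` prefix is inferred from the URL in the task description. The threshold is passed in explicitly because its value depends on the site configuration.

```python
import os


def name_for_thumb(name: str, abbrv_threshold: int) -> str:
    """Abbreviate over-long file names (assumed behaviour, see lead-in)."""
    # Assumption: the length is measured in bytes, so non-ASCII names
    # (e.g. Thai titles) reach the threshold sooner than ASCII ones.
    if len(name.encode("utf-8")) > abbrv_threshold:
        ext = os.path.splitext(name)[1].lstrip(".")
        return f"thumbnail.{ext}" if ext else "thumbnail"
    return name


def thumb_file_name(name: str, width: int, abbrv_threshold: int) -> str:
    """Build the last path segment of a thumbnail URL, e.g. '250px-Foo.jpg'."""
    return f"{width}px-{name_for_thumb(name, abbrv_threshold)}"
```

A client that guesses thumbnail URLs without this rule, or with the wrong threshold, would request paths that do not exist, which is one more reason to prefer the API.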

intracer added a comment. Edited Oct 15 2015, 3:57 AM

Just did a quick test with one thumbnail: it takes about 200 ms to get an uncached thumbnail, 40 ms to get a cached thumbnail, and about 200 ms to get one thumbnail URL via the API.
https://en.wikipedia.org/w/api.php?action=query&titles=File:Albert_Einstein_Head.jpg&prop=imageinfo&iiprop=url&iiurlwidth=220
That doesn't seem like much, even though the total response time increases by a factor of 2 to 6.
Another question is how many thumbnail URLs can be queried in one batch before the delay becomes noticeable.
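
On the batching question, several titles can be sent in one request by joining them with `|`. The sketch below only builds the request URLs; `batched_imageinfo_urls` is a hypothetical helper, and the default batch size of 50 is an assumption about the per-request title limit for ordinary clients, so check the API's own limits before relying on it.

```python
from urllib.parse import urlencode


def batched_imageinfo_urls(titles, width, endpoint, batch_size=50):
    """Return one imageinfo query URL per batch of titles."""
    urls = []
    for start in range(0, len(titles), batch_size):
        chunk = titles[start:start + batch_size]
        params = {
            "action": "query",
            "format": "json",
            "titles": "|".join(chunk),  # multiple titles are pipe-separated
            "prop": "imageinfo",
            "iiprop": "url",
            "iiurlwidth": width,
        }
        urls.append(endpoint + "?" + urlencode(params))
    return urls
```

Timing these batched requests against single-title requests would give a direct answer to how the delay grows with batch size.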

Tgr added a comment. Edited Oct 15 2015, 7:04 AM

Yes, the imageinfo delay for the median user is around 200 ms (or was when we last measured it - that was a while ago). If you show an interactive interface, that's nontrivial; if you do some kind of batch processing there is no reason to care.

> See FileRepo::nameForThumb() for how the thumbnail file name (the part after the /) is generated. IIRC abbrvThreshold is 200 for Wikimedia sites.

160 for commons at least
https://gerrit.wikimedia.org/r/#/c/168239/1/wmf-config/filebackend.php

intracer added a comment. Edited Nov 12 2015, 6:18 AM

I don't know if this is the correct place to write this. We assumed that we could use images directly from Commons on external websites, such as the Wikimedia Ukraine blog and the Wiki Loves Earth/Wiki Loves Monuments blog. Other sites may assume this too and try to use images directly from Commons.

Now I see that with Thumbor and hashes, URLs can change a lot. Maybe that is really needed, and maybe it is not safe to assume a stable URL.
However, for articles I don't like sites that change their URLs, so that references from Wikipedia articles and other sites no longer work.

Maybe Commons should clearly state that you can freely use images from Commons unless you insert them directly by URL?

ema moved this task from Triage to Caching on the Traffic board. Sep 30 2016, 3:14 PM