
Figure out what causes weird false positive for CheckIfDead
Closed, Resolved | Public | 5 Estimated Story Points

Description

For some reason
http://www.eonline.com/au/news/386489/2013-grammy-awards-winners-the-complete-list
typically (but not always) returns a 405 Method Not Allowed error when accessed from curl, but returns 200 OK when accessed from a web browser. We should figure out why this happens as it will probably be true of other sites as well.

We should try various curl options, make sure we are handling cookies appropriately, etc.

Event Timeline

DannyH triaged this task as Medium priority. (Apr 18 2016, 5:33 PM)
DannyH subscribed.
kaldari raised the priority of this task from Medium to High. (Apr 18 2016, 5:35 PM)
DannyH set the point value for this task to 5. (Apr 18 2016, 5:36 PM)

According to a quick investigation by Bryan, the server does not support HEAD requests and gets confused when it receives one. One possible solution is to retry with a full GET request (fetching the body) when the HEAD request returns 405.

$ curl --head --silent --location http://www.eonline.com/au/news/386489/2013-grammy-awards-winners-the-complete-list
HTTP/1.1 405 Method Not Allowed
Server: Apache/2.2.3 (CentOS) mod_jk/1.2.32 PHP/5.1.6
Allow: GET
Content-Language: en
Content-Length: 1090
Content-Type: text/html;charset=ISO-8859-1
Cache-Control: max-age=60
Expires: Mon, 18 Apr 2016 17:49:41 GMT
Date: Mon, 18 Apr 2016 17:48:41 GMT
Connection: keep-alive
Set-Cookie: adEdition=us; expires=Tue, 19-Apr-2016 17:48:41 GMT; path=/; domain=.eonline.com
Set-Cookie: geoEdition=us; expires=Tue, 19-Apr-2016 17:48:41 GMT; path=/; domain=.eonline.com
Vary: User-Agent

Dropping the --head option gives a 200 OK response. This indicates that the server at www.eonline.com doesn't support the HEAD verb. I think that this is actually not going to be uncommon in general. The use of HEAD requests by the bot is a good idea generally, but it should probably treat any 4xx other than 404 and any 5xx response code to HEAD requests as a signal to try a normal GET request.
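The policy described above can be sketched in Python. This is only an illustration, not the bot's actual code: the names `should_retry_with_get`, `is_link_dead`, and `urllib_requester` are hypothetical, and treating any final status of 400 or above as "dead" is an assumption made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical type: takes (method, url) and returns the final HTTP status code.
Requester = Callable[[str, str], int]


def should_retry_with_get(head_status: int) -> bool:
    """Return True if the HEAD status is too ambiguous to trust on its own."""
    if head_status == 404:
        # A 404 is taken at face value: the resource does not exist.
        return False
    # Any other 4xx or any 5xx might just mean "HEAD is not supported here".
    return 400 <= head_status < 600


def is_link_dead(url: str, request: Requester) -> bool:
    """Check a URL with HEAD first and fall back to GET only when needed."""
    status = request("HEAD", url)
    if should_retry_with_get(status):
        status = request("GET", url)
    # Assumption for this sketch: any remaining 4xx/5xx means the link is dead.
    return status >= 400


def urllib_requester(method: str, url: str) -> int:
    """One possible Requester built on the standard library (follows redirects)."""
    req = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return err.code
```

Calling `is_link_dead(url, urllib_requester)` would run the check over the network. Connection-level failures (DNS errors, timeouts) are not handled here and would propagate as exceptions.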

The GET should only be done if HEAD proves to be fruitless. If a server doesn't support HEAD, it would return a 405, so why treat 5xx codes, or 4xx codes other than 404 and 405, as a sign of a bad HEAD request? I'm a little confused.

Relying on the 405 status code assumes that the backing server actually uses the proper HTTP status code for its response. I'm pretty sure that you are going to run into lots of servers that do not. I think it should be safe to assume that a 404 response to a HEAD request means that the requested resource does not exist, but I don't think it is safe to assume that only 405 will be used to signal "OMG I never read the HTTP spec and I have no idea what to do with the HEAD verb".
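To make the difference between the two proposals concrete, here is a small Python sketch that lists the HEAD status codes on which they disagree. The function names and the sample status codes are hypothetical, chosen only for illustration.

```python
from typing import Iterable, List


def retry_on_405_only(status: int) -> bool:
    """Narrow policy: fall back to GET only on 405 Method Not Allowed."""
    return status == 405


def retry_on_any_error_but_404(status: int) -> bool:
    """Broad policy: fall back to GET on any 4xx other than 404, and on any 5xx."""
    return status != 404 and 400 <= status < 600


def disagreements(statuses: Iterable[int]) -> List[int]:
    """Return the status codes for which the two policies make different choices."""
    return [
        s for s in statuses
        if retry_on_405_only(s) != retry_on_any_error_but_404(s)
    ]


if __name__ == "__main__":
    sample = (200, 403, 404, 405, 500, 501)
    print(disagreements(sample))
```

For this sample the policies differ on 403, 500, and 501. Those are the cases where a server that mishandles HEAD would be reported as dead under the narrow policy but rechecked with GET under the broad one.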

Just one minor suggestion (inline at GitHub).

After I modified Niharika's code, it generates the correct output, but the tests fail for some reason.