Figure out what causes weird false positive for CheckIfDead
Closed, Resolved · Public · 5 Story Points

Description

For some reason
http://www.eonline.com/au/news/386489/2013-grammy-awards-winners-the-complete-list
typically (but not always) returns a 405 Method Not Allowed error when accessed from curl, but returns 200 OK when accessed from a web browser. We should figure out why this happens as it will probably be true of other sites as well.

We should try various curl options, make sure we are handling cookies appropriately, etc.

kaldari created this task. Apr 18 2016, 5:30 PM
Restricted Application added a subscriber: Aklapper. Apr 18 2016, 5:30 PM
DannyH triaged this task as Normal priority. Apr 18 2016, 5:33 PM
DannyH added a subscriber: DannyH.
kaldari raised the priority of this task from Normal to High. Apr 18 2016, 5:35 PM
DannyH set the point value for this task to 5. Apr 18 2016, 5:36 PM

According to a quick investigation by Bryan, the server does not support HEAD requests and responds with an error when it receives one. One possible solution is to retry with a full GET request (fetching the body) whenever the HEAD request returns 405.

bd808 added a subscriber: bd808. Edited Apr 18 2016, 5:58 PM
$ curl --head --silent --location http://www.eonline.com/au/news/386489/2013-grammy-awards-winners-the-complete-list
HTTP/1.1 405 Method Not Allowed
Server: Apache/2.2.3 (CentOS) mod_jk/1.2.32 PHP/5.1.6
Allow: GET
Content-Language: en
Content-Length: 1090
Content-Type: text/html;charset=ISO-8859-1
Cache-Control: max-age=60
Expires: Mon, 18 Apr 2016 17:49:41 GMT
Date: Mon, 18 Apr 2016 17:48:41 GMT
Connection: keep-alive
Set-Cookie: adEdition=us; expires=Tue, 19-Apr-2016 17:48:41 GMT; path=/; domain=.eonline.com
Set-Cookie: geoEdition=us; expires=Tue, 19-Apr-2016 17:48:41 GMT; path=/; domain=.eonline.com
Vary: User-Agent

Dropping the --head option gives a 200 OK response. This indicates that the server at www.eonline.com doesn't support the HEAD verb. I think that this is actually not going to be uncommon in general. The use of HEAD requests by the bot is a good idea generally, but it should probably treat any 4xx other than 404 and any 5xx response code to HEAD requests as a signal to try a normal GET request.
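
The rule proposed in this comment can be written as a small predicate. This is only a minimal sketch of the idea, not the CheckIfDead code; the function name `should_retry_with_get` is hypothetical and does not come from the task.

```python
def should_retry_with_get(head_status: int) -> bool:
    """Decide whether a HEAD response should be double-checked with a GET.

    Follows the rule suggested in the comment: a 404 is taken at face
    value, while any other 4xx or any 5xx status triggers a follow-up GET.
    (Hypothetical helper, for illustration only.)
    """
    if head_status == 404:
        return False  # assume the resource is genuinely missing
    return 400 <= head_status <= 599


# The eonline.com case from the curl output: HEAD returned 405.
print(should_retry_with_get(405))
```

Under this rule a 2xx or 3xx answer to HEAD is accepted as-is, so the extra GET is only paid for when the HEAD result is suspicious.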

The GET should only be done if HEAD proves to be fruitless. If a server doesn't support HEAD, it would return a 405, so why treat 5xx codes, or 4xx codes other than 404 and 405, as a sign of a bad HEAD request? I'm a little confused.

bd808 added a comment. Apr 19 2016, 1:04 AM

The GET should only be done if HEAD proves to be fruitless. If a server doesn't support HEAD, it would return a 405, so why treat 5xx codes, or 4xx codes other than 404 and 405, as a sign of a bad HEAD request? I'm a little confused.

Reliance on the 405 status code is assuming that the backing server actually uses the proper HTTP status code for its response. I'm pretty sure that you are going to run into lots of them that do not. I think it should be safe to assume that a 404 response to a HEAD request means that the requested resource does not exist, but I don't think it is safe to assume that only 405 will be used to signal "OMG I never read the HTTP spec and I have no idea what to do with the HEAD verb".
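
To make the argument concrete, here is a rough sketch of the whole HEAD-then-GET flow. It assumes a caller-supplied `fetch(method, url)` callable that follows redirects and returns the final HTTP status code; that callable, the name `is_link_dead`, and the fake server below are all hypothetical and are not taken from the CheckIfDead code.

```python
from typing import Callable

# Assumed interface: (method, url) -> final HTTP status code after redirects.
Fetch = Callable[[str, str], int]


def is_link_dead(url: str, fetch: Fetch) -> bool:
    """Try HEAD first; fall back to GET unless the HEAD answer is trusted."""
    status = fetch("HEAD", url)
    if status < 400:
        return False  # HEAD succeeded, so the link is considered alive
    if status == 404:
        return True   # trusted: the resource does not exist
    # Any other 4xx/5xx may only mean the server mishandles HEAD,
    # whether or not it reports that with a proper 405.
    return fetch("GET", url) >= 400


def fake_head_hostile_server(method: str, url: str) -> int:
    """Stand-in for a server that rejects HEAD but serves GET normally."""
    return 405 if method == "HEAD" else 200


print(is_link_dead("http://example.invalid/page", fake_head_hostile_server))
```

Passing the fetcher in as a parameter keeps the decision logic separate from the HTTP client, so the same flow can be exercised against fake servers that answer HEAD with 403, 500, or 501 instead of 405.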

Niharika claimed this task. Apr 26 2016, 8:23 AM
Niharika moved this task from Ready to In Development on the Community-Tech-Sprint board.

Just one minor suggestion (inline at GitHub).

kaldari closed this task as Resolved. Apr 29 2016, 9:20 PM
kaldari moved this task from Needs Review/Feedback to Done on the Community-Tech-Sprint board.

After I modified Niharika's code, it still generates the correct output, but for some reason the tests fail.

DannyH moved this task from Untriaged to Archive on the Community-Tech board.