Advanced deadlink detection for Cyberbot (tracking)
Closed, ResolvedPublic

Description

Max has compiled a list of some links tagged as dead so far by the bot on https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Cyberbot_II_5a

Several of them are not actually dead.
Several of them are behind paywalls, and hence tagged as dead.

We need to do some digging to find out the reason for the not-dead-but-tagged-as-dead links.

Niharika created this task. · Apr 13 2016, 5:27 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Apr 13 2016, 5:27 PM

@Niharika: Do you have any examples of the pages behind paywalls?

http://www.britishnewspaperarchive.co.uk/viewer/bl/0000510/19030226/074/0006 (There are several links from this same website, all marked as dead.)

Niharika added a comment (edited). · Apr 14 2016, 8:22 AM

@kaldari, I saw your comments, and I found multiple such URLs which return a 404/405 via curl but a 200 in a browser. The reason behind this seems to be a missing User-Agent header. The obvious fix is to spoof a user-agent, such as:

curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

However I'm not sure of the ethical (and legal?) implications of this, if any.

You can test this as well:

curl -I -s -L http://www.oscars.org/oscars/ceremonies/1966

returns a 403 Forbidden

while...

curl -I -s -L http://www.oscars.org/oscars/ceremonies/1966 -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0'

returns a 200 OK

@Niharika: That's a good idea. Let's set the User-Agent string as a const in the checkIfDead class and use that with the curl requests. I don't believe there are any legal or ethical problems with spoofing user-agent strings. Internet Explorer has been spoofing Netscape Navigator since 1998 :)
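The const-based approach could look something like the sketch below. Note this is a hypothetical illustration in Python (the actual checkIfDead class is PHP, and the class and method names here are assumptions, not the bot's real API); the point is simply keeping one spoofed User-Agent constant and attaching it to every outgoing request.

```python
import urllib.request

class CheckIfDead:
    # Spoofed browser User-Agent, kept in one place as a class constant.
    USER_AGENT = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) "
        "Gecko/20100101 Firefox/27.0"
    )

    def build_request(self, url):
        # Attach the constant UA header to each outgoing request,
        # so servers that reject UA-less clients return 200 instead of 403.
        return urllib.request.Request(
            url, headers={"User-Agent": self.USER_AGENT}
        )

checker = CheckIfDead()
req = checker.build_request("http://www.oscars.org/oscars/ceremonies/1966")
print(req.get_header("User-agent"))
```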

Strangely, this didn't fix the 404 I get for http://gym.longinestiming.com/File/000002030000FFFFFFFFFFFFFFFFFF01, but it does seem to fix others.

DannyH triaged this task as Normal priority. · Apr 15 2016, 4:57 PM

@Niharika: Needs 1 additional change. See comment at pull request.

Sorry, I missed this. https://github.com/cyberpower678/Cyberbot_II/pull/14

DannyH renamed this task from Fine-tune Cyberbot deadlink detection to Fine-tune Cyberbot deadlink detection (tracking). · Apr 18 2016, 5:51 PM
DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint.
DannyH moved this task from Untriaged to Epic/Tracking on the Community-Tech board.
DannyH renamed this task from Fine-tune Cyberbot deadlink detection (tracking) to Advanced deadlink detection for Cyberbot (tracking). · Apr 18 2016, 10:30 PM
Restricted Application added a subscriber: TerraCodes. · View Herald Transcript · Apr 19 2016, 8:30 PM

How dare you guys not subscribe me. :p

We should consider randomizing the UAs. We generate a list of UAs used by legitimate browsers and randomly pick one from the list for each request. That should keep Cyberbot functional when scanning pages.

Cyberpower678 closed this task as Resolved. · Jul 7 2016, 9:18 PM
Cyberpower678 claimed this task.

With nothing left in the tracking ticket, it's time to close this.

DannyH moved this task from Epic/Tracking to Archive on the Community-Tech board. · Jul 11 2016, 10:59 PM