Investigate causes for deadlinks found after trial
Closed, Resolved · Public · 5 Story Points

Niharika created this task. Jun 1 2016, 6:35 PM
Restricted Application added subscribers: Zppix, Aklapper. Jun 1 2016, 6:35 PM
Cyberpower678 triaged this task as High priority. Jun 1 2016, 7:00 PM
Cyberpower678 assigned this task to Niharika.
Cyberpower678 added a subscriber: MusikAnimal.
kaldari set the point value for this task to 5. Jun 2 2016, 5:11 PM
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
Niharika removed Niharika as the assignee of this task. Jun 6 2016, 1:52 PM

Up for grabs for now.

MusikAnimal added a comment. Edited Jun 6 2016, 8:51 PM

Alright, here's my assessment of the identified false positives, skipping over the paywall sites.

False false positives:

Can't repro the dead link behaviour:

In my testing, the bot code is reporting these links as being alive, which is correct. I might be doing this wrong. Here's how I tested it:

$cid = new checkIfDead();
$u = "https://website.com";
var_dump( $cid->checkDeadlinks( array( $u ) ) );
// also tried:
var_dump( $cid->checkDeadlinks( array( $u ), true ) );

For these I got:

array(2) {
  ["results"]=>
  array(1) {
    [0]=>
    bool(false)
  }
  ["errors"]=>
  array(1) {
    [0]=>
    string(0) ""
  }
}

which suggests the link is alive. That being said, it shouldn't have been reported at Cyberbot II 5a/DB Results.
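For anyone repeating the test, here is a minimal sketch of how that dump can be read programmatically. It assumes the return shape shown above, with `true` in `results` meaning dead and `false` meaning alive, and the matching `errors` entry holding the error text. The helper name `summarizeCheck` is hypothetical and not part of checkIfDead.

```php
<?php
// Hypothetical helper, not part of the checkIfDead class.
// Assumes 'results' maps index => bool (true = dead, false = alive)
// and 'errors' maps the same index => error string (empty if none).
function summarizeCheck( array $urls, array $output ) {
    $summary = array();
    foreach ( $urls as $i => $url ) {
        $dead = !empty( $output['results'][$i] );
        $error = isset( $output['errors'][$i] ) ? $output['errors'][$i] : '';
        $summary[$url] = array(
            'state' => $dead ? 'dead' : 'alive',
            'error' => $error,
        );
    }
    return $summary;
}

// Feed in the same structure as the var_dump() output above.
$urls = array( 'https://website.com' );
$output = array(
    'results' => array( false ),
    'errors' => array( '' ),
);
print_r( summarizeCheck( $urls, $output ) );
```

Keying the summary by URL makes it easier to compare a batch of results against the reported list by eye.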

Questionable behaviour:

Legitimate false positives:

Returning a 500:

Returning a 404:

Returning a 301 (fixed by pull #18):

Returning a 200 (all fixed by pull #18):

@Cyberpower678 Regarding the "Can't repro" links, are we sure we're using the latest version of checkIfDead?

My next step is to further analyze the "legitimate false positives".

It's possible Cyberbot never rechecked them. Lemme search the DB.

The first link in that list of yours was last checked on 2016-04-23 03:49:09. It currently has a live state of 1, which Cyberbot still considers alive.

They were all checked on the same day, which dates back to the first trial. Cyberbot never got around to rechecking them, since the second trial used only 2 workers instead of the current 28.

This PR fixes four of the aforementioned "legitimate false positives": https://github.com/cyberpower678/Cyberbot_II/pull/18/files

PR merged. Cyberpower is going to generate a new list.

@MusikAnimal: It would be good to check the false positives above before they get too stale.

@kaldari From your comment above, are you sure you were looking at the right list? I don't see those URLs at https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Cyberbot_II_5a/DB_Results

I went through not only each domain, but each sub-path of each domain, and here are the false positives:

Out of 1,000 I'd say this is pretty damn good :)

I was actually starting to look through myself, but I guess I don't need to anymore. :p

Feeding the links through the URL tester, http://www.pdngallery.com/cobrand/nikonnet/masters/sandro/qa.html comes back as alive.

I also reviewed @kaldari's list and everything but the FTPs came back as alive.

The first FTP returned a 530 Access Denied and the second returned, "Server denied you to change the given directory."

To add on, I can't connect to the first mentioned FTP, so that one is dead. As for the second, I'll have to investigate further.

I would say the false positive rate is currently 0.2%.

A 0.2% false positive rate seems well within reason. I think we can probably move forward with what we have.

As I mentioned earlier, I might be able to refine the class even more. I will investigate tomorrow.

@MusikAnimal: Yes, my list was from https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Cyberbot_II_5a/DB_Results. Cyberbot overwrote the list again recently. No idea why.

To add on, I can't connect to the first mentioned FTP, so that one is dead.

@Cyberpower678: It's live for me: ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/

http://www.parti-socialiste.fr/articles/second-tour-des-primaires-citoyennes-les-resultats returns a 404 and then redirects, so I would say it's a legitimately dead link rather than a false positive (even though the user still gets to the content). So that takes care of MusikAnimal's list.

We still need to figure out the 2 FTP false positives though:

  • ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/
  • ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc

Do these show up as alive or dead with the latest Cyberbot code? Both of them return live content for me in a browser.

For the first one, I get "Access denied: 530" and the class returns true.
For the second one, I get "Server denied you to change to the given directory" and the class returns true.

Curl always returns those errors. I don't think we can do anything.

After discussion over IRC, it seems like the two possibilities for avoiding similar FTP false positives are:

  • Exclude all FTP links from dead-link detection
  • Investigate the various FTP options for curl and see if we can improve its detection for cases similar to ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/ and ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc.
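As a starting point for the second option, here is a rough sketch of one curl setting that might be worth trying. It is not the change that was made to the class. It assumes PHP's curl extension is loaded, the function names `buildFtpCheckOptions` and `probeFtpLink` are hypothetical, and the 30-second timeout is an arbitrary value. `CURLOPT_FTP_FILEMETHOD` with `CURLFTPMETHOD_NOCWD` asks curl to request the full path without issuing CWD commands first, which may or may not avoid the "change to the given directory" error. It has not been tested against either URL above, and it would not help with a 530 login refusal.

```php
<?php
// Sketch only; requires PHP's curl extension. Function names are hypothetical.

// Build curl options for probing an FTP URL without changing directory first.
function buildFtpCheckOptions( $url ) {
    return array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        // Do not download the file body; only check that it can be reached.
        CURLOPT_NOBODY => true,
        // Arbitrary timeout in seconds, chosen for illustration.
        CURLOPT_TIMEOUT => 30,
        // Skip CWD and address the file by its full path.
        CURLOPT_FTP_FILEMETHOD => CURLFTPMETHOD_NOCWD,
    );
}

// Run the probe and return curl's error number and message (0 and '' on success).
function probeFtpLink( $url ) {
    $ch = curl_init();
    curl_setopt_array( $ch, buildFtpCheckOptions( $url ) );
    curl_exec( $ch );
    return array(
        'errno' => curl_errno( $ch ),
        'error' => curl_error( $ch ),
    );
}
```

Calling `probeFtpLink( 'ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc' )` and comparing the returned `error` with the messages quoted above would show whether the option makes any difference.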

ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc now marked as alive, after a few modifications of the class.

I'd say the class is good to go for final approval.

@Cyberpower678: I don't see the changes to the class, can you link to the commit?

I didn't commit them yet.

I'm committing the changes along with some more updates to IABot, which I can hopefully do by the end of the day today.

@Cyberpower678: Any update?

I'm going to deploy an update before getting started with the trial. The update will have better integration with other archiving services, with the exception of archive.is so far. I can't seem to find an API there.

@Cyberpower678: I meant any update on the commit for fixing the FTP issues?

I committed it last night before going to bed, along with an important update.

kaldari closed this task as Resolved.Jul 7 2016, 12:27 AM

I think we can call this done now :)