Description
Status | Assigned | Task
---|---|---
Resolved | Cyberpower678 | T120433 Migrate dead external links to archives
Resolved | Cyberpower678 | T132606 Advanced deadlink detection for Cyberbot (tracking)
Resolved | MusikAnimal | T136728 Investigate causes for deadlinks found after trial
Event Timeline
Alright, here's my assessment of the identified false positives, skipping over the paywall sites.
False false positives:
- The Times of India links (example) – appear to redirect to the home page.
- http://dc.the-netherlands.org/ – redirects to home page
Can't repro the dead link behaviour:
- http://121.52.157.166/office/research/Journal/B_J_2006/1_Biodiversity%20of%20saurischian%20dinosaurs%20from%20the%20latest%20cretaceous%20park%20of%20pakistan.pdf
- http://avibase.bsc-eoc.org/checklist.jsp?lang=EN&region=ve&list=clements
- http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=urb_lpop1&lang=en
- http://batterys.over-blog.com/article-wal-mart-stores-inc-85031463.html
- http://beltmag.com/secret-history-chief-wahoo/
- http://bioguide.congress.gov/scripts/biodisplay.pl?index=C001030
- http://dallas.bizjournals.com/
- http://blogs.chron.com/sciguy/archives/2008/09/post_39.html
- http://dallas.bizjournals.com/dallas/stories/2003/09/15/daily27.html
- http://dx.doi.org/10.1021/ja036138+
- http://ec.europa.eu/energy/en/topics/renewable-energy/renewable-energy-directive/cooperation-mechanisms
- http://economictimes.indiatimes.com/news/defence/here-is-why-apache-and-chinook-helicopters-are-game-changers-for-india/articleshow/49067786.cms
- http://encyclopedia.densho.org/Civilian%20exclusion%20orders/
- http://english.peopledaily.com.cn/200309/21/eng20030921_124638.shtml
- http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31996R1107:EN:HTML
- http://frontiersman.com/articles/2008/09/02/opinion/columnists/doc48bcf03a0e877323781165.txt
- http://gc.kls2.com/airport/KRHP
- http://geonames.usgs.gov/pls/gazpublic/getgooglemap?p_lat=42.6178099&p_longi=-83.4135489&fid=626335
- http://green.autoblog.com/2009/09/04/toyota-tops-2-million-hybrid-sales-worldwide/
- http://gulfnews.com/news/gulf/uae/environment/cloud-seeding-experiment-has-thundering-success-1.104086
In my testing, the bot code is reporting these links as being alive, which is correct. I might be doing this wrong. Here's how I tested it:
    $cid = new checkIfDead();
    $u = "https://website.com";
    var_dump( $cid->checkDeadlinks( array( $u ) ) );
    // also tried:
    var_dump( $cid->checkDeadlinks( array( $u ), true ) );
For these I got:
    array(2) {
      ["results"]=>
      array(1) {
        [0]=>
        bool(false)
      }
      ["errors"]=>
      array(1) {
        [0]=>
        string(0) ""
      }
    }
which indicates the link is alive, so it shouldn't have been reported at Cyberbot II 5a/DB Results.
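When testing many URLs at once, the raw dump gets hard to read. Below is a minimal sketch of a helper that labels each URL, assuming the return shape shown in the dump above (`results` and `errors` indexed in input order, with `true` meaning dead). `summarizeResults` is a hypothetical name and is not part of checkIfDead.

```php
<?php
// Hypothetical helper, not part of checkIfDead. It assumes the return shape
// shown in the dump above: results[i] === true means "dead", and errors[i]
// holds the error text (empty string when there was none).
function summarizeResults(array $urls, array $response): array
{
    $summary = [];
    foreach (array_values($urls) as $i => $url) {
        $dead  = $response['results'][$i] ?? null;
        $error = $response['errors'][$i] ?? '';
        if ($dead === null) {
            $state = 'unknown';   // no result was returned for this URL
        } else {
            $state = $dead ? 'dead' : 'alive';
        }
        $summary[$url] = $error === '' ? $state : "$state ($error)";
    }
    return $summary;
}

// Sample data copied from the dump above; no network request is made here.
$response = ['results' => [false], 'errors' => ['']];
print_r(summarizeResults(['https://website.com'], $response));
```

In real use, `$response` would be whatever `$cid->checkDeadlinks( $urls )` returned for the same `$urls` array.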
Questionable behaviour:
- http://content.onlinejacc.org/cgi/content/full/41/9/1633 – the code immediately reports the link as alive (which is correct), but the site actually redirects at least 3 times. I don't think the code is following all of the redirects.
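One way to check whether every redirect is being followed is to look at the status line of each hop. When PHP's curl runs with both `CURLOPT_HEADER` and `CURLOPT_FOLLOWLOCATION` enabled, the returned output contains one header block per response, so the chain can be read back out of it. The sketch below only does the parsing. `extractStatusChain` is a hypothetical helper, the header dump is made up for illustration, and whether checkIfDead sets these options is not confirmed here.

```php
<?php
// Hypothetical helper: extracts the status code of every response found in a
// raw header dump, in the order the responses were received.
function extractStatusChain(string $rawHeaders): array
{
    // Matches status lines such as "HTTP/1.1 301 Moved Permanently" or "HTTP/2 200".
    preg_match_all('/^HTTP\/\d(?:\.\d)?\s+(\d{3})/m', $rawHeaders, $matches);
    return array_map('intval', $matches[1]);
}

// Illustrative header dump with made-up hosts, shaped like curl output when
// headers are included and redirects are followed.
$raw = "HTTP/1.1 301 Moved Permanently\r\nLocation: http://example.org/a\r\n\r\n"
     . "HTTP/1.1 302 Found\r\nLocation: http://example.org/b\r\n\r\n"
     . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";

$chain = extractStatusChain($raw);
echo implode(' -> ', $chain), PHP_EOL;          // 301 -> 302 -> 200
echo 'redirects: ', count($chain) - 1, PHP_EOL; // redirects: 2
```

If the chain for a real URL ends on a 3xx code, the request stopped before reaching the final page, for example because a redirect limit was hit.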
Legitimate false positives
Returning a 500:
- http://archive.metropolis.co.jp/tokyo/625/music_beat.asp
- http://archives.lse.ac.uk/TreeBrowse.aspx?src=CalmView.Catalog&field=RefNo&key=PHILLIPS
Returning a 404:
- http://bigstory.ap.org/article/2nd-chinese-university-starts-rhodes-style-program (and other bigstory.ap.org links)
Returning a 301 (fixed by pull #18):
Returning a 200 (all fixed by pull #18):
- http://data.worldjusticeproject.org/#/index/VEN
- http://flysunairexpress.com/#about
- http://guster.com/#news (this one is probably desirable as being dead, since the News page will change)
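All three of these URLs carry a fragment (the part after `#`). Fragments are handled by the browser and are never sent to the server, so they cannot affect the HTTP status. One possible way to keep them from confusing a check is to drop the fragment before making the request. This is a sketch of that idea only; `stripFragment` is a hypothetical helper, and it is not necessarily what pull #18 does.

```php
<?php
// Hypothetical helper: returns the URL without its fragment. The part after
// "#" is never sent to the server, so removing it does not change which
// resource is requested.
function stripFragment(string $url): string
{
    $pos = strpos($url, '#');
    return $pos === false ? $url : substr($url, 0, $pos);
}

echo stripFragment('http://data.worldjusticeproject.org/#/index/VEN'), PHP_EOL;
// http://data.worldjusticeproject.org/
echo stripFragment('http://dallas.bizjournals.com/'), PHP_EOL;
// http://dallas.bizjournals.com/
```

A check built this way can only say that the page itself is reachable, not that the section named by the fragment still exists, which matters for cases like the guster.com News link.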
@Cyberpower678 Regarding the "Can't repro" links, are we sure we're using the latest version of checkIfDead?
My next step is to further analyze the "legitimate false positives".
The first link in that list of yours was last checked on 2016-04-23 03:49:09. It currently has a live state of 1, which Cyberbot still considers alive.
They were all checked on the same day, during the first trial. Cyberbot never got around to rechecking them, since the second trial used only 2 workers instead of the current 28.
This PR fixes four of the aforementioned "legitimate false positives": https://github.com/cyberpower678/Cyberbot_II/pull/18/files
Some more false positives from the new list (I've only gone through the first 100):
- http://www.gutenberg.org/files/39000/39000-h/39000-h.htm
- http://www.halakhah.com/nazir/nazir_23.html
- http://www.halakhah.com/nazir/nazir_23.html#PARTb
- http://www.halakhah.com/niddah/niddah_31.html
- http://www.halakhah.com/sanhedrin/sanhedrin_98.html#PARTb
- http://www.halakhah.com/yebamoth/yebamoth_65.html#PARTb
- ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/
- ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc
@MusikAnimal: It would be good to check the false positives above before they get too stale.
@kaldari From your comment above, are you sure you were looking at the right list? I don't see those URLs at https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Cyberbot_II_5a/DB_Results
I went through not only each domain, but each sub-path of each domain, and here are the false positives:
- http://www.parti-socialiste.fr/articles/second-tour-des-primaires-citoyennes-les-resultats (this one redirects to archives.parti-socialiste.fr, and appears to be the info we want)
- http://www.pdngallery.com/cobrand/nikonnet/masters/sandro/qa.html
- http://www.peeep.us/8a95f527 (not sure)
- http://www.perfessorbill.com/pbmusic_joplin2.shtml
- http://www.perfessorbill.com/pbmusic_prerag.shtml
Out of 1,000 I'd say this is pretty damn good :)
I was actually starting to look through myself, but I guess I don't need to anymore. :p
Feeding the links through the URL tester, http://www.pdngallery.com/cobrand/nikonnet/masters/sandro/qa.html comes back as alive
So do
- http://www.peeep.us/8a95f527
- http://www.perfessorbill.com/pbmusic_joplin2.shtml
- http://www.perfessorbill.com/pbmusic_prerag.shtml
I'll have to investigate why they showed up on the list.
I also reviewed @kaldari's list and everything but the FTPs came back as alive.
The first FTP returned a 530 Access Denied and the second returned, "Server denied you to change to the given directory."
To add on, I can't connect to the first mentioned FTP, so that one is dead. As for the second, I'll have to investigate further.
A 0.2% false positive rate seems well within reason. I think we can probably move forward with what we have.
As I mentioned earlier, I might be able to refine the class even more. I will investigate tomorrow.
@MusikAnimal: Yes, my list was from https://en.wikipedia.org/wiki/Wikipedia:Bots/Requests_for_approval/Cyberbot_II_5a/DB_Results. Cyberbot overwrote the list again recently. No idea why.
> To add on, I can't connect to the first mentioned FTP, so that one is dead.
@Cyberpower678: It's live for me: ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/
http://www.parti-socialiste.fr/articles/second-tour-des-primaires-citoyennes-les-resultats returns a 404 and then redirects, so I would say it's a legitimately dead link rather than a false positive (even though the user still gets to the content). So that takes care of MusikAnimal's list.
We still need to figure out the 2 FTP false positives though:
- ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/
- ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc
Do these show up as alive or dead with the latest Cyberbot code? Both of them return live content for me in a browser.
For the first one, I get Access denied: 530 and the class returns true.
For the second one, I get Server denied you to change to the given directory and the class returns true.
After discussion over IRC, it seems like the two possibilities for avoiding similar FTP false positives are:
- Exclude all FTP links from dead-link detection
- Investigate the various FTP options for curl and see if we can improve its detection for cases similar to ftp://ftp.aip.org/epaps/phys_rev_lett/E-PRLTAO-98-047705/ and ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc.
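The first option could be as small as a pre-filter that runs before the URLs reach the checker. The sketch below assumes that filtering on the URL scheme is enough. `excludeFtpLinks` is a hypothetical name, and treating `ftps` the same as `ftp` is an assumption made for this example.

```php
<?php
// Hypothetical pre-filter for the first option: drops FTP links so they are
// never handed to the dead-link check. All other URLs pass through unchanged.
function excludeFtpLinks(array $urls): array
{
    return array_values(array_filter($urls, function (string $url): bool {
        // parse_url() returns null or false when no scheme can be found;
        // the cast turns both into an empty string.
        $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
        return !in_array($scheme, ['ftp', 'ftps'], true);
    }));
}

$urls = [
    'ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc',
    'http://www.gutenberg.org/files/39000/39000-h/39000-h.htm',
];
print_r(excludeFtpLinks($urls));
```

The trade-off is that FTP links that really are dead would never be detected, which is why the second option is worth investigating first.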
ftp://ftp.rsa.com/pub/pkcs/ascii/layman.asc now marked as alive, after a few modifications of the class.
I'm bundling the changes with some more updates to IABot, which I can hopefully commit by the end of the day today.
I'm going to deploy an update before getting started with the trial. The update will have better integration with other archiving services, with the exception of archive.is so far. I can't seem to find an API there.