Create an algorithm to accurately determine if a webpage is dead
Closed, ResolvedPublic5 Estimated Story Points

Description

One of the big tasks Cyberbot will be doing is determining on its own when a page is dead. This would be particularly useful for links that aren't tagged but are in fact dead.

We need a PHP function that, given a page URL, returns true or false indicating whether the page is alive.

For this initial version, it should only return false if the page returns a 4XX or 5XX HTTP error code or redirects to a URL with such an error code. (Assume that curl is available.) Also write at least 2 unit tests, one for a dead link and one for a live link. (I would suggest https://en.wikipedia.org/nothing and https://en.wikipedia.org/wiki/Main_Page.)

Must be written in PHP so it is compatible with https://github.com/cyberpower678/Cyberbot_II/blob/master/deadlink.php. Once the function is written, submit it as a pull request to the Cyberbot II code on GitHub.
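
A minimal sketch of what such a function could look like, assuming curl is available (the function name and the exact curl options are illustrative, not taken from the Cyberbot codebase):

```php
<?php
// Returns true if the page appears alive, false if the final response
// (after following redirects) is a 4XX or 5XX error code.
function isLinkAlive( $url ) {
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );  // don't echo the response
	curl_setopt( $ch, CURLOPT_NOBODY, true );          // a HEAD-style request is enough here
	curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );  // follow redirects to the final URL
	curl_setopt( $ch, CURLOPT_TIMEOUT, 30 );           // don't hang on unresponsive hosts
	curl_exec( $ch );
	$code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
	curl_close( $ch );
	return !( $code >= 400 && $code < 600 );
}
```

The two unit tests could then assert that this returns true for https://en.wikipedia.org/wiki/Main_Page and false for https://en.wikipedia.org/nothing. (Some servers reject HEAD requests with a 405, so a production version might need to fall back to a GET.)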

Event Timeline

Cyberpower678 claimed this task.
Cyberpower678 raised the priority of this task from to High.
Cyberpower678 updated the task description.
kaldari updated the task description.
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari edited a custom field.
kaldari renamed this task from "Create an algorithm to accurately determine if a webpage is dead." to "Create an algorithm to accurately determine if a webpage is dead". Jan 12 2016, 6:01 PM
kaldari updated the task description.
kaldari edited a custom field.
This comment was removed by kaldari.

Initial code at https://github.com/Niharika29/Cyberbot_II

Once this is complete, it would be good to create a script that does several tests over a week or so and records the results in a table in order to verify that a link is really dead and not just temporarily unavailable.

Fhocutt edited a custom field.

When I complete the database integration, it will store a value between 0 and 3 representing the dead state of the link, where 0 means dead. When the function returns false, the value is decreased by one; when it returns true, it resets to 3. Once the value reaches 0, the check is skipped altogether.
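
A rough sketch of that counter logic, with made-up names (the real database fields and helper names may differ):

```php
// $liveState is the stored 0-3 value for a link; $isAlive is the result of the latest check.
function updateLiveState( $liveState, $isAlive ) {
	if ( $liveState === 0 ) {
		return 0;               // already confirmed dead: the check is skipped
	}
	if ( $isAlive ) {
		return 3;               // a successful check resets the counter
	}
	return $liveState - 1;      // failed check: one step closer to dead (0)
}
```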

That said, if the script could determine whether a page is only temporarily down, that would be great and would avoid the need to recheck the link 3 times to confirm it's dead.

Code reviewed inline on GitHub. Works fine locally.

@NiharikaKohli: Moving the files is almost certainly going to cause merge conflicts. Can you refactor this commit to only add the new class and test and not move any of the existing files? Re-arranging the files should be handled in the master repo ideally.

Here you can find some Tcl code from @Giftpflanze used by the dewiki dead link detection project: https://tools.wmflabs.org/giftbot

Note: Giftpflanze has a lot of experience; this is the second run of the bot, after the one in 2012. Work on the current bot run started in 2013: https://de.wikipedia.org/wiki/Wikipedia:Lua/Werkstatt/Defekter_Weblink_Bot

Nevertheless, there are a lot of false positives; see the history of the error notice page: https://de.wikipedia.org/w/index.php?title=Wikipedia:Defekte_Weblinks/Bot2015-Problem&action=history

Thank you. We can use all the help we can get to develop this algorithm. Once Cyberbot is completed for enwiki, it will be adapted to run on other wikis. I have already structured the code so Cyberbot can easily call the proper wiki-specific functions for other wikis.

Especially interesting portions are:
dwllib.tcl, proc check_curl*: curl -gLksm200 -A $user_agent -w %{http_code} -o /dev/null $url
List of good status codes: 200 201 202 206 226 229 301 302 304 401 412 503 507 999 (and the next 4 lines)
The file "exceptions" contains Tcl [string match] URL patterns, mostly for sites that don't want to be tested/crawled (they may have a Tool Labs block). This means you cannot determine whether those links are dead or not. (I throttle to 1 request per second; there may be other methods that prevent blocking.)
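
For comparison with the curl call and status-code list above, a roughly equivalent check in PHP (the whitelist is copied from the giftbot list; everything else is an illustrative sketch):

```php
// Treat a link as alive if its final HTTP status code (after following
// redirects, as with curl -L) is on the "good" list used by the dewiki bot.
function isGoodStatusCode( $url ) {
	$goodCodes = array( 200, 201, 202, 206, 226, 229, 301, 302, 304, 401, 412, 503, 507, 999 );
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );   // like curl -o /dev/null
	curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );   // curl -L
	curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );  // curl -k
	curl_setopt( $ch, CURLOPT_TIMEOUT, 200 );           // curl -m200
	curl_setopt( $ch, CURLOPT_USERAGENT, 'ExampleBot/1.0' );  // curl -A $user_agent (placeholder UA)
	curl_exec( $ch );
	$code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
	curl_close( $ch );
	return in_array( $code, $goodCodes );
}
```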

The algorithm in dewiki works like this:
Test 3 times at 14-day intervals (I increased that to 5 times to be sure).
If a link is dead 3 (5) times, it is considered dead.

A suggestion when checking piped URLs (i.e. [URL text] format) and citation templates is to search the webpage returned by the URL for the content of "text" and the citation template fields. Working links will usually display some of that material; dead links usually don't.
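
A hedged sketch of that idea: fetch the page body and search it for the piped link's display text (the function and parameter names are made up, and real code would also need to handle character encoding and the individual citation-template fields):

```php
function pageContainsText( $url, $displayText ) {
	$ch = curl_init( $url );
	curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
	curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt( $ch, CURLOPT_TIMEOUT, 30 );
	$body = curl_exec( $ch );
	curl_close( $ch );
	// Case-insensitive search; a miss is only a hint, not proof that the link is dead.
	return is_string( $body ) && stripos( $body, $displayText ) !== false;
}
```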