Enhance checkIfDead class to detect when URLs redirect to the domain root
Closed, Resolved | Public | 2 Estimated Story Points

Description

Per T125181, add new code to ...
https://github.com/cyberpower678/Cyberbot_II/blob/master/IABot/checkIfDead.php
... to detect when a URL redirects to the domain root, and if so, consider it dead.

Event Timeline

DannyH triaged this task as Medium priority. Feb 23 2016, 6:16 PM
DannyH set the point value for this task to 2.
DannyH edited projects, added Community-Tech-Sprint; removed Community-Tech.

Here are two URLs we should successfully detect as dead:
http://www.copart.co.uk/c2/specialSearch.html?_eventId=getLot&execution=e1s2&lotId=10543580
http://forums.lavag.org/Industrial-EtherNet-EtherNet-IP-t9041.html

Note that the first one actually redirects to the domain without the subdomain.
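As a rough illustration of the check being asked for (not the code in checkIfDead.php, and written in Python rather than PHP), the sketch below compares the requested URL with the URL reached after following redirects. The function name `redirects_to_root` is hypothetical, and the final URLs in the usage note are assumed for illustration rather than observed.

```python
from urllib.parse import urlsplit


def redirects_to_root(original_url: str, final_url: str) -> bool:
    """Return True if a non-root URL ended up at a bare domain root.

    Assumption: any landing page with an empty path (or "/") and no
    query string counts as "the domain root", whatever the host is.
    """
    orig = urlsplit(original_url)
    final = urlsplit(final_url)

    # A URL is treated as a root URL when it has no path and no query.
    orig_is_root = orig.path in ("", "/") and not orig.query
    final_is_root = final.path in ("", "/") and not final.query

    # Only flag the case where a deep link collapsed to the root.
    return final_is_root and not orig_is_root
```

For example, `redirects_to_root("http://forums.lavag.org/Industrial-EtherNet-EtherNet-IP-t9041.html", "http://forums.lavag.org/")` would flag the link as dead, while a request for `http://example.org/` that lands on `http://www.example.org/` would not be flagged, because the original was already a root URL. The final URL itself would have to come from whatever HTTP client follows the redirects.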

It turns out there really is no way in PHP to find out whether we are on a subdomain or not. The best we can do is explode the host part of the URL on ".". To get the root domain we can take the last two parts of the host, but this fails for cases like www.copart.co.uk, where it gives "co.uk", which is not the root domain. For "en.wikipedia.org", on the other hand, the same approach correctly gives "wikipedia.org". There is no obvious way to tell the two cases apart.
This leads to quite complicated and fairly unreliable code: https://github.com/Niharika29/Cyberbot_II/commit/be20420911c0dc7990725874c0b030dcc2e1a41c
I'm not sure of a better way to detect redirects like the forums.lavag.org one above. Ideas?
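To make the ambiguity concrete, here is a small Python sketch (an illustration only, not the code in the linked commit). `naive_root` reproduces the "last two parts of the host" idea; `is_same_or_parent_host` is a hypothetical alternative that avoids computing a root at all by asking whether the redirect target's host is the original host or a parent of it. Both function names are made up for this example.

```python
def naive_root(host: str) -> str:
    """Take the last two dot-separated labels of a host name."""
    return ".".join(host.split(".")[-2:])


def is_same_or_parent_host(original_host: str, final_host: str) -> bool:
    """Return True if final_host equals original_host or is a parent of it.

    Assumption: a redirect from www.example.co.uk to example.co.uk should
    count as staying on the same site, without knowing where the
    registrable domain starts.
    """
    original = original_host.lower().rstrip(".")
    final = final_host.lower().rstrip(".")
    # Match only on a label boundary, so "notcopart.co.uk" is not
    # treated as a child of "copart.co.uk".
    return original == final or original.endswith("." + final)
```

With these definitions, `naive_root("en.wikipedia.org")` gives `"wikipedia.org"` but `naive_root("www.copart.co.uk")` gives `"co.uk"`, which is the failure described above. `is_same_or_parent_host("www.copart.co.uk", "copart.co.uk")` returns `True` without needing to know that "co.uk" is a two-label suffix.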

kaldari raised the priority of this task from Medium to High. Mar 9 2016, 5:39 PM

Updated. https://github.com/Niharika29/Cyberbot_II/commit/4676a0ee1a2f8ee0eeb4d2644104d40aabce8a99
This seems pretty reliable. I had to comment out the test for the copart.co.uk URL because that website has been unreachable since yesterday. We should find another URL that redirects to the root domain.