Page MenuHomePhabricator

Wayback Machine: Review cyberbot code
Closed, ResolvedPublic5 Estimated Story Points


Review the cyberbot code.

Event Timeline

DannyH raised the priority of this task from to Needs Triage.
DannyH updated the task description. (Show Details)
DannyH moved this task to Ready on the Community-Tech-Sprint board.
DannyH subscribed.

Here are some suggestions for improving the Cyberbot II code:

Execution code should be separated from business logic code so that the business logic is unit-testable (among other reasons). (Right now loading deadlink.php also executes the bot.) Probably the easiest way to do this is to create a separate file for the bot execution code (deadlinkbot.php?) that includes the file for the logic code. It would be good to put the logic code in a class at the same time.

There’s a lot of redundant curl code. It would be better to have a separate class for handling the HTTP connections and instantiate that class as needed.

The config values at the beginning ($LINK_SCAN, $DEAD_ONLY, etc.) are sometimes strings (“1”) and sometimes integers. Type juggling should be avoided if possible as it often causes bugs.

There are a few places in the code where it does a foreach loop on $pages without checking that $pages is an array (it can be false).

It shouldn't use PHP closing tags ("?>") unless actually switching to HTML output (in order to avoid premature header bugs).

Some variable names are vague like $returnArray, $return, and $t. Variable names should describe their contents specifically and verbosely if possible to improve code readability. Also, there is at least one case where $returnArray is not actually the array returned by the function (getAllArticles()), which is confusing.

It’s helpful (especially in open source projects) for all functions to have documentation unless they are completely obvious (such as a getter or setter). PHPDoc or Doxygen documentation styles are good standards to follow. See for some examples.

It would also be nice if the dead links code were moved to a separate repo so that it was more modular and reusable.

Additional task:
4 to 6 times a year a script shout test the original URL for http status code. If there is a new redirect location, the original URL should archive once again.

@Boshomi: Not sure I follow your comment. Are you saying that for a citation that already has an archiveurl value, if the original URL changes http status code, the URL should once again be archived? And if, so should the archiveurl value be updated as well? That seems a bit risky to me (if I understand correctly), as dead URLs are often reborn as redirects to generic information pages or link farms.

@kaldari It is generally better to replace a dead link with an alternative URL from the same domain. To replace dead URLs with webarchiv-URLs is too easy, because in real world the webadmins don't like this, and the reacts by adjusting rotbots.txt. The consequences are the lost of millions of useful archived websites, because Internet Archive will delete this sites if there is a robots.txt rule.

a possible workflow:

  1. User ad $URL1 to a article.
  2. a Bot save the content by sending the $URL1 to$URL1
  3. the Bot save the Metadata of $URL1

The webadmin makes a redirect from $URL1 to $URL1old; $URL1 get 301 status code

  1. Little time after that a Bot checks the status code of $URL1 and get 301 with locator $URL1old
  2. At this time we should Upload$URL1 once again, and Internet Archive will save the location $URL1old
  3. if $URL1old is still alive, a user can replace $URL1 with $URL1old in the article

Does this still need to be open. The code is under constant review.