Wayback Machine: Review cyberbot code
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	• DannyH
	Dec 22 2015, 6:14 PM

Description

Review the cyberbot code.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Cyberpower678	T120433 Migrate dead external links to archives
		Resolved		kaldari	T122227 Wayback Machine: Review cyberbot code

Event Timeline

• DannyH created this task. Dec 22 2015, 6:14 PM

• DannyH raised the priority of this task from to Needs Triage.

• DannyH updated the task description.

• DannyH added projects: Community-Tech-Sprint, Community-Wishlist-Survey-2015, Community-Tech-fixes.

• DannyH moved this task to Ready on the Community-Tech-Sprint board.

• DannyH subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 22 2015, 6:14 PM

• DannyH triaged this task as Medium priority. Dec 22 2015, 6:14 PM

• DannyH moved this task from Wishlist 51-on to In Analysis on the Community-Wishlist-Survey-2015 board.

• DannyH added a parent task: T120850: Investigation: Migrate dead external links to archives. Dec 22 2015, 6:32 PM

• DannyH added a parent task: T120433: Migrate dead external links to archives.

• DannyH removed a project: Community-Tech-fixes. Dec 22 2015, 8:05 PM

• DannyH set Security to None.

kaldari claimed this task. Dec 31 2015, 2:00 AM

kaldari moved this task from Ready to In Development on the Community-Tech-Sprint board.

• DannyH edited a custom field. Jan 11 2016, 9:33 PM

kaldari added a subscriber: Cyberpower678. Jan 19 2016, 10:32 PM

Restricted Application added a subscriber: JEumerus. Jan 19 2016, 10:32 PM

Here are some suggestions for improving the Cyberbot II code:

Execution code should be separated from business logic code so that the business logic is unit-testable (among other reasons). (Right now loading deadlink.php also executes the bot.) Probably the easiest way to do this is to create a separate file for the bot execution code (deadlinkbot.php?) that includes the file for the logic code. It would be good to put the logic code in a class at the same time.

There’s a lot of redundant curl code. It would be better to have a separate class for handling the HTTP connections and instantiate that class as needed.

The config values at the beginning ($LINK_SCAN, $DEAD_ONLY, etc.) are sometimes strings (“1”) and sometimes integers. Type juggling should be avoided if possible as it often causes bugs.

There are a few places in the code where it does a foreach loop on $pages without checking that $pages is an array (it can be false).

It shouldn't use PHP closing tags ("?>") unless actually switching to HTML output. Any whitespace after a closing tag is sent as output, which can cause premature header ("headers already sent") bugs.

Some variable names are vague, such as $returnArray, $return, and $t. Variable names should describe their contents specifically, and verbosely if necessary, to improve code readability. Also, there is at least one case (getAllArticles()) where $returnArray is not actually the array returned by the function, which is confusing.

It’s helpful (especially in open source projects) for all functions to have documentation unless they are completely obvious (such as a getter or setter). PHPDoc or Doxygen documentation styles are good standards to follow. See https://www.mediawiki.org/wiki/Manual:Coding_conventions/PHP#Comments_and_documentation for some examples.

It would also be nice if the dead links code were moved to a separate repo so that it was more modular and reusable.
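
The first few suggestions above (separating execution from logic, one place for HTTP handling, strict config types, and guarding loops against a false value) can be illustrated together. This is only a rough sketch, written in Python rather than the bot's PHP so that it stays short and testable. Every name in it (`parse_flag`, `BotConfig`, `DeadLinkChecker`, `find_dead_links`) is hypothetical and not taken from the Cyberbot II code.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


def parse_flag(value: Union[str, int, bool]) -> bool:
    """Normalise a config value such as "1", 1 or True to a real bool, once."""
    if isinstance(value, bool):
        return value
    if isinstance(value, int):
        return value != 0
    return value.strip().lower() in ("1", "true", "yes")


@dataclass(frozen=True)
class BotConfig:
    link_scan: bool  # hypothetical counterpart of $LINK_SCAN
    dead_only: bool  # hypothetical counterpart of $DEAD_ONLY


class DeadLinkChecker:
    """Business logic only: no network access and no side effects on import."""

    def __init__(self, config: BotConfig, fetch_status: Callable[[str], int]):
        self.config = config
        # All HTTP access goes through this one injected callable,
        # so a unit test can pass a stub instead of a real client.
        self.fetch_status = fetch_status

    def find_dead_links(self, urls) -> List[str]:
        # The list may be False when an upstream query failed,
        # so check the type before looping over it.
        if not isinstance(urls, (list, tuple)):
            return []
        return [url for url in urls if self.fetch_status(url) >= 400]


def main() -> None:
    """Execution code: the only place that wires settings and I/O together."""
    config = BotConfig(link_scan=parse_flag("1"), dead_only=parse_flag(0))
    # Stubbed statuses stand in for a real HTTP client in this sketch.
    stub_statuses = {"http://example.org/gone": 404, "http://example.org/ok": 200}
    checker = DeadLinkChecker(config, lambda url: stub_statuses.get(url, 200))
    print(checker.find_dead_links(list(stub_statuses)))


if __name__ == "__main__":
    main()
```

Because importing the module runs nothing, a test can construct `DeadLinkChecker` with a stubbed `fetch_status` and exercise the logic without touching the network.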

kaldari closed this task as Resolved. Jan 19 2016, 10:42 PM

kaldari moved this task from In Development to Q3 2018-19 on the Community-Tech-Sprint board.

kaldari mentioned this in T123385: Do some clean-up on Cyberbot II's code. Jan 19 2016, 10:49 PM

Boshomi subscribed. Jan 19 2016, 11:39 PM

• DannyH moved this task from Q3 2018-19 to Q1 2018-19 on the Community-Tech-Sprint board. Jan 22 2016, 8:44 PM

Additional task:
4 to 6 times a year, a script should test the HTTP status code of the original URL. If there is a new redirect location, the original URL should be archived once again.

@Boshomi: Not sure I follow your comment. Are you saying that for a citation that already has an archiveurl value, if the original URL changes HTTP status code, the URL should once again be archived? And if so, should the archiveurl value be updated as well? That seems a bit risky to me (if I understand correctly), as dead URLs are often reborn as redirects to generic information pages or link farms.

@kaldari It is generally better to replace a dead link with an alternative URL from the same domain. Replacing dead URLs with web archive URLs is too easy, because in the real world webadmins don't like this, and they react by adjusting robots.txt. The consequence is the loss of millions of useful archived websites, because the Internet Archive will delete these sites if there is a robots.txt rule.

A possible workflow:

A user adds $URL1 to an article.
A bot saves the content by sending $URL1 to web.archive.org/save/$URL1
The bot saves the metadata of $URL1

~~
The webadmin makes a redirect from $URL1 to $URL1old; $URL1 gets a 301 status code
~~

A little time after that, a bot checks the status code of $URL1 and gets a 301 with location $URL1old
At this time we should submit web.archive.org/save/$URL1 once again, and the Internet Archive will save the location $URL1old
If $URL1old is still alive, a user can replace $URL1 with $URL1old in the article
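
The recheck step in this workflow could be sketched roughly as follows, in Python for illustration. The names `probe`, `recheck` and `save_url` are hypothetical. The `https://` prefix on the save address and the treatment of 302, 307 and 308 as redirects alongside 301 are assumptions that go beyond what the comment states. The sketch only decides whether a new save request is warranted; it does not edit any citation, which stays a manual step as in the last line of the workflow.

```python
import http.client
from typing import Callable, NamedTuple, Optional
from urllib.parse import urlsplit


class ProbeResult(NamedTuple):
    status: int
    location: Optional[str]  # value of the Location header, if any


def probe(url: str, timeout: float = 10.0) -> ProbeResult:
    """Send one HEAD request without following redirects."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return ProbeResult(response.status, response.getheader("Location"))
    finally:
        conn.close()


def save_url(url: str) -> str:
    """Build the save request named in the workflow (scheme is an assumption)."""
    return "https://web.archive.org/save/" + url


def recheck(url: str, last_location: Optional[str],
            probe_fn: Callable[[str], ProbeResult] = probe) -> Optional[str]:
    """Return the save URL to request if the redirect target changed, else None."""
    result = probe_fn(url)
    is_redirect = result.status in (301, 302, 307, 308)  # assumption: not only 301
    if is_redirect and result.location and result.location != last_location:
        return save_url(url)
    return None
```

Passing a stub as `probe_fn` lets the decision logic be tested without network access, and `last_location` stands in for the metadata the bot stored when the link was first archived.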

• DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint. Feb 2 2016, 7:56 PM

• DannyH moved this task from New & TBD Tickets to Archive on the Community-Tech board.

Reopening to review new code: https://github.com/cyberpower678/Cyberbot_II/commits/test-code.

kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech. Feb 16 2016, 6:37 PM

kaldari moved this task from Q1 2018-19 to Ready on the Community-Tech-Sprint board.

• DannyH removed a parent task: T120850: Investigation: Migrate dead external links to archives. Feb 23 2016, 10:21 PM

• DannyH added a project: Internet-Archive. Mar 1 2016, 11:18 PM

• DannyH added a subscriber: kaldari.

• DannyH edited projects, added Community-Tech; removed Community-Tech-Sprint. Mar 29 2016, 5:57 PM

• DannyH moved this task from Archive to Up Next (June 3-21) on the Community-Tech board.

Liuxinyu970226 subscribed. May 19 2016, 5:20 AM

Does this still need to be open? The code is under constant review.

Cyberpower678 closed this task as Resolved. May 31 2016, 6:55 PM

kaldari moved this task from Up Next (June 3-21) to Archive on the Community-Tech board. Jun 1 2016, 4:10 PM

Liuxinyu970226 unsubscribed. Jun 3 2016, 9:10 AM

Liuxinyu970226 awarded a token. Jan 13 2017, 1:56 AM
