Improve external link detection regex in Cyberbot II
Closed, Resolved · Public · 5 Story Points

Description

The current Cyberbot II code includes a regex for detecting external links. However, this regex apparently does not detect all external links in articles. We should try to improve the regex so that it is more robust.

The current regex can be seen in the getExternalLinks() function at https://github.com/cyberpower678/Cyberbot_II/blob/master/deadlink.php.

kaldari created this task. Jan 27 2016, 1:44 AM
kaldari added a project: Community-Tech.
kaldari added subscribers: kaldari, Cyberpower678.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 27 2016, 1:44 AM
kaldari edited the task description. Jan 27 2016, 1:45 AM
kaldari set Security to None.
Cyberpower678 claimed this task.

Sad but true.

kaldari edited the task description. Jan 27 2016, 1:48 AM

@Cyberpower678: I removed the part about writing unit tests for the getExternalLinks() function, as currently that function can't be loaded by PHPUnit without actually triggering the bot. Let me know once that function is in a separate file from the bot execution code and I'll create a Phabricator task for making some unit tests as well.
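The loading problem described above is common to scripts that define functions and run work at the top level of the same file. A minimal sketch of the separation follows, in Python rather than the bot's PHP. The names `get_external_links` and `run_bot`, and the simplified pattern, are hypothetical stand-ins and not the bot's code:

```python
import re

# Simplified, hypothetical pattern: bracketed external links only.
_LINK_RE = re.compile(r"\[((?:https?:)?//[^\s\]]+)[^\]]*\]", re.I)


def get_external_links(wikitext):
    """Return the URLs of bracketed external links found in wikitext."""
    return _LINK_RE.findall(wikitext)


def run_bot():
    """Stand-in for the bot's execution code; a real bot would fetch and edit pages."""
    page = "See [http://example.com Example] and [//example.org/a b]."
    for url in get_external_links(page):
        print(url)


# Runs only when the file is executed directly, so a test runner
# can import get_external_links without starting the bot.
if __name__ == "__main__":
    run_bot()
```

In PHP the same effect would presumably come from moving the function into its own file that both the bot script and the PHPUnit tests load with `require_once`.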

Also, if you find any specific cases where external links aren't being detected, please post them to this task.

Here is the fully expanded regex for reference:

/(<ref([^\/]*?)>(.*?)((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\}.*?)?<\/ref>((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\})*|\[{1}((?:https?:)?\/\/.*?)\s?.*?\]{1}.*?((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\}\s*?)*?|((\{\{cite web|\{\{cite AV media|\{\{cite AV media notes|\{\{cite book|\{\{cite conference|\{\{cite DVD notes|\{\{cite encyclopedia|\{\{cite episode|\{\{cite interview|\{\{cite journal|\{\{cite mailing list|\{\{cite map|\{\{cite news|\{\{cite newsgroup|\{\{cite podcast|\{\{cite press release|\{\{cite report|\{\{cite serial|\{\{cite sign|\{\{cite speech|\{\{cite techreport|\{\{cite thesis|\{\{cite Hansard|\{\{vcite2 journal\{\{vancite book|\{\{vancite conference|\{\{vancite journal|\{\{vancite news|\{\{vancite web|\{\{vcite book|\{\{vcite conference|\{\{vcite journal|\{\{vcite news|\{\{vcite web|\{\{cite|\{\{citation).*?\}\})\s*?((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\}\s*?)*?)/i
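
Two parts of the expanded regex can be probed in isolation. The sketch below is Python rather than the bot's PHP, and assumes Python's `re` module treats these lazy quantifiers and alternations the same way PCRE does. `BRACKET_BRANCH` and `CITE_FRAGMENT` are fragments copied from the regex above. `STRICTER_BRACKET` is a hypothetical alternative for comparison, not something proposed in this task:

```python
import re

# Bracketed-link branch copied from the expanded regex
# (the trailing dead-link template part is omitted).
BRACKET_BRANCH = r"\[{1}((?:https?:)?\/\/.*?)\s?.*?\]{1}"

# Hypothetical stricter variant: the URL runs up to the first
# whitespace character or closing bracket.
STRICTER_BRACKET = r"\[((?:https?:)?\/\/[^\s\]]+)[^\]]*\]"

sample = "[http://example.com/page Example page]"

# The lazy .*? inside the capture group is followed by an optional \s?
# and another lazy .*?, so the group can stop right after "//".
original_url = re.search(BRACKET_BRANCH, sample, re.I).group(1)
stricter_url = re.search(STRICTER_BRACKET, sample, re.I).group(1)

# Fragment of the citation alternation as posted: there is no "|"
# between "vcite2 journal" and "{{vancite book".
CITE_FRAGMENT = (
    r"(\{\{cite Hansard|\{\{vcite2 journal\{\{vancite book"
    r"|\{\{vancite conference)"
)
cite_re = re.compile(CITE_FRAGMENT, re.I)

print(repr(original_url))   # 'http://'
print(repr(stricter_url))   # 'http://example.com/page'
print(bool(cite_re.match("{{vancite book |title=X}}")))        # False
print(bool(cite_re.match("{{vancite conference |title=X}}")))  # True
```

If the bot relies on that capture group for the URL, the first result could explain truncated or missed links; whether it does is not stated in this task. The second check suggests that `{{vancite book` and `{{vcite2 journal` are not matched by their own alternatives as the regex is written.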

kaldari edited a custom field. Jan 29 2016, 6:47 PM
kaldari triaged this task as "Normal" priority. Jan 29 2016, 6:51 PM
kaldari moved this task from Sprint planning/estimation to Backlog on the Community-Tech board.
Cyberpower678 closed this task as "Resolved". Feb 9 2016, 11:40 PM
DannyH moved this task from Backlog to Archive on the Community-Tech board. Feb 17 2016, 1:40 AM