
Improve external link detection regex in Cyberbot II
Closed, Resolved | Public | 5 Estimated Story Points

Description

The current Cyberbot II code uses a regex to detect external links. However, this regex apparently does not detect all external links in articles. We should improve the regex so that it is more robust.

The current regex can be seen in the getExternalLinks() function at https://github.com/cyberpower678/Cyberbot_II/blob/master/deadlink.php.

Event Timeline

kaldari raised the priority of this task to Needs Triage.
kaldari updated the task description.
kaldari added a project: Community-Tech.
kaldari added subscribers: kaldari, Cyberpower678.
kaldari set Security to None.

@Cyberpower678: I removed the part about writing unit tests for the getExternalLinks() function, as currently that function can't be loaded by PHPUnit without actually triggering the bot. Let me know once that function is in a separate file from the bot execution code and I'll create a Phabricator task for making some unit tests as well.

Also, if you find any specific cases where external links aren't being detected, please post them to this task.
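One way to collect such cases is a small script that runs sample wikitext through a pattern and prints what it captures. The sketch below is in Python rather than the bot's PHP, and it only exercises the bracketed-link branch of the regex quoted in this task (with the trailing template handling left out). `CANDIDATE_BRACKET`, `SAMPLES`, and `captured_urls` are hypothetical names introduced for illustration; they are not part of Cyberbot II.

```python
import re

# Bracketed-link branch as it appears in the regex quoted in this task
# (trailing dead-link/archive template handling omitted).
CURRENT_BRACKET = re.compile(r"\[{1}((?:https?:)?\/\/.*?)\s?.*?\]{1}", re.IGNORECASE)

# Hypothetical alternative: capture everything up to the first
# whitespace or closing bracket as the URL.
CANDIDATE_BRACKET = re.compile(r"\[((?:https?:)?//[^\s\]]+)[^\]]*\]", re.IGNORECASE)

# Made-up sample wikitext; real failing cases from articles would go here.
SAMPLES = [
    "[http://example.org Example site]",
    "[//example.org/page protocol-relative link]",
    "[https://example.org]",
]


def captured_urls(pattern, text):
    """Return the first capture group of every match of pattern in text."""
    return [m.group(1) for m in pattern.finditer(text)]


for sample in SAMPLES:
    print(sample)
    print("  current:  ", captured_urls(CURRENT_BRACKET, sample))
    print("  candidate:", captured_urls(CANDIDATE_BRACKET, sample))
```

In this sketch the current branch matches each sample, but because `.*?` is lazy and `\s?` is optional, its URL capture group ends right after `//` (for example `http://` for the first sample), while the candidate captures `http://example.org`. Whether the bot relies on that capture group or re-parses the whole match is not stated in this task, so treat this as a prompt for investigation, not a confirmed bug.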

Here is the fully expanded regex for reference:

/(<ref([^\/]*?)>(.*?)((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\}.*?)?<\/ref>((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\})*|\[{1}((?:https?:)?\/\/.*?)\s?.*?\]{1}.*?((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\}\s*?)*?|((\{\{cite web|\{\{cite AV media|\{\{cite AV media notes|\{\{cite book|\{\{cite conference|\{\{cite DVD notes|\{\{cite encyclopedia|\{\{cite episode|\{\{cite interview|\{\{cite journal|\{\{cite mailing list|\{\{cite map|\{\{cite news|\{\{cite newsgroup|\{\{cite podcast|\{\{cite press release|\{\{cite report|\{\{cite serial|\{\{cite sign|\{\{cite speech|\{\{cite techreport|\{\{cite thesis|\{\{cite Hansard|\{\{vcite2 journal\{\{vancite book|\{\{vancite conference|\{\{vancite journal|\{\{vancite news|\{\{vancite web|\{\{vcite book|\{\{vcite conference|\{\{vcite journal|\{\{vcite news|\{\{vcite web|\{\{cite|\{\{citation).*?\}\})\s*?((\{\{dead\-link|\{\{dead link|\{\{404|\{\{dl|\{\{dead|\{\{broken link|\{\{deadlink|\{\{brokenlink|\{\{linkbroken|\{\{link broken|\{\{deadlinks|\{\{dead links|\{\{badlink|\{\{dead page|\{\{dL|\{\{dead Link|\{\{dead url|\{\{dead cite|\{\{deadcite|\{\{bad link|\{\{deadurl|\{\{dead\-inline|\{\{ded link|\{\{wayback|\{\{waybackdate|\{\{iAWM|\{\{iawm|\{\{internetarchive|\{\{webarchive|\{\{wayBack|\{\{archive url|\{\{url archive|\{\{web archive|\{\{webarchiv|\{\{archive\.org|\{\{webCite|\{\{webcite|\{\{webcitation|\{\{cbignore).*?\}\}\s*?)*?)/i
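The expanded regex repeats the same dead-link/archive template alternation four times, and the cite-template list appears to contain `\{\{vcite2 journal\{\{vancite book` with no `|` between the two names, so neither template would match on its own. A possible way to make such slips harder is to generate each alternation from a list of names. The following is a minimal Python sketch of that idea, not the bot's PHP code; the lists are shortened subsets, and `template_alternation` and `build_cite_pattern` are hypothetical helper names.

```python
import re

# Illustrative subsets only; the full name lists are in the regex above.
TAG_TEMPLATES = ["dead link", "dead-link", "deadlink", "wayback", "webcite", "cbignore"]
CITE_TEMPLATES = ["cite web", "cite news", "vcite2 journal", "vancite book", "citation"]


def template_alternation(names):
    """Return a group matching '{{' followed by any of the given names."""
    # Joining with '|' here means a separator can never be forgotten.
    return r"\{\{(?:" + "|".join(re.escape(name) for name in names) + ")"


def build_cite_pattern(cite_names, tag_names):
    """Match a citation template plus any tag templates that follow it."""
    cite = template_alternation(cite_names)
    tag = template_alternation(tag_names)
    return re.compile(
        "(" + cite + r".*?\}\})\s*((?:" + tag + r".*?\}\}\s*)*)",
        re.IGNORECASE,
    )


PATTERN = build_cite_pattern(CITE_TEMPLATES, TAG_TEMPLATES)

text = "{{vancite book|title=X|url=http://example.org}} {{dead link|date=May 2016}}"
match = PATTERN.search(text)
print(match.group(1))  # the citation template
print(match.group(2))  # the dead-link template that follows it
```

Because the pattern is compiled case-insensitively (as the original is, via `/i`), entries that differ only in case, such as `dl` and `dL`, would need to be listed only once.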

kaldari triaged this task as Medium priority. (Jan 29 2016, 6:51 PM)
kaldari moved this task from Needs Discussion to Up Next (May 6-17) on the Community-Tech board.