weblinkchecker should not process URLs in fullurl
Closed, Resolved, Public

Description

weblinkchecker.py treats the target inside {{fullurl:Special:LinkSearch|target=http://*.books.google.com}}, as found on https://en.wikipedia.org/wiki/User:Emijrp/External_Links_Ranking, as an external link and tries to check http://*.books.google.com:

$ python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:Emijrp/External_Links_Ranking
Retrieving 1 pages from wikipedia:en.


>>> User:Emijrp/External Links Ranking <<<
checking http://download.wikimedia.org/enwiki/20140614/enwiki-20140614-externallinks.sql.gz
checking http://dumps.wikimedia.org/enwiki/
checking http://www.google.com
checking http://google.com
checking http://books.google.com
checking http://google.com
checking http://dx.doi.org
checking http://*.dx.doi.org
checking http://books.google.com
*[[User:Emijrp/External Links Ranking]] links to http://*.dx.doi.org - Socket Error: 'No address associated with hostname'.
checking http://*.books.google.com
ignoring http://web.archive.org due to .*[\./@]web\.archive\.org(/.*)?
ignoring http://*.web.archive.org due to .*[\./@]web\.archive\.org(/.*)?
*[[User:Emijrp/External Links Ranking]] links to http://*.books.google.com - Socket Error: 'No address associated with hostname'.
checking http://google.com
http://www.google.com ok
checking http://*.google.com
...

URLs with * are now excluded due to T124142: Do not process URLs containing * in domain name, so here is a simpler test case for this bug:

python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T124141 -day:0 -verbose -log -debug

Also useful is the test command created for T113140: Convert weblinkchecker to requests, which should still work after this is fixed.
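To illustrate the requested behavior, here is a minimal Python sketch that drops {{fullurl:...}} calls from the wikitext before external links are extracted. It is not the code from weblinkchecker.py or from any patch; the names FULLURL_RE, URL_RE, strip_fullurl and links_to_check are hypothetical, the URL pattern is deliberately simplified, and case-insensitive matching of "fullurl" is an assumption.

```python
import re

# Hypothetical pattern: a {{fullurl:...}} call without nested templates.
# Case-insensitive matching is an assumption made for this sketch.
FULLURL_RE = re.compile(r'\{\{\s*fullurl\s*:[^{}]*\}\}', re.IGNORECASE)

# Simplified URL matcher for illustration only; the real script uses
# a more elaborate pattern.
URL_RE = re.compile(r'https?://[^\s\]\[<>"|{}]+')


def strip_fullurl(text):
    """Remove {{fullurl:...}} calls so their targets are not seen as links."""
    return FULLURL_RE.sub('', text)


def links_to_check(text):
    """Return external links found outside of {{fullurl:...}} calls."""
    return URL_RE.findall(strip_fullurl(text))


sample = ('{{fullurl:Special:LinkSearch|target=http://*.books.google.com}} '
          'and [http://www.google.com Google]')

print(URL_RE.findall(sample))   # naive extraction also picks up the fullurl target
print(links_to_check(sample))   # only the genuine external link remains
```

As noted in the discussion on this task, a URL may also reach the final wikitext through templates, so a textual filter like this only covers the direct {{fullurl:...}} case.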

Event Timeline

jayvdb raised the priority of this task from to Low.
jayvdb updated the task description.
jayvdb subscribed.
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.
jayvdb set Security to None.
jayvdb renamed this task from "Do not process URLs containing *" to "weblinkchecker should not process URLs in fullurl". Jan 20 2016, 7:29 AM
jayvdb updated the task description.
jayvdb removed a project: Pywikibot-Scripts.

https://en.wikipedia.org/wiki/User:Emijrp/External_Links_Ranking exists in 33 languages of Wikipedia, so it frequently appears in -weblink:<domain> results.

T124142: Do not process URLs containing * in domain name is another approach to prevent weblinkchecker choking on this page.
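The alternative approach from T124142 can be sketched as a check on the host part of each URL. This is only an illustration under stated assumptions, not the actual change made for that task; has_wildcard_host is a hypothetical helper, and the sample URLs are taken from the output in the task description.

```python
from urllib.parse import urlsplit


def has_wildcard_host(url):
    """Return True if the host part of the URL contains a '*'."""
    # Hypothetical helper for this sketch; urlsplit does not validate
    # the host, so '*.example.org' is returned unchanged as netloc.
    return '*' in urlsplit(url).netloc


urls = [
    'http://dx.doi.org',
    'http://*.dx.doi.org',
    'http://books.google.com',
    'http://*.books.google.com',
]

# Keep only URLs that could resolve to a real host.
checkable = [url for url in urls if not has_wildcard_host(url)]
print(checkable)
```

Filtering on the host alone leaves URLs with a '*' in the path or query untouched, which matches the "in domain name" wording of T124142.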

Change 265651 had a related patch set uploaded (by MtDu):
Ignore URLs in fullurl

https://gerrit.wikimedia.org/r/265651

Regarding @Ricordisamoa's comment on https://gerrit.wikimedia.org/r/#/c/265651/: Ricordisamoa, can you give valid syntax for a URL inside localurl, canonicalurl, filepath, or any others?

My bad, I thought the script was checking URLs containing magic words.
It's not easy to tell whether and how a URL is going to be used in the final wikitext (and thus whether it should be checked); it may get passed to templates, etc.

@Ricordisamoa,
So I only need to check fullurl?
Thanks,
MtDu

Another test command:

python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T124141 -day:0 -verbose -log -debug

Change 265651 merged by jenkins-bot:
Ignore URLs in fullurl template

https://gerrit.wikimedia.org/r/265651

@jayvdb,
Marking this task as resolved.