weblinkchecker should not process URLS in fullurl
Closed, ResolvedPublic

Assigned To

Authored By

	jayvdb
	Jan 20 2016, 7:16 AM

Description

weblinkchecker.py treats the target parameter of {{fullurl:Special:LinkSearch|target=http://*.books.google.com}}, as found on https://en.wikipedia.org/wiki/User:Emijrp/External_Links_Ranking, as an external link and tries to check http://*.books.google.com:

$ python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:Emijrp/External_Links_Ranking
Retrieving 1 pages from wikipedia:en.


>>> User:Emijrp/External Links Ranking <<<
checking http://download.wikimedia.org/enwiki/20140614/enwiki-20140614-externallinks.sql.gz
checking http://dumps.wikimedia.org/enwiki/
checking http://www.google.com
checking http://google.com
checking http://books.google.com
checking http://google.com
checking http://dx.doi.org
checking http://*.dx.doi.org
checking http://books.google.com
*[[User:Emijrp/External Links Ranking]] links to http://*.dx.doi.org - Socket Error: 'No address associated with hostname'.
checking http://*.books.google.com
ignoring http://web.archive.org due to .*[\./@]web\.archive\.org(/.*)?
ignoring http://*.web.archive.org due to .*[\./@]web\.archive\.org(/.*)?
*[[User:Emijrp/External Links Ranking]] links to http://*.books.google.com - Socket Error: 'No address associated with hostname'.
checking http://google.com
http://www.google.com ok
checking http://*.google.com
...

URLs with * are now excluded due to T124142: Do not process URLs containing * in domain name, so here is a simpler test case for this bug:

python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T124141 -day:0 -verbose -log -debug

Also useful is the test command created for T113140: Convert weblinkchecker to requests, which should still work after this is fixed.
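To illustrate the requested behaviour, here is a minimal sketch that drops {{fullurl:...}} calls from the wikitext before looking for external links. It is not the merged patch: the names FULLURL_CALL, URL and links_outside_fullurl are hypothetical, the URL pattern is deliberately simplistic, and the sketch assumes the fullurl call contains no nested braces.

```python
import re

# Hypothetical pattern: a {{fullurl:...}} call with no nested braces inside.
FULLURL_CALL = re.compile(r'\{\{\s*fullurl:[^{}]*\}\}', re.IGNORECASE)

# Deliberately simple URL pattern, for illustration only; the real script
# uses its own, more careful link regex.
URL = re.compile(r'https?://[^\s|\]\[<>"{}]+')


def links_outside_fullurl(wikitext):
    """Return URLs found in wikitext, skipping any inside {{fullurl:...}}."""
    stripped = FULLURL_CALL.sub('', wikitext)
    return URL.findall(stripped)


sample = ('See {{fullurl:Special:LinkSearch|target=http://*.books.google.com}} '
          'and http://books.google.com for details.')
print(links_outside_fullurl(sample))
```

Stripping the whole parser-function call, rather than filtering individual URLs afterwards, keeps the check independent of what the target parameter happens to contain.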

Details

	Subject	Repo	Branch	Lines +/-
	Ignore URLs in fullurl template	pywikibot/core	master	+3 -0


Related Objects

		Status	Subtype	Assigned	Task
		Declined	BUG REPORT	None	T57276 weblinkchecker should ignore URLs inside some tags, part 2
		Resolved		MtDu	T124141 weblinkchecker should not process URLS in fullurl

Event Timeline

jayvdb created this task. Jan 20 2016, 7:16 AM

jayvdb raised the priority of this task from to Low.

jayvdb updated the task description.

jayvdb added projects: Pywikibot-weblinkchecker.py, Pywikibot-Scripts, good first task, Google-Code-In-2015.

jayvdb subscribed.

Restricted Application added a project: Internet-Archive. Jan 20 2016, 7:16 AM

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

jayvdb edited projects, added Pywikibot; removed Internet-Archive. Jan 20 2016, 7:17 AM

jayvdb set Security to None.

jayvdb renamed this task from "Do not process URLs containing *" to "weblinkchecker should not process URLS in fullurl". Jan 20 2016, 7:29 AM

jayvdb updated the task description.

jayvdb removed a project: Pywikibot-Scripts.

jayvdb moved this task from Backlog to Ready to go on the Pywikibot-weblinkchecker.py board. Jan 20 2016, 7:56 AM

jayvdb added a parent task: T57276: weblinkchecker should ignore URLs inside some tags, part 2.

GCI task: https://codein.withgoogle.com/dashboard/tasks/5158918728712192/

jayvdb moved this task from Proposed to Imported into GCI site on the Google-Code-In-2015 board. Jan 20 2016, 8:01 AM

https://en.wikipedia.org/wiki/User:Emijrp/External_Links_Ranking exists in 33 languages of Wikipedia, so it frequently appears in -weblink:<domain> results.

T124142: Do not process URLs containing * in domain name is another approach to prevent weblinkchecker choking on this page.
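For comparison, the alternative approach from T124142 can be sketched as a filter on the host part of each URL. This is only an illustration under stated assumptions: has_wildcard_host is a hypothetical helper, not a function from weblinkchecker.py, and it relies on the standard library's urllib.parse.urlsplit to isolate the hostname.

```python
from urllib.parse import urlsplit


def has_wildcard_host(url):
    """Return True if the host part of url contains a '*' wildcard."""
    # hostname is None for URLs without a network location.
    host = urlsplit(url).hostname or ''
    return '*' in host


candidates = ['http://*.dx.doi.org', 'http://dx.doi.org',
              'http://example.org/a*b']
# Keep only URLs whose host could actually be resolved by DNS.
checkable = [url for url in candidates if not has_wildcard_host(url)]
print(checkable)
```

A host filter like this stops the socket errors on wildcard domains but still checks any well-formed URL that appears inside a fullurl call, which is why the two tasks are tracked separately.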

Change 265651 had a related patch set uploaded (by MtDu):
Ignore URLs in fullurl

https://gerrit.wikimedia.org/r/265651

gerritbot added a project: Patch-For-Review. Jan 21 2016, 11:00 PM

Regarding @Ricordisamoa's comment on https://gerrit.wikimedia.org/r/#/c/265651/: Ricordisamoa, can you give valid syntax for a URL inside localurl, canonicalurl, filepath, or any others?

A list of these magic words is at https://www.mediawiki.org/wiki/Help:Magic_words#URL_data

In T124141#1953926, @jayvdb wrote:

Regarding @Ricordisamoa's comment on https://gerrit.wikimedia.org/r/#/c/265651/: Ricordisamoa, can you give valid syntax for a URL inside localurl, canonicalurl, filepath, or any others?

My bad, I thought the script was checking URLs containing magic words.
It's not easy to tell whether and how a URL is going to be used in the final wikitext (and thus whether it should be checked); it may get passed to templates, etc.

@Ricordisamoa,
So I only need to check fullurl?
Thanks,
MtDu

Another test command:

python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T124141 -day:0 -verbose -log -debug

jayvdb updated the task description. Jan 22 2016, 8:57 PM

Change 265651 merged by jenkins-bot:
Ignore URLs in fullurl template

https://gerrit.wikimedia.org/r/265651

@jayvdb,
Marking this task as resolved.

MtDu closed this task as Resolved. Jan 24 2016, 1:58 AM
