
Convert weblinkchecker to requests
Closed, Resolved, Public

Description

weblinkchecker should be converted to requests.

The class LinkChecker should be deprecated using @deprecated, and otherwise not modified. The class LinkChecker is too intrinsically tied to httplib to be enhanced without causing backwards compatibility bugs.

The function check may also be deprecated using @deprecated, or rewritten using requests.
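As a rough illustration of what deprecating check with a decorator could look like, here is a minimal sketch. The deprecated decorator below is a hypothetical stand-in built on the standard warnings module, not pywikibot's own @deprecated, and the body of check is a placeholder rather than the script's real logic.

```python
import functools
import warnings


def deprecated(instead=None):
    """Hypothetical stand-in for pywikibot's @deprecated decorator."""
    def decorator(func):
        message = '{} is deprecated'.format(func.__name__)
        if instead:
            message += '; use {} instead'.format(instead)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller's line.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(instead='requests')
def check(url):
    """Placeholder body: the legacy function itself stays unmodified."""
    return url.startswith(('http://', 'https://'))
```

The wrapped function keeps its behaviour and name; only a warning is added on each call, which is the point of deprecating rather than rewriting.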

LinkCheckThread needs to be revised: the first half of its run method should be rewritten with equivalent functionality built on requests.

It only needs to perform an HTTP fetch (via pywikibot.comms.http.fetch or requests directly), check the HTTP status code against HTTPignore, and then call the self.history methods as it already does.

It should skip any URL which matches the pattern in ignorelist.
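The flow described above (skip ignored URLs, fetch, compare the status code against HTTPignore, capture a message for dead links) might be sketched as follows. This is only an outline under stated assumptions: check_link, HTTP_IGNORE, and IGNORE_LIST are hypothetical names, the example pattern is invented, treating ignored status codes as "alive" is an assumption, and the fetch callable is injected so the sketch does not depend on a particular HTTP layer. The response is assumed to expose status_code and reason, as requests responses do.

```python
import re

# Hypothetical stand-ins for the script's HTTPignore and ignorelist settings.
HTTP_IGNORE = ()
IGNORE_LIST = (
    re.compile(r'https?://([^/]+\.)?example\.(com|org)(/.*)?$'),
)


def check_link(url, fetch, http_ignore=HTTP_IGNORE, ignore_list=IGNORE_LIST):
    """Return (alive, message); alive is None when the URL was skipped."""
    # Skip any URL that matches a pattern in the ignore list.
    if any(pattern.match(url) for pattern in ignore_list):
        return None, 'skipped'
    try:
        response = fetch(url)
    except Exception as error:  # e.g. SSL or connection failures
        # Keep the exception text so it can be recorded for the dead link.
        return False, '{}: {}'.format(type(error).__name__, error)
    status = response.status_code
    # Assumption: status codes listed in http_ignore are not reported as dead.
    if status < 400 or status in http_ignore:
        return True, str(status)
    return False, '{} {}'.format(status, response.reason)
```

With requests installed, the fetch argument could be something like `lambda u: requests.get(u, timeout=30)`. The caller would then pass the result to the existing self.history methods.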

To test, use https://en.wikipedia.org/wiki/User:John_Vandenberg/test_T113596

$ rm -rf deadlinks/
$ python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T113596 -day:0 -talk
....

$ python -c "import pickle, pprint; print(pprint.pformat(pickle.load(open('deadlinks/deadlinks-wikipedia-en.dat', 'rb'))))"
{u'http://httpbin.org/basic-auth/user/passwd': [(u'User:John Vandenberg/test T113596',
                                                 1453205828.731368,
                                                 u'401 UNAUTHORIZED')],
 u'http://httpbin.org/status/404': [(u'User:John Vandenberg/test T113596',
                                     1453205828.694601,
                                     u'404 NOT FOUND')],
 u'http://httpbin.org/status/410': [(u'User:John Vandenberg/test T113596',
                                     1453205828.74228,
                                     u'410 GONE')],
 u'http://www.admi.net/jo/20010615%C2%A6/ECOC0100037D.html': [(u'User:John Vandenberg/test T113596',
                                                               1453205828.913141,
                                                               u'404 Not Found')],
 u'https://self-signed.pythontest.net/': [(u'User:John Vandenberg/test T113596',
                                           1453205828.191287,
                                           u"Socket Error: u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)'")]}

$ python pwb.py weblinkchecker -family:wikipedia -lang:en -page:User:John_Vandenberg/test_T113596 -day:0 -talk
....

The second run should also post a notice on https://en.wikipedia.org/wiki/User_talk:John_Vandenberg/test_T113596 (blank that page after each test run).

To test the deprecation, run:

$ python pwb.py shell -log -debug -verbose
...
>>> from scripts import weblinkchecker
>>> weblinkchecker.check('http://google.com')

It should produce a useful message.
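Outside the pywikibot shell, one way to confirm that a deprecated call emits a message is to record warnings explicitly, since Python's default filters can hide DeprecationWarning. The sketch below assumes the deprecation is signalled through the standard warnings module; the check function and its message text are hypothetical stand-ins for weblinkchecker.check, and collect_deprecations is an illustrative helper.

```python
import warnings


def check(url):
    """Hypothetical deprecated function standing in for weblinkchecker.check."""
    warnings.warn('check is deprecated; use requests instead',
                  DeprecationWarning, stacklevel=2)
    return True


def collect_deprecations(func, *args, **kwargs):
    """Call func and return the DeprecationWarning messages it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # default filters may hide these
        func(*args, **kwargs)
    return [str(w.message) for w in caught
            if issubclass(w.category, DeprecationWarning)]


messages = collect_deprecations(check, 'http://google.com')
print(messages)
```

An empty list would mean that no deprecation warning was raised for the call.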

Event Timeline

jayvdb claimed this task.
jayvdb raised the priority of this task to Needs Triage.
jayvdb updated the task description.
jayvdb added subscribers: gerritbot, jayvdb, Ricordisamoa and 2 others.
jayvdb triaged this task as Medium priority. Jan 19 2016, 12:36 PM
jayvdb updated the task description.

@jayvdb,
I'm pretty lost. I'm not sure what needs to be rewritten using requests. Are you saying I should not use LinkChecker at all? I guess I just don't really understand what the function I'm changing is doing. Please guide me a little.
Thanks,
MtDu

@MtDu, the function is checking that each link is alive. It is very similar to reflinks.py. Do a fetch, check the response, and for non-OK status codes capture the error message / exception message.

Change 265398 had a related patch set uploaded (by MtDu):
[WIP] Convert weblinkchecker to requests

https://gerrit.wikimedia.org/r/265398

Change 265398 merged by jenkins-bot:
Convert weblinkchecker to requests

https://gerrit.wikimedia.org/r/265398

jayvdb reassigned this task from jayvdb to MtDu.
jayvdb removed a project: Patch-For-Review.