
archive.org test URL scheme error
Closed, Resolved · Public

Description

There are regularly errors in the builds due to testInternetArchiveNewest

FAIL: testInternetArchiveNewest (tests.weblib_tests.TestInternetArchive)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/wikimedia/pywikibot-core/tests/weblib_tests.py", line 35, in testInternetArchiveNewest
    self.assertIn(parsed.scheme, [u'http', u'https'])
AssertionError: b'' not found in ['http', 'https']

https://travis-ci.org/wikimedia/pywikibot-core/jobs/69519308#L3529
https://travis-ci.org/wikimedia/pywikibot-core/jobs/69539877#L3528

I can't say for certain, but I think these have only started occurring since the switch to requests.

Event Timeline

jayvdb assigned this task to VcamX.
jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description. (Show Details)
jayvdb added projects: Pywikibot, Pywikibot-OAuth.
jayvdb subscribed.

Another very strange error:

ERROR: testInternetArchiveNewest (tests.weblib_tests.TestInternetArchive)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/wikimedia/pywikibot-core/tests/weblib_tests.py", line 34, in testInternetArchiveNewest
    parsed = urlparse(archivedversion)
  File "/opt/python/2.7.9/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/opt/python/2.7.9/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'

https://travis-ci.org/wikimedia/pywikibot-core/jobs/69526722#L2916
https://travis-ci.org/wikimedia/pywikibot-core/jobs/69914515#L5730

And another, but this one looks quite different to the other two:

======================================================================
ERROR: testInternetArchiveNewest (tests.weblib_tests.TestInternetArchive)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5-dev/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 372, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5-dev/lib/python3.5/site-packages/requests/packages/urllib3/connectionpool.py", line 374, in _make_request
    httplib_response = conn.getresponse()
  File "/opt/python/3.5-dev/lib/python3.5/http/client.py", line 1174, in getresponse
    response.begin()
  File "/opt/python/3.5-dev/lib/python3.5/http/client.py", line 282, in begin
    version, status, reason = self._read_status()
  File "/opt/python/3.5-dev/lib/python3.5/http/client.py", line 243, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/python/3.5-dev/lib/python3.5/socket.py", line 571, in readinto
    return self._sock.recv_into(b)
  File "/opt/python/3.5-dev/lib/python3.5/ssl.py", line 924, in recv_into
    return self.read(nbytes, buffer)
  File "/opt/python/3.5-dev/lib/python3.5/ssl.py", line 786, in read
    return self._sslobj.read(len, buffer)
  File "/opt/python/3.5-dev/lib/python3.5/ssl.py", line 570, in read
    v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

...

https://travis-ci.org/wikimedia/pywikibot-core/jobs/69526733#L2843

Hmm, maybe it returns the URL without a protocol, like //en.wikipedia.org?
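If that were the case, urlparse would report an empty scheme, which is consistent with the first assertion failure; the b'' in that failure further suggests the URL was a bytes object. A quick check:

```python
from urllib.parse import urlparse  # `from urlparse import urlparse` on Python 2

# A protocol-relative URL parses with an empty scheme, so
# assertIn(parsed.scheme, ['http', 'https']) would fail.
parsed = urlparse('//en.wikipedia.org/wiki/Example')
print(repr(parsed.scheme))  # ''
print(parsed.netloc)        # en.wikipedia.org

# With bytes input, urlparse returns bytes components, so the
# scheme would be b'' rather than '' -- matching the failure message.
print(repr(urlparse(b'//en.wikipedia.org').scheme))  # b''
```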

jayvdb set Security to None.

New failures where the URL seems to be None: https://travis-ci.org/jayvdb/pywikibot-core/jobs/70398518#L5298

Also, my idea in T104761#1426386 doesn't make sense, as the URL itself is None, not just the protocol.

And it's only None when "closest" wasn't found in the JSON text. So maybe we need some way to actually see the text that was returned.


http://archive.org/help/wayback_api.php says this about the http://archive.org/wayback/available?url=example.com API (which the test suite uses):

This simple API for Wayback is a test to see if a given url is archived and currently accessible in the Wayback Machine.

If the url is not available (not archived or currently not accessible), the response will be:

{"archived_snapshots":{}}

The source code shows that when no "closest" entry is found, getInternetArchiveURL will return None. Maybe the newest archive of the test URL (google.com) is sometimes not accessible in the Wayback Machine due to some kind of network problem.
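A minimal sketch of that behaviour (the helper name is hypothetical; the JSON shape is taken from the API documentation quoted above):

```python
import json

def newest_archive_url(response_text):
    """Return the newest snapshot URL, or None when the availability
    API found nothing, i.e. it answered {"archived_snapshots":{}}."""
    data = json.loads(response_text)
    closest = data.get('archived_snapshots', {}).get('closest')
    if closest is None:
        return None
    return closest.get('url')

# No snapshot available -> None, which later crashes urlparse in the test.
print(newest_archive_url('{"archived_snapshots":{}}'))  # None

found = ('{"archived_snapshots": {"closest": {"available": true, '
         '"url": "http://web.archive.org/web/2015/http://example.com/"}}}')
print(newest_archive_url(found))
```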

For the other exceptions, it seems to me that they're related to the comms.http module. Maybe requests is misused? But I tested them on my laptop and they worked fine. @jayvdb mentioned that they occur regularly; maybe we could track their frequency and decide what to do next.

I'll look into monkey patching the test so that we can see what we get and if @VcamX's analysis applies.

Change 224644 had a related patch set uploaded (by XZise):
[IMPROV] tests: General patcher for http module

https://gerrit.wikimedia.org/r/224644

Change 224644 merged by jenkins-bot:
[IMPROV] tests: General patcher for http module

https://gerrit.wikimedia.org/r/224644

Okay, we now had a failure, and it looks like @VcamX's analysis is correct: https://travis-ci.org/wikimedia/pywikibot-core/jobs/74033957#L4255

That was in testInternetArchiveNewest, and then https://travis-ci.org/wikimedia/pywikibot-core/jobs/74031910#L4508 occurred in testInternetArchiveOlder.

As we know this URL is archived, it must be a transient error at Archive.org. So I think the solution is to change these tests so that they first check whether weblib.getInternetArchiveURL returned None, and skip the test if it did.
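A sketch of that change, with get_archive_url as a stub standing in for weblib.getInternetArchiveURL (here it always returns None, to simulate the transient failure):

```python
import unittest

def get_archive_url(url):
    """Stub for weblib.getInternetArchiveURL: returns None to
    simulate a transient Archive.org failure."""
    return None

class TestInternetArchive(unittest.TestCase):
    def test_newest(self):
        archivedversion = get_archive_url('https://google.com')
        if archivedversion is None:
            # Transient service error: skip instead of failing the build.
            self.skipTest('Wayback Machine returned no snapshot')
        self.assertIn(archivedversion.split(':', 1)[0], ['http', 'https'])

suite = unittest.TestLoader().loadTestsFromTestCase(TestInternetArchive)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.skipped))  # 1 -- the test was skipped, not failed
```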

The harder solution is to detect the transient error from archive.org and raise an exception.
But if the weblib methods raised an exception, that would break the weblib API, as users currently expect None on failure.

To work around that, we could introduce new function names that raise an exception, and have the old function names catch the exception and return None.
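A sketch of that shape (all names here are illustrative, not the actual weblib API; `snapshot` stands in for the result of querying the availability API):

```python
class NoArchiveFoundError(Exception):
    """Raised when the Wayback Machine has no accessible snapshot."""

def fetch_internet_archive_url(url, snapshot=None):
    """New-style function: raises on failure."""
    if snapshot is None:
        raise NoArchiveFoundError(url)
    return snapshot

def getInternetArchiveURL(url, snapshot=None):
    """Old name, kept backward compatible: swallow the exception and
    return None, as current callers expect."""
    try:
        return fetch_internet_archive_url(url, snapshot)
    except NoArchiveFoundError:
        return None

print(getInternetArchiveURL('http://example.com'))  # None
```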

scripts/weblinkchecker.py should be updated to use the new function names.

However, rather than patching weblib, I think the better solution is to simply push weblib out of pywikibot into a new package with a new design. e.g. T85001

Before 'resolving' this by checking for None, maybe it is worthwhile testing the server status code.
I see at https://github.com/kpurdon/waybacklapse/blob/master/wayback/wayback.py#L41: if res.status_code == 503: ... raise Exception

(Quoting the third traceback from the description, the getresponse() 'buffering' TypeError ending in socket.timeout: https://travis-ci.org/wikimedia/pywikibot-core/jobs/69526733#L2843)

This one was raised upstream by @XZise, where it was closed as unfixable, which isn't strictly true; in any case it has effectively been declined.
https://github.com/shazow/urllib3/issues/682

> Before 'resolving' this by checking for None, maybe it is worthwhile testing the server status code.
> I see at https://github.com/kpurdon/waybacklapse/blob/master/wayback/wayback.py#L41: if res.status_code == 503: ... raise Exception

Confirmed, res.status_code is 503.

$ wget -S 'https://archive.org/wayback/available?url=https://blah.com'
--2015-08-06 11:34:04--  https://archive.org/wayback/available?url=https://blah.com
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 503 Service Temporarily Unavailable
  Server: nginx/1.4.6 (Ubuntu)
  Date: Thu, 06 Aug 2015 01:34:05 GMT
  Content-Type: application/javascript
  Transfer-Encoding: chunked
  Connection: keep-alive
  X-Powered-By: PHP/5.5.9-1ubuntu4.11
2015-08-06 11:34:05 ERROR 503: Service Temporarily Unavailable.


$ python -c 'import requests; r = requests.get("https://archive.org/wayback/available?url=https://blah.com"); print(r.status_code, r.text);'
(503, u'{"archived_snapshots":{}}')

Change 229620 had a related patch set uploaded (by John Vandenberg):
Check Internet Archive API URL is available

https://gerrit.wikimedia.org/r/229620

Change 230741 had a related patch set uploaded (by XZise):
[FIX] weblib_tests: Expect archive test to fail

https://gerrit.wikimedia.org/r/230741

Change 230741 abandoned by XZise:
[FIX] weblib_tests: Expect archive test to fail

Reason:
Due to 3bf9bfac the tests aren't executed anyway.

https://gerrit.wikimedia.org/r/230741

Effectively resolved by re-implementing the underlying code in T85001.

Change 229620 abandoned by John Vandenberg:
Check Internet Archive API URL is available

Reason:
can revisit if the IA API starts working again.

https://gerrit.wikimedia.org/r/229620

Change 229620 restored by John Vandenberg:
Check Internet Archive API URL is available

https://gerrit.wikimedia.org/r/229620

Change 229620 merged by jenkins-bot:
Check Internet Archive API URL is available

https://gerrit.wikimedia.org/r/229620