Browser tests running against beta all failing because of mw-api-siteinfo.py
Closed, ResolvedPublic

Description

Started happening consistently on the 4th. Examples:

https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/557/console
https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/558/console
https://integration.wikimedia.org/ci/job/browsertests-MultimediaViewer-en.wikipedia.beta.wmflabs.org-os_x_10.9-safari-sauce/559/console

08:12:09 Traceback (most recent call last):
08:12:09 File "/srv/deployment/integration/slave-scripts/bin/mw-api-siteinfo.py", line 92, in <module>
08:12:09 main()
08:12:09 File "/srv/deployment/integration/slave-scripts/bin/mw-api-siteinfo.py", line 77, in main
08:12:09 response = requests.get(mw_api_url, params=API_QUERY)
08:12:09 File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
08:12:09 return request('get', url, **kwargs)
08:12:09 File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
08:12:09 return session.request(method=method, url=url, **kwargs)
08:12:09 File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 455, in request
08:12:09 resp = self.send(prep, **send_kwargs)
08:12:09 File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 558, in send
08:12:09 r = adapter.send(request, **kwargs)
08:12:09 File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 378, in send
08:12:09 raise ConnectionError(e)
08:12:09 requests.exceptions.ConnectionError: HTTPConnectionPool(host='en.wikipedia.beta.wmflabs.org', port=80): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&siprop=general&format=json (Caused by <class 'socket.error'>: [Errno 110] Connection timed out)

The URL being requested works fine right now: http://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json

Maybe the format changed and the python script keeps retrying and not finding what it needs? (I haven't looked at the source yet)
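To tell a connectivity failure apart from a format change, the request and the parsing can be separated. What follows is a minimal sketch, not the actual mw-api-siteinfo.py source: the helper names `extract_general` and `fetch_general` and the 10-second timeout are assumptions made for illustration, and it uses Python 3's standard library rather than the `requests` package seen in the traceback.

```python
import json
import socket
import urllib.error
import urllib.request

API_URL = (
    "http://en.wikipedia.beta.wmflabs.org/w/api.php"
    "?action=query&meta=siteinfo&siprop=general&format=json"
)


def extract_general(payload):
    """Return the 'general' siteinfo block, or raise ValueError if the format is off."""
    try:
        return payload["query"]["general"]
    except (KeyError, TypeError):
        raise ValueError("unexpected siteinfo format")


def fetch_general(url=API_URL, timeout=10):
    """Fetch siteinfo with an explicit timeout (hypothetical helper).

    Network problems surface as ConnectionError; format problems as ValueError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, socket.timeout) as exc:
        raise ConnectionError("could not reach %s: %s" % (url, exc))
    return extract_general(payload)
```

With this split, a timeout like the one in the traceback above would be reported as a ConnectionError, while a changed response format would raise ValueError instead.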

Event Timeline

Gilles raised the priority of this task from to High.
Gilles updated the task description.
Gilles subscribed.

Browser tests are definitely in bad shape across the board but beta is probably to blame. The only ones that stayed green are running against non-beta sites.

Gilles renamed this task from Browser test failing because of mw-api-siteinfo.py to Browser tests failing because of mw-api-siteinfo.py. Apr 6 2015, 11:15 AM
Gilles renamed this task from Browser tests failing because of mw-api-siteinfo.py to Browser tests running against beta all failing because of mw-api-siteinfo.py.
Gilles raised the priority of this task from High to Unbreak Now!.
Gilles set Security to None.

Tried to run a test manually and the issue is still happening.

Tried hitting the API url in a SauceLabs interactive session and it works fine:
https://saucelabs.com/tests/566c6261360e4c9c8f74f8cf09745f2c

Although, if I follow the script correctly, the HTTP request probably runs on the integration-slave* machines and not on SauceLabs, which only comes into play later.

If someone who's an admin on labs for the "Integration" project could add me to it, that'd be great. Right now I can't SSH into the integration-slave* machines to troubleshoot this further.

Requesting the beta API url from an unrelated labs instance works fine.

The issue, which looks like a network/connectivity problem, actually seems to have started on April 3rd, and affects other URLs as well: https://integration.wikimedia.org/ci/job/UploadWizard-api-commons.wikimedia.beta.wmflabs.org/1729/console

Sorry about the radio silence here.

If someone who's an admin on labs for the "Integration" project could add me to it, that'd be great. Right now I can't SSH into the integration-slave* machines to troubleshoot this further.

Should now be done. You've even now got sudo :)

The root cause is that the CI instances were migrated to a new DNS resolver which, for *.beta.wmflabs.org, replied with the public IP address instead of the private IP of the instance. Since labs public IPs are not reachable from within labs due to NAT, the beta cluster was no longer reachable from the CI instances.
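For anyone hitting this again, a quick check from an affected instance is to see whether the local resolver hands back a private or a public address for the beta hostname. This is only a rough sketch in Python 3: the function names `resolve_ipv4` and `classify` are made up for illustration, and it relies on the standard `ipaddress` module's notion of private ranges rather than on any labs-specific address list.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def classify(addresses):
    """Label each address string as 'private' or 'public'."""
    return {
        addr: "private" if ipaddress.ip_address(addr).is_private else "public"
        for addr in addresses
    }


if __name__ == "__main__":
    host = "en.wikipedia.beta.wmflabs.org"
    try:
        print(classify(resolve_ipv4(host)))
    except socket.gaierror as exc:
        print("resolution failed:", exc)
```

If a labs instance sees "public" for a *.beta.wmflabs.org name, requests from that instance would be expected to time out in the way shown in the task description.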

It is all good now :)