Toolforge returns HTTP 502 error
Closed, Resolved (Public)

Description

The EditGroups tool currently returns a "502 Bad Gateway" when requesting its main page: https://editgroups.toolforge.org/ (or in fact any other page, as far as I can tell).

In my experience this happens regularly (I am not sure why). When the error persists, I generally try to restart the service with webservice --backend kubernetes python3.7 restart.

When running this command, I currently get:

Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 438, in <module>
    if job.get_state() != Backend.STATE_RUNNING:
  File "/usr/lib/python2.7/dist-packages/toolsws/backends/kubernetes.py", line 464, in get_state
    pods = self._find_objs("pods", self.webservice_label_selector)
  File "/usr/lib/python2.7/dist-packages/toolsws/backends/kubernetes.py", line 244, in _find_objs
    objs = self.api.get_objects(kind, selector=selector)
  File "/usr/lib/python2.7/dist-packages/toolsws/backends/kubernetes.py", line 596, in get_objects
    version=K8sClient.VERSIONS[kind],
  File "/usr/lib/python2.7/dist-packages/toolsws/backends/kubernetes.py", line 575, in _get
    r = self.session.get(**self._make_kwargs(url, **kwargs))
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='k8s.tools.eqiad1.wikimedia.cloud', port=6443): Max retries exceeded with url: /api/v1/namespaces/tool-editgroups/pods?labelSelector=tools.wmflabs.org%2Fwebservice-version%3D1%2Ctools.wmflabs.org%2Fwebservice%3Dtrue (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7e5216cad0>: Failed to establish a new connection: [Errno -2] Name or service not known',))

Therefore I am unable to fix this outage.

Event Timeline

Bugreporter renamed this task from "EditGroups tool unreachable (HTTP 502 error), unable to restart the service" to "Toolforge returns HTTP 502 error". (Sep 10 2020, 3:25 PM)

None of the 3 mentioned here show a 502 for me.

It looks like the services can be reached again now, at least for me.

Pintoch lowered the priority of this task from Unbreak Now! to High. (Sep 10 2020, 4:23 PM)

Lowering the priority since the outage seems to have disappeared. Judging from the #wikimedia-cloud channel on freenode, this was likely caused by a failure to reach the Kubernetes cluster because of a DNS misconfiguration.
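
For anyone hitting the same traceback: the "[Errno -2] Name or service not known" at the end of it is a name-resolution failure, raised before any connection to the API server is attempted. A minimal sketch of how one could check this directly, assuming Python 3 and only the standard library (the helper name can_resolve is hypothetical and is not part of the webservice tool; the default port 6443 is taken from the traceback):

```python
import socket


def can_resolve(hostname, port=6443):
    """Return True if `hostname` resolves to at least one address.

    Hypothetical helper for illustration only. It performs a name lookup
    and never opens a connection to the host.
    """
    try:
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # gaierror is what a failed lookup raises, e.g.
        # "[Errno -2] Name or service not known".
        return False
    return len(results) > 0
```

Calling can_resolve("k8s.tools.eqiad1.wikimedia.cloud") from the host where webservice is run would presumably have returned False during the outage. A False result points at DNS rather than at the tool or the cluster itself, in which case restarting the webservice cannot help.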

This outage was caused by a problem with DNS cleanup and the naming of the service.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/626399 should prevent it from happening again. To be sure, we also have T262562: [infra] Fix the mis-named k8s service in tools and toolsbeta projects, which will remove the root cause in time.

Sorry for the problem!

Bstorm claimed this task.
Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.