
CodeSearch Docker containers not starting or failing after start because of iptables network overlay issues
Closed, Resolved · Public

Description

Visiting https://codesearch.wmflabs.org/search/ or https://codesearch.wmflabs.org/deployed/ or any of the other search pages results in:

Unable to contact hound. If <https://codesearch.wmflabs.org/_health>
says "starting up", please wait a few minutes for the initial indexing
to complete.

If this error continues, please report it in Phabricator
with the following information:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.7/http/client.py", line 1244, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1290, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1239, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.7/http/client.py", line 966, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb8ab575e48>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=6080): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb8ab575e48>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/codesearch/app.py", line 228, in proxy
    params=request.args
  File "/usr/lib/python3/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6080): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb8ab575e48>: Failed to establish a new connection: [Errno 111] Connection refused'))
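
The traceback ends in "Connection refused" for localhost:6080, which suggests that nothing is listening on that port. A minimal sketch of a standalone probe for that condition follows. The function name port_is_open and the 2 second timeout are made up for illustration and are not part of the codesearch app.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError ([Errno 111] on Linux) and timeouts
        # are both subclasses of OSError.
        return False


if __name__ == "__main__":
    # 6080 is the port that appears in the traceback above.
    print("hound reachable:", port_is_open("localhost", 6080))
```

A False result here only says that the backend is not accepting connections. It does not say why, so the service logs are still the place to look next.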

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 24 2020, 5:13 PM
Feb 24 11:17:10 codesearch6 docker[9953]: 2020/02/24 11:17:10 vcs pull error (Extension:GoogleNewsSitemap - https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/GoogleNewsSitemap.git): exit status 128
Feb 24 11:17:11 codesearch6 docker[9953]: 2020/02/24 11:17:11 Failed to git fetch /data/data/vcs-4be823b1698f6627bc553fc32c52626c801e2a0d, see output below
Feb 24 11:17:11 codesearch6 docker[9953]: fatal: unable to access 'https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/LogEntry.git/': Could not resolve host: gerrit-replica.wikimedia.org
Feb 24 11:17:11 codesearch6 docker[9953]: Continuing...
Feb 24 11:17:11 codesearch6 docker[9953]: 2020/02/24 11:17:11 vcs pull error (Extension:LogEntry - https://gerrit-replica.wikimedia.org/r/mediawiki/extensions/LogEntry.git): exit status 128
Feb 24 11:17:12 codesearch6 docker[9953]: 2020/02/24 11:17:12 Failed to git fetch /data/data/vcs-49d699ca65448d223d18aac87dc96e4affcc641f, see output below
Feb 24 11:17:12 codesearch6 docker[9953]: fatal: unable to access 'https://github.com/SemanticMediaWiki/SemanticInterlanguageLinks/': Could not resolve host: github.com

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T17:37:32Z] <bd808> systemctl restart hound-search.service on codesearch6 (T246017)

bd808 added a subscriber: bd808. Feb 24 2020, 5:41 PM

I am now seeing a lot of this in the logs:

Feb 24 17:40:53 codesearch6 systemd[1]: Starting hound-search...
Feb 24 17:40:53 codesearch6 docker[3401]: Error response from daemon: Cannot kill container: hound-search: Container 7d887d0b05a863e838fdf752d4a31c33cc169db8d3548b9e9030030af1ed6c83 is not running
Feb 24 17:40:53 codesearch6 docker[3471]: hound-search
Feb 24 17:40:53 codesearch6 docker[3511]: Using default tag: latest
Feb 24 17:40:54 codesearch6 docker[3511]: latest: Pulling from etsy/hound
Feb 24 17:40:54 codesearch6 docker[3511]: Digest: sha256:154068fb2aa61cce26c7221f7a4295fcca1d0013d5dd38d9624f7eb7875106fa
Feb 24 17:40:54 codesearch6 docker[3511]: Status: Image is up to date for etsy/hound:latest
Feb 24 17:40:54 codesearch6 docker[3511]: docker.io/etsy/hound:latest
Feb 24 17:40:55 codesearch6 systemd[1]: Started hound-search.
Feb 24 17:40:56 codesearch6 docker[4187]: /usr/bin/docker: Error response from daemon: driver failed programming external connectivity on endpoint hound-search (ec4cdd5df9729454c13e26110077b990e56c94a1af6ef9aad578777fcd0c4253):  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 6080 -j DNAT --to-destination 172.17.0.3:6080 ! -i docker0: iptables: No chain/target/match by that name.
Feb 24 17:40:56 codesearch6 docker[4187]:  (exit status 1)).
Feb 24 17:40:56 codesearch6 docker[4187]: time="2020-02-24T17:40:56Z" level=error msg="error waiting for container: context canceled"
Feb 24 17:40:56 codesearch6 systemd[1]: hound-search.service: Main process exited, code=exited, status=125/n/a
Feb 24 17:40:56 codesearch6 systemd[1]: hound-search.service: Failed with result 'exit-code'.
Feb 24 17:40:56 codesearch6 systemd[1]: hound-search.service: Service RestartSec=100ms expired, scheduling restart.

These are the same Docker/iptables errors as in T245920: Codesearch is down?
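
To spot this failure in a longer journal dump, something like the following sketch could pull out the iptables command that Docker failed to run. The regular expression is written against the single log line quoted above, so it is an assumption that other Docker versions format the message the same way. The name find_missing_chain_errors is made up for illustration.

```python
import re

# Matches the Docker error quoted in the journal excerpt above.
# Assumption: the wording "iptables failed: <command>: iptables: No chain/..."
# is stable; this has only been checked against that one log line.
IPTABLES_FAILURE = re.compile(
    r"iptables failed: (?P<command>iptables .*?): "
    r"iptables: No chain/target/match by that name"
)


def find_missing_chain_errors(lines):
    """Yield the failing iptables command from each matching log line."""
    for line in lines:
        match = IPTABLES_FAILURE.search(line)
        if match:
            yield match.group("command")
```

Feeding it the lines of the journal for hound-search.service would yield one command per failed container start, which makes it easy to see which chain (here DOCKER in the nat table) was missing.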

bd808 renamed this task from CodeSearch down; "Connection refused" to CodeSearch down; "Connection refused" (Docker containers not starting). Feb 24 2020, 5:42 PM

Mentioned in SAL (#wikimedia-cloud) [2020-02-24T17:51:09Z] <bd808> systemctl restart docker on codesearch6 (T246017)

bd808 added a comment. Feb 24 2020, 6:01 PM

Restarting the Docker daemon seems to have fixed things. I tried it after finding that iptables -t filter -L -v -n showed no chains named "DOCKER*". My understanding of what is happening goes something like:

  • System boots up
  • ferm applies iptables rules
  • Docker daemon starts up and applies iptables rules
  • hound-search starts up and creates a bunch of Docker containers
  • things are working fine
  • Puppet notices a change in ferm configuration and applies it
  • ferm wipes out the Docker-specific iptables rules because it does not know that they exist
  • everything starts to break in interesting ways
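
The check that exposed the problem (no "DOCKER*" chains in the filter table) can be sketched as a small script. This is a rough illustration, not something that runs on codesearch6: the function names are made up, and the parsing assumes the usual "Chain NAME (...)" header lines printed by iptables -L.

```python
import subprocess
from typing import List


def docker_chains(listing: str) -> List[str]:
    """Return the names of chains starting with "DOCKER" in `iptables -L` output."""
    chains = []
    for line in listing.splitlines():
        parts = line.split()
        # Chain headers look like "Chain DOCKER (1 references)".
        if len(parts) >= 2 and parts[0] == "Chain" and parts[1].startswith("DOCKER"):
            chains.append(parts[1])
    return chains


def list_filter_table() -> str:
    """Run the same command used in the comment above (needs root)."""
    result = subprocess.run(
        ["iptables", "-t", "filter", "-L", "-v", "-n"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

An empty list from docker_chains(list_filter_table()) while the Docker daemon is running would match the state described above, where ferm has flushed the rules that Docker created.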

I need to dig around more to find the earlier analysis and what we did to fix it, but I remember Toolforge's Kubernetes cluster having similar issues in the past because Docker and ferm were fighting over iptables management.

This Gerrit patch for CI by @akosiaris looks like it addresses almost the same issue: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/386166/

@JHedden shared P10501 as the way he worked around this problem when testing Ceph on Kubernetes.

bd808 renamed this task from CodeSearch down; "Connection refused" (Docker containers not starting) to CodeSearch Docker containers not starting or failing after start because of iptables network overlay issues. Feb 24 2020, 6:18 PM

Change 574524 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] codesearch: Prevent ferm from deleting Docker iptables rules

https://gerrit.wikimedia.org/r/574524

Change 574524 merged by Jhedden:
[operations/puppet@production] codesearch: Prevent ferm from deleting Docker iptables rules

https://gerrit.wikimedia.org/r/574524

Mentioned in SAL (#wikimedia-cloud) [2020-03-06T18:03:30Z] <bd808> Forced Puppet run on codesearch6 to apply Puppet fix for T246017

bd808 closed this task as Resolved. Mar 6 2020, 6:06 PM
bd808 claimed this task.

I forced a Puppet run and saw "Notice: /Stage[main]/Codesearch/Ferm::Conf[docker-preserve]/File[/etc/ferm/conf.d/20_docker-preserve]/ensure: defined content as '{md5}b48a374c61bafe0a3088dacef69e526b'" followed by a ferm restart. I then verified that iptables -t filter -L -v -n still shows the Docker rules. https://codesearch.wmflabs.org/search/?q=Docker is working, so I'm going to hope this is fixed. :)