
full disk on codesearch8
Closed, Resolved (Public)

Description

The MW core index is down; see https://codesearch-backend.wmcloud.org/_health and https://codesearch.wmcloud.org/core/. https://codesearch-backend.wmcloud.org/core/ gives the following information:

Unable to contact hound. If <https://codesearch.wmcloud.org/_health>
says "starting up", please wait a few minutes for the initial indexing
to complete.

If this error continues, please report it in Phabricator
with the following information:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.7/http/client.py", line 1260, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1306, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1255, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.7/http/client.py", line 1030, in _send_output
    self.send(msg)
  File "/usr/lib/python3.7/http/client.py", line 970, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 181, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f115b31f2e8>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=6084): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f115b31f2e8>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/codesearch/app.py", line 241, in proxy
    params=request.args
  File "/usr/lib/python3/dist-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6084): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f115b31f2e8>: Failed to establish a new connection: [Errno 111] Connection refused'))
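The traceback bottoms out in a refused TCP connection to localhost:6084, meaning nothing was listening on that port. A minimal sketch of a direct probe, assuming only the host and port shown in the traceback; the function name `backend_is_listening` is hypothetical and not part of the codesearch code:

```python
import socket


def backend_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on failure; ConnectionRefusedError
        # (Errno 111) and socket timeouts are both OSError subclasses.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `backend_is_listening("localhost", 6084)` on the affected host would tell a stopped backend apart from a problem in the proxy layer.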

I've seen this error for a while now, so it doesn't appear to be transient.

Event Timeline

Legoktm renamed this task from "Core index is down due to connection errors" to "full disk on codesearch8". May 24 2023, 3:09 PM
Legoktm claimed this task.
Legoktm triaged this task as Unbreak Now! priority.
Legoktm subscribed.
legoktm@codesearch8:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sdb         56G   56G     0 100% /srv
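The same check can be scripted, for example as an early warning before the mount reaches 100%. This is a rough sketch using the standard library's `shutil.disk_usage`; the function names and the 90% threshold are assumptions chosen for illustration, and the percentage can differ slightly from the `Use%` column of `df`, which excludes blocks reserved for root.

```python
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Guard against pseudo-filesystems that report zero size.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """Return True if usage is at or above `threshold` percent (assumed value)."""
    return percent_used(path) >= threshold
```

On the host in question this would be called as `is_nearly_full("/srv")`.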

Mentioned in SAL (#wikimedia-cloud) [2023-05-24T15:11:38Z] <legoktm> temporarily taking down to free up disk space (T337263)

hound keeps track of git repositories on disk by their URL. So if we change the URL, as in T336710: Move clients off of gerrit-replica.wikimedia.org back to gerrit.wikimedia.org / rLCSHa63cb327fa8e: switch to main gerrit server instead of using the replica, then it clones all the repositories fresh without deleting the old ones. In graph terms:

[Screenshot, 2023-05-24: Grafana, Cloud VPS project board - WMCS - Cloud VPS projects - Dashboards]
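To illustrate why a URL change leaves stale clones behind, here is a small sketch. It assumes each clone lives in a directory whose name is derived from the repository URL; the `vcs-` prefix and SHA-1 naming in `dir_name_for` are hypothetical stand-ins, not hound's confirmed on-disk scheme, and both function names are made up for this example.

```python
import hashlib


def dir_name_for(url: str) -> str:
    """Map a repository URL to a directory name (hypothetical naming scheme)."""
    return "vcs-" + hashlib.sha1(url.encode("utf-8")).hexdigest()


def orphaned_dirs(existing_dirs, configured_urls):
    """Return directory names that no currently configured URL maps to."""
    wanted = {dir_name_for(url) for url in configured_urls}
    return sorted(d for d in existing_dirs if d not in wanted)
```

Under this assumption, switching every URL from gerrit-replica.wikimedia.org to gerrit.wikimedia.org changes every derived name, so every old directory becomes an orphan while a full set of new clones is created next to it.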

I've deleted the old /srv/hound and am letting it recreate from scratch; this might take an hour or two.

I've also extended the volume to 80G, which wouldn't have been enough in this case, but gives us more headroom in general.

All the hound backends are up now, puppet is re-enabled, and /srv is only 42% full.