
Investigate docker-gc.service failures on GitLab runners
Closed, Resolved · Public

Description

The garbage collection script is intermittently failing with a timeout, generating an alert. The short-term fix would be to handle the timeout and only alert after it has failed multiple times; the long-term fix is to investigate why the timeouts happen.
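A minimal sketch of the short-term fix, assuming a hypothetical wrapper (`run_with_tolerance` and its parameters are not part of docker-gc): swallow occasional timeouts and only fail, and thus alert, after several consecutive ones.

```python
import time

def run_with_tolerance(op, max_failures=3, delay=1.0):
    """Run op(), tolerating up to max_failures consecutive timeouts
    before re-raising (hypothetical helper, not the actual docker-gc code)."""
    failures = 0
    while True:
        try:
            return op()
        except TimeoutError:
            failures += 1
            if failures >= max_failures:
                raise  # only now does the service fail and the alert fire
            time.sleep(delay)
```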

Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: Traceback (most recent call last):
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 449, in _make_request
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     six.raise_from(e, None)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "<string>", line 3, in raise_from
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 444, in _make_request
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     httplib_response = conn.getresponse()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     response.begin()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/usr/lib/python3.9/http/client.py", line 307, in begin
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     version, status, reason = self._read_status()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/usr/lib/python3.9/socket.py", line 704, in readinto
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return self._sock.recv_into(b)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: socket.timeout: timed out
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: During handling of the above exception, another exception occurred:
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: Traceback (most recent call last):
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/requests/adapters.py", line 489, in send
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     resp = conn.urlopen(
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 787, in urlopen
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     retries = retries.increment(
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/util/retry.py", line 550, in increment
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     raise six.reraise(type(error), error, _stacktrace)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/packages/six.py", line 770, in reraise
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     raise value
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 703, in urlopen
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     httplib_response = self._make_request(
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 451, in _make_request
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 340, in _raise_timeout
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     raise ReadTimeoutError(
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: During handling of the above exception, another exception occurred:
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: Traceback (most recent call last):
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/srv/app/./docker-gc.py", line 474, in <module>
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     main()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/srv/app/./docker-gc.py", line 462, in main
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     perform_pruning_round()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/srv/app/./docker-gc.py", line 447, in perform_pruning_round
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     df = resilient_df()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/srv/app/./docker-gc.py", line 271, in resilient_df
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return get_docker_client().df()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/docker/client.py", line 191, in df
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return self.api.df()
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/docker/utils/decorators.py", line 34, in wrapper
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return f(self, *args, **kwargs)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/docker/api/daemon.py", line 22, in df
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return self._result(self._get(url), True)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/docker/utils/decorators.py", line 46, in inner
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return f(self, *args, **kwargs)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/docker/api/client.py", line 237, in _get
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return self.get(url, **self._set_request_timeout(kwargs))
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 600, in get
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     return self.request("GET", url, **kwargs)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 587, in request
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     resp = self.send(prep, **send_kwargs)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 701, in send
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     r = adapter.send(request, **kwargs)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:   File "/opt/lib/python/site-packages/requests/adapters.py", line 578, in send
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]:     raise ReadTimeout(e, request=request)
Nov  1 15:50:39 gitlab-runner1004 docker[4003002]: requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

Details

Other Assignee
dancy
Title | Reference | Author | Source Branch | Dest Branch
Add read timeout retries to resilient_df() | repos/releng/docker-gc!21 | dancy | master-I3559acc8983fc5c19c57e545cb5e84e554eb056a | master
Add --timeout option | repos/releng/docker-gc!12 | dancy | review/dancy/add-timeout-option | master
increase timeout and retry delay | repos/releng/docker-gc!11 | jelto | increase-retry-delay | master

Event Timeline

@LSobanski Is it just on gitlab-runner1004?

No, it fails on different runners, I just used 1004 as an example.

Two thoughts about this one:

docker-gc throws the timeout after 60s, so we could try to raise the timeout here: https://gitlab.wikimedia.org/repos/releng/docker-gc/-/blob/master/docker-gc.py?ref_type=heads#L191
However, I'm not sure that really helps; Docker should answer in well under 60s.

Also, I'm not sure if a retry delay of 1s might be too aggressive here: https://gitlab.wikimedia.org/repos/releng/docker-gc/-/blob/master/docker-gc.py?ref_type=heads#L275. Maybe the code opens too many Docker clients with such an aggressive retry. We could increase the retry delay to maybe 10s or 15s and see if that helps.

> Two thoughts about this one:
>
> docker-gc throws the timeout after 60s, so we could try to raise the timeout here: https://gitlab.wikimedia.org/repos/releng/docker-gc/-/blob/master/docker-gc.py?ref_type=heads#L191
> However I'm not sure if that really helps, docker should answer earlier than 60s.

The df operation does a stat() of every file in every container and volume. If there are a lot of files, this can take a long time. For example, docker system df -v takes 38 seconds to complete on gitlab-runner1004.eqiad.wmnet right now. That's after having run it a few minutes earlier, so most of the relevant filesystem data is cached. Most of the files appear to be from the buildkitd container. I will look into this.

> Also I'm not sure if a retry of 1s might be too aggressive here: https://gitlab.wikimedia.org/repos/releng/docker-gc/-/blob/master/docker-gc.py?ref_type=heads#L275.

> Maybe the code opens too many Docker clients with such an aggressive retry.

The get_docker_client() function never creates more than one docker client object.
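That single-client behavior can be sketched like this; a hypothetical reconstruction of the memoization pattern, where `object()` stands in for the real `docker.DockerClient` the actual function returns:

```python
_client = None

def get_docker_client():
    """Return one shared client, creating it on first use.

    Sketch of the caching pattern only: the real docker-gc function
    returns a docker.DockerClient, not this stand-in object.
    """
    global _client
    if _client is None:
        _client = object()  # stand-in for docker.from_env()
    return _client
```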

> We could increase the retry delay to maybe 10s or 15s and see if that helps.

I don't expect this to make a difference.
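For reference, retrying the df() call itself on read timeouts (the approach MR !21 in the table above eventually took) could look roughly like this; the signature, parameters, and the injected client factory are assumptions for illustration, not the actual implementation:

```python
import time

import requests

def resilient_df(get_client, max_attempts=3, delay=5.0):
    """Call get_client().df(), retrying on read timeouts.

    Hypothetical sketch in the spirit of MR !21; the real
    resilient_df() in docker-gc may differ.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return get_client().df()
        except requests.exceptions.ReadTimeout:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the timeout
            time.sleep(delay)
```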

Change 971502 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] Halve profile::gitlab::runner::buildkitd_gckeepstorage

https://gerrit.wikimedia.org/r/971502

It seems that it's the calculation of running container filesystem usage that is taking so long, and docker-gc doesn't really need that usage figure since it only cleans up volumes and images.

The docker system df -v command doesn't seem to accept an option to limit the usage calculation to certain objects, but the corresponding Docker Engine API SystemDataUsage endpoint does. When I tested this query parameter locally, it worked.

$ curl -sS --unix-socket ~/.docker/run/docker.sock http://localhost/system/df?type=volume | jq .
{
  "LayersSize": 0,
  "Images": null,
  "Containers": null,
  "Volumes": [
    {
      "CreatedAt": "2023-10-26T22:09:59Z",
      "Driver": "local",
      "Labels": {
        "com.docker.volume.anonymous": ""
      },
      "Mountpoint": "/var/lib/docker/volumes/346c624ecd514d57a75cdfed0473f9cdc29768a8952e5c92fb54ff025430557b/_data",
      "Name": "346c624ecd514d57a75cdfed0473f9cdc29768a8952e5c92fb54ff025430557b",
      "Options": null,
      "Scope": "local",
      "UsageData": {
        "RefCount": 0,
        "Size": 4553
      }
    },
    [...]
  ],
  "BuildCache": null
}

However, running this same query on a runner still took a while and returned all types of entries. It seems the Docker version on our runners is not new enough to support this query parameter (we're on Docker Engine API version 1.41 and the parameter was introduced in API version 1.42).

Still, we could refactor docker-gc to make a direct query to get usage info only for volumes and images, and whenever we upgrade Docker we'll start to benefit from omitting container filesystem usage calculations.

> Still, we could refactor docker-gc to make a direct query to get usage info only for volumes and images, and whenever we upgrade Docker we'll start to benefit from omitting container filesystem usage calculations.

FTR, it seems like curl -sS --unix-socket ~/.docker/run/docker.sock 'http://localhost/system/df?type=volume&type=image' is the correct way to get both volume and image usage in a single query.
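A direct query like that curl command can be made from Python with only the standard library. This is a sketch: `UnixHTTPConnection` and `docker_df` are hypothetical names, and the socket path will differ per host (on the runners it is typically /var/run/docker.sock).

```python
import http.client
import json
import socket

class UnixHTTPConnection(http.client.HTTPConnection):
    """http.client connection that speaks HTTP over a Unix domain socket."""

    def __init__(self, socket_path, timeout=90):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.socket_path)

def docker_df(socket_path="/var/run/docker.sock"):
    """Fetch volume and image usage from the Engine API.

    The type filter only takes effect on API >= 1.42, as noted above;
    older daemons ignore it and return all object types.
    """
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/system/df?type=volume&type=image")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```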

Change 971514 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] docker::gc: Add timeout parameter

https://gerrit.wikimedia.org/r/971514

Change 971514 merged by Jelto:

[operations/puppet@production] docker::gc: Add timeout parameter

https://gerrit.wikimedia.org/r/971514

Jelto triaged this task as Medium priority. Nov 6 2023, 8:37 AM

Thanks for adding the timeout parameter to the new 1.2.0 image. I deployed the change. I will monitor the docker-gc service for more failures and alerts.

LSobanski updated Other Assignee, added: dancy.

The job failed again on one of the runners even with increased timeouts:

Nov 06 14:00:39 gitlab-runner1002 docker[1699100]: requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=90)
Nov 06 14:01:05 gitlab-runner1002 systemd[1]: docker-gc.service: Main process exited, code=exited, status=1/FAILURE
Nov 06 14:01:05 gitlab-runner1002 systemd[1]: docker-gc.service: Failed with result 'exit-code'.
Nov 06 14:01:05 gitlab-runner1002 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.

The docker-gc script failed when calling get_docker_client().df().

So we probably need our own implementation to get the volume and image statistics?

Change 971502 merged by Jelto:

[operations/puppet@production] Halve profile::gitlab::runner::buildkitd_gckeepstorage

https://gerrit.wikimedia.org/r/971502

Thanks for the hint, I missed this change in my Gerrit dashboard. I merged it and disk usage decreased significantly on the Trusted Runners: https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview?orgId=1&from=now-3h&to=now&viewPanel=10

I'll keep an eye on the next runs and see if the problem is solved.

After merging and applying the lower buildkitd_gckeepstorage limits, there were no more failed docker-gc runs. From my perspective this is resolved now, so I'm closing the task optimistically.

Feel free to reopen if you see the need to use the higher buildkitd storage limits and if additional refactoring is needed.

Thanks again for digging and providing the patches!

Reopened after discussion in T364521

Just to be clear, the config in that change only applies to the Jenkins CI nodes; it is unrelated to the GitLab runners.

Ah, right. Sorry, I mixed up two different sets of hosts.

Change #1032507 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] docker-gc: Use 1.3.0 image

https://gerrit.wikimedia.org/r/1032507

Change #1032507 merged by Dzahn:

[operations/puppet@production] docker-gc: Use 1.3.0 image

https://gerrit.wikimedia.org/r/1032507

Closing optimistically.

Mentioned in SAL (#wikimedia-operations) [2024-05-21T18:50:00Z] <mutante> gitlab-runners*.wmnet: ran puppet via cumin to deploy update of docker.gc service to use image 1.3.0 (from 1.2.0) - T350478