Page MenuHomePhabricator

SystemdUnitFailed - docker-gc.service - gitlab-runner1004
Closed, ResolvedPublic

Description

Common information

  • alertname: SystemdUnitFailed
  • instance: gitlab-runner1004:9100
  • name: docker-gc.service
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: collaboration-services

Firing alerts


Details

TitleReferenceAuthorSource BranchDest Branch
Add read timeout retries to resilient_df()repos/releng/docker-gc!21dancymaster-I3559acc8983fc5c19c57e545cb5e84e554eb056amaster
Customize query in GitLab

Event Timeline

LSobanski renamed this task from SystemdUnitFailed to SystemdUnitFailed - docker-gc.service - gitlab-runner1004.May 10 2024, 9:57 AM
LSobanski added subscribers: dancy, LSobanski.

We had enough of these recently that it warrants a review. CC @dancy for awareness.

May 09 19:19:43 gitlab-runner1004 systemd[1]: Starting Perform a round of docker image/volume garbage collection...
May 09 19:20:31 gitlab-runner1004 docker[2794776]: [2024-05-09 19:20:31,569] Running df()
May 09 19:22:01 gitlab-runner1004 docker[2794776]: [2024-05-09 19:22:01,418] df() returned 161 images, 4 volumes
May 09 19:22:01 gitlab-runner1004 docker[2794776]: [2024-05-09 19:22:01,419] volumes usage is 4.51 MB, below high water mark of 135 GB.
May 09 19:23:31 gitlab-runner1004 docker[2794776]: Traceback (most recent call last):
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 537, in _make_request
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     response = conn.getresponse()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/connection.py", line 461, in getresponse
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     httplib_response = super().getresponse()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     response.begin()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/usr/lib/python3.9/http/client.py", line 307, in begin
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     version, status, reason = self._read_status()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/usr/lib/python3.9/socket.py", line 704, in readinto
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return self._sock.recv_into(b)
May 09 19:23:31 gitlab-runner1004 docker[2794776]: socket.timeout: timed out
May 09 19:23:31 gitlab-runner1004 docker[2794776]: The above exception was the direct cause of the following exception:
May 09 19:23:31 gitlab-runner1004 docker[2794776]: Traceback (most recent call last):
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/requests/adapters.py", line 486, in send
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     resp = conn.urlopen(
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 845, in urlopen
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     retries = retries.increment(
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/util/retry.py", line 470, in increment
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     raise reraise(type(error), error, _stacktrace)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/util/util.py", line 39, in reraise
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     raise value
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 791, in urlopen
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     response = self._make_request(
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 539, in _make_request
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 371, in _raise_timeout
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     raise ReadTimeoutError(
May 09 19:23:31 gitlab-runner1004 docker[2794776]: urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=90)
May 09 19:23:31 gitlab-runner1004 docker[2794776]: During handling of the above exception, another exception occurred:
May 09 19:23:31 gitlab-runner1004 docker[2794776]: Traceback (most recent call last):
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/srv/app/./docker-gc.py", line 482, in <module>
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     main()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/srv/app/./docker-gc.py", line 470, in main
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     perform_pruning_round()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/srv/app/./docker-gc.py", line 460, in perform_pruning_round
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     maybe_prune_resources(df, type)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/srv/app/./docker-gc.py", line 432, in maybe_prune_resources
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     current_usage = get_resources_space_used(df, type)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/srv/app/./docker-gc.py", line 291, in get_resources_space_used
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return resilient_df()["LayersSize"]
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/srv/app/./docker-gc.py", line 279, in resilient_df
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return get_docker_client().df()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/docker/client.py", line 191, in df
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return self.api.df()
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/docker/utils/decorators.py", line 34, in wrapper
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return f(self, *args, **kwargs)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/docker/api/daemon.py", line 22, in df
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return self._result(self._get(url), True)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/docker/utils/decorators.py", line 46, in inner
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return f(self, *args, **kwargs)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/docker/api/client.py", line 237, in _get
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return self.get(url, **self._set_request_timeout(kwargs))
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 602, in get
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     return self.request("GET", url, **kwargs)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 589, in request
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     resp = self.send(prep, **send_kwargs)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 703, in send
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     r = adapter.send(request, **kwargs)
May 09 19:23:31 gitlab-runner1004 docker[2794776]:   File "/opt/lib/python/site-packages/requests/adapters.py", line 532, in send
May 09 19:23:31 gitlab-runner1004 docker[2794776]:     raise ReadTimeout(e, request=request)
May 09 19:23:31 gitlab-runner1004 docker[2794776]: requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=90)
May 09 19:23:32 gitlab-runner1004 systemd[1]: docker-gc.service: Main process exited, code=exited, status=1/FAILURE
May 09 19:23:32 gitlab-runner1004 systemd[1]: docker-gc.service: Failed with result 'exit-code'.
May 09 19:23:32 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.
May 09 19:28:43 gitlab-runner1004 systemd[1]: Starting Perform a round of docker image/volume garbage collection...
May 09 19:28:44 gitlab-runner1004 docker[2797092]: [2024-05-09 19:28:44,289] Running df()
May 09 19:29:04 gitlab-runner1004 docker[2797092]: [2024-05-09 19:29:04,787] df() returned 161 images, 4 volumes
May 09 19:29:04 gitlab-runner1004 docker[2797092]: [2024-05-09 19:29:04,788] volumes usage is 4.51 MB, below high water mark of 135 GB.
May 09 19:29:21 gitlab-runner1004 docker[2797092]: [2024-05-09 19:29:21,443] images usage is 25.88 GB, below high water mark of 135 GB.
May 09 19:29:21 gitlab-runner1004 systemd[1]: docker-gc.service: Succeeded.
May 09 19:29:21 gitlab-runner1004 systemd[1]: Finished Perform a round of docker image/volume garbage collection.
May 09 19:34:21 gitlab-runner1004 systemd[1]: Starting Perform a round of docker image/volume garbage collection...
May 09 19:34:22 gitlab-runner1004 docker[2803109]: [2024-05-09 19:34:22,332] Running df()
May 09 19:34:39 gitlab-runner1004 docker[2803109]: [2024-05-09 19:34:39,253] df() returned 161 images, 6 volumes
May 09 19:34:39 gitlab-runner1004 docker[2803109]: [2024-05-09 19:34:39,254] volumes usage is 6.08 MB, below high water mark of 135 GB.
May 09 19:34:56 gitlab-runner1004 docker[2803109]: [2024-05-09 19:34:56,001] images usage is 25.88 GB, below high water mark of 135 GB.
May 09 19:34:56 gitlab-runner1004 systemd[1]: docker-gc.service: Succeeded.
May 09 19:34:56 gitlab-runner1004 systemd[1]: Finished Perform a round of docker image/volume garbage collection.
May 09 19:40:05 gitlab-runner1004 systemd[1]: Starting Perform a round of docker image/volume garbage collection...
May 09 19:40:56 gitlab-runner1004 docker[2812070]: [2024-05-09 19:40:56,199] Running df()
May 09 19:41:25 gitlab-runner1004 docker[2812070]: [2024-05-09 19:41:25,729] df() returned 161 images, 6 volumes
May 09 19:41:25 gitlab-runner1004 docker[2812070]: [2024-05-09 19:41:25,730] volumes usage is 6.17 MB, below high water mark of 135 GB.
May 09 19:41:57 gitlab-runner1004 docker[2812070]: [2024-05-09 19:41:57,420] images usage is 25.88 GB, below high water mark of 135 GB.
May 09 19:42:28 gitlab-runner1004 systemd[1]: docker-gc.service: Succeeded.
May 09 19:42:28 gitlab-runner1004 systemd[1]: Finished Perform a round of docker image/volume garbage collection.
May 09 19:47:30 gitlab-runner1004 systemd[1]: Starting Perform a round of docker image/volume garbage collection...
May 09 19:47:31 gitlab-runner1004 docker[2814297]: [2024-05-09 19:47:31,996] Running df()
May 09 19:49:02 gitlab-runner1004 docker[2814297]: Traceback (most recent call last):
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 537, in _make_request
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     response = conn.getresponse()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/connection.py", line 461, in getresponse
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     httplib_response = super().getresponse()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     response.begin()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/usr/lib/python3.9/http/client.py", line 307, in begin
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     version, status, reason = self._read_status()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/usr/lib/python3.9/socket.py", line 704, in readinto
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return self._sock.recv_into(b)
May 09 19:49:02 gitlab-runner1004 docker[2814297]: socket.timeout: timed out
May 09 19:49:02 gitlab-runner1004 docker[2814297]: The above exception was the direct cause of the following exception:
May 09 19:49:02 gitlab-runner1004 docker[2814297]: Traceback (most recent call last):
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/requests/adapters.py", line 486, in send
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     resp = conn.urlopen(
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 845, in urlopen
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     retries = retries.increment(
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/util/retry.py", line 470, in increment
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     raise reraise(type(error), error, _stacktrace)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/util/util.py", line 39, in reraise
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     raise value
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 791, in urlopen
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     response = self._make_request(
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 539, in _make_request
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/urllib3/connectionpool.py", line 371, in _raise_timeout
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     raise ReadTimeoutError(
May 09 19:49:02 gitlab-runner1004 docker[2814297]: urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=90)
May 09 19:49:02 gitlab-runner1004 docker[2814297]: During handling of the above exception, another exception occurred:
May 09 19:49:02 gitlab-runner1004 docker[2814297]: Traceback (most recent call last):
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/srv/app/./docker-gc.py", line 482, in <module>
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     main()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/srv/app/./docker-gc.py", line 470, in main
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     perform_pruning_round()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/srv/app/./docker-gc.py", line 455, in perform_pruning_round
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     df = resilient_df()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/srv/app/./docker-gc.py", line 279, in resilient_df
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return get_docker_client().df()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/docker/client.py", line 191, in df
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return self.api.df()
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/docker/utils/decorators.py", line 34, in wrapper
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return f(self, *args, **kwargs)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/docker/api/daemon.py", line 22, in df
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return self._result(self._get(url), True)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/docker/utils/decorators.py", line 46, in inner
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return f(self, *args, **kwargs)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/docker/api/client.py", line 237, in _get
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return self.get(url, **self._set_request_timeout(kwargs))
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 602, in get
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     return self.request("GET", url, **kwargs)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 589, in request
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     resp = self.send(prep, **send_kwargs)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/requests/sessions.py", line 703, in send
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     r = adapter.send(request, **kwargs)
May 09 19:49:02 gitlab-runner1004 docker[2814297]:   File "/opt/lib/python/site-packages/requests/adapters.py", line 532, in send
May 09 19:49:02 gitlab-runner1004 docker[2814297]:     raise ReadTimeout(e, request=request)
May 09 19:49:02 gitlab-runner1004 docker[2814297]: requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=90)
May 09 19:49:02 gitlab-runner1004 systemd[1]: docker-gc.service: Main process exited, code=exited, status=1/FAILURE
May 09 19:49:02 gitlab-runner1004 systemd[1]: docker-gc.service: Failed with result 'exit-code'.
May 09 19:49:02 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.

The errors are read timeouts while waiting for dockerd to respond to the df request. Looking at prior successful log entries, it looks like this operation normally takes about 20 seconds.

May 09 19:13:44 gitlab-runner1004 docker[2792962]: [2024-05-09 19:13:44,412] Running df()
May 09 19:14:04 gitlab-runner1004 docker[2792962]: [2024-05-09 19:14:04,031] df() returned 161 images, 4 volumes

It looks like there were only two failures. @LSobanski Is there a setting that can be changed to allow more docker-gc.service failures within a particular window before alerting? Here's the recent history of failures on that host:

Apr 17 03:59:54 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.
Apr 25 09:13:58 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.
Apr 30 15:50:55 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.
May 09 19:23:32 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.
May 09 19:49:02 gitlab-runner1004 systemd[1]: Failed to start Perform a round of docker image/volume garbage collection.

I would have liked to seen what was going on with the host at the time of the failures.

Is there a setting that can be changed to allow more docker-gc.service failures within a particular window before alerting?

So far it's all generic "state of systemd" monitoring that creates a ticket for any service in state "failed".

Basically if the output of sudo systemctl list-units --state=failed contains something it will turn into an alert.

So we will have to look at the systemd unit file level if something can be done there to delay the failed state.

Or we would need more custom monitoring just for docker.gc first.

Current unit file of docker.gc service:

[Unit]
Description=Perform a round of docker image/volume garbage collection
Documentation=https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

[Service]
Type=oneshot
User=root
ExecStart=/usr/bin/docker run --rm         --user root         -v /var/run/docker.sock:/var/run/docker.sock         -v docker-resource-monitor:/state         docker-registry.wikimedia.org/repos/releng/docker-gc/docker-gc:1.2.0         --timeout 90         --image-filter 'id=~.*'         --volume-filter 'label:com.gitlab.gitlab-runner.type==cache'         --images 135gb:130gb         --volumes 135gb:130gb         --state-file /state/state.json

The errors are read timeouts

^ maybe the --timeout 90 is something to tune then

Looks like the timeout is already fully puppetized and in Hiera:

hieradata/role/common/gitlab_runner.yaml:profile::gitlab::runner::docker_gc_timeout: 90
hieradata/cloud/eqiad1/devtools/hosts/gitlab-runner-1003.yaml:profile::gitlab::runner::docker_gc_timeout: 90
hieradata/cloud.yaml:profile::gitlab::runner::docker_gc_timeout: 90
modules/profile/manifests/gitlab/runner.pp:    Integer                                     $docker_gc_timeout  = lookup('profile::gitlab::runner::docker_gc_timeout'),

I think I'll modify docker-gc to catch this particular problem and retry a few times before failing.

LSobanski triaged this task as Medium priority.May 13 2024, 3:35 PM
LSobanski moved this task from Incoming to Consultation on the collaboration-services board.

Change #1032507 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] docker-gc: Use 1.3.0 image

https://gerrit.wikimedia.org/r/1032507

Change #1032507 merged by Dzahn:

[operations/puppet@production] docker-gc: Use 1.3.0 image

https://gerrit.wikimedia.org/r/1032507