Docker is not running on contint2001
Open, Medium, Public

Description

When rebuilding CI images, I got the following error from contint2001.wikimedia.org:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Icinga shows CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service since 2022-07-15 04:29:44. That is a daily job.

The service is marked disabled:

contint2001:~$ systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: https://docs.docker.com

Event Timeline

hashar triaged this task as Unbreak Now! priority. Fri, Jul 15, 12:23 PM
hashar updated the task description.

The host was rebooted on July 12th:

reboot   system boot  4.19.0-20-amd64  Tue Jul 12 15:50   still running

But somehow the docker service did not start.

Mentioned in SAL (#wikimedia-releng) [2022-07-15T12:30:27Z] <hashar> Starting docker on contint2001.wikimedia.org # T313119

hashar lowered the priority of this task from Unbreak Now! to Medium.

docker.service requires docker.socket, and both are marked as not enabled (i.e. disabled).

On contint2001 I started the service manually and also enabled both units:

systemctl enable docker
Synchronizing state of docker.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable docker
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service.
systemctl enable docker.socket
Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket.
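
To confirm that both units will now start at boot, a check along these lines could be run. This is only a minimal sketch: `units_not_enabled` and the `SYSTEMCTL` override are hypothetical names introduced here (the override exists so the function can be exercised without systemd), while `systemctl is-enabled` is the real subcommand.

```shell
#!/bin/sh
# Hypothetical override: lets the check run against a stub instead of systemd.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Print every given unit whose state is not exactly "enabled".
# Returns 0 when all units are enabled, 1 otherwise.
units_not_enabled() {
    rc=0
    for unit in "$@"; do
        # `is-enabled` prints the state (enabled, disabled, masked, ...) and
        # exits non-zero for disabled units, hence the `|| true`.
        state=$("$SYSTEMCTL" is-enabled "$unit" 2>/dev/null) || true
        if [ "$state" != "enabled" ]; then
            printf '%s: %s\n' "$unit" "${state:-unknown}"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `units_not_enabled docker.service docker.socket` should print nothing and return 0 once both units are enabled.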

Looking at contint1001.wikimedia.org, the symlinks do not exist:

contint1001
$ ls /etc/systemd/system/*/docker.*
ls: cannot access '/etc/systemd/system/*/docker.*': No such file or directory
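
The same symlink check can be wrapped in a small helper for comparing hosts. A sketch, assuming the enablement symlinks live in `*.wants` directories as in the `systemctl enable` output above; `unit_wants_links` is a hypothetical name, and the optional directory argument is there only so the helper can be tried against a scratch directory.

```shell
#!/bin/sh
# List the enablement symlinks for a unit.
# Usage: unit_wants_links UNIT [DIR]   (DIR defaults to /etc/systemd/system)
# Returns 0 if at least one symlink was found, 1 otherwise.
unit_wants_links() {
    unit=$1
    dir=${2:-/etc/systemd/system}
    found=1
    for link in "$dir"/*.wants/"$unit"; do
        # -L is true only for an existing symlink, so an unmatched glob
        # (left as a literal pattern by the shell) is skipped.
        if [ -L "$link" ]; then
            printf '%s\n' "$link"
            found=0
        fi
    done
    return "$found"
}
```

On a host in the state shown above, `unit_wants_links docker.service` would print nothing and return 1.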

Both units therefore show up as disabled there, even though they are currently running (DISABLED upper-cased by me below):

$ systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; DISABLED; vendor preset: enabled)
   Active: active (running) since Mon 2022-06-13 04:33:29 UTC; 1 months 1 days ago
     Docs: https://docs.docker.com
 Main PID: 1653 (dockerd)
    Tasks: 48
   Memory: 7.6G
   CGroup: /system.slice/docker.service
           └─1653 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
hashar@contint1001:~$ systemctl status docker.socket
● docker.socket - Docker Socket for the API
   Loaded: loaded (/lib/systemd/system/docker.socket; DISABLED; vendor preset: enabled)
   Active: active (running) since Thu 2022-05-05 18:46:47 UTC; 2 months 9 days ago
   Listen: /var/run/docker.sock (Stream)
    Tasks: 0 (limit: 4915)
   Memory: 0B
   CGroup: /system.slice/docker.socket

Moritz instructed:

you can/should extend profile::ci::docker with a puppet-managed service for docker, it currently solely relies on OS package startup config (which may or may not be missing)

Change 814157 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: enable docker on machine start

https://gerrit.wikimedia.org/r/814157

Mentioned in SAL (#wikimedia-releng) [2022-07-15T15:46:04Z] <hashar> contint2001: docker-system-prune-dangling.service it failed overnight cause Docker was not running. That should clear Icinga state # T313119