Nodes taken offline after /var/lib/docker partition fills due to container logging
Closed, Resolved. Public.

Description

Two nodes were taken offline recently due to a full /var/lib/docker.

Agent integration-slave-docker-1040 (experimental docker node)
Oct 3, 2018 2:30:04 PM
Disconnected by SYSTEM : maintenance-disconnect-full-disks build 8087 (/var/lib/docker: 95%)
Agent integration-slave-docker-1043 (experimental docker node)
Oct 3, 2018 2:45:04 PM
Disconnected by SYSTEM : maintenance-disconnect-full-disks build 8090 (/var/lib/docker: 97%)
dduvall created this task. Oct 3 2018, 3:30 PM
dduvall triaged this task as High priority.

It appears that Docker's JSON log files from some long-running Quibble containers are the culprit: on each node, a single Quibble container's json.log has grown to roughly 36 GB. See

root@integration-slave-docker-1040:~# docker ps -s
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES SIZE
a29107522c65 docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.26-1 "/usr/local/bin/quib…" 3 hours ago Up 3 hours kind_pare 461MB (virtual 1.37GB)
d83a3b95c276 docker-registry.wikimedia.org/releng/npm-test:0.3.0 "/run.sh roundtrip" 4 days ago Up 4 days boring_villani 233kB (virtual 403MB)
root@integration-slave-docker-1040:~# tree -s /var/lib/docker/containers
/var/lib/docker/containers
├── [ 4096] a29107522c6591d9a8c23087db5155584f671ac39eb1c9a40c74415dfd378493
│   ├── [36018552832] a29107522c6591d9a8c23087db5155584f671ac39eb1c9a40c74415dfd378493-json.log
│   ├── [ 4096] checkpoints
│   ├── [ 6678] config.v2.json
│   ├── [ 1494] hostconfig.json
│   ├── [ 13] hostname
│   ├── [ 174] hosts
│   ├── [ 268] resolv.conf
│   ├── [ 71] resolv.conf.hash
│   └── [ 40] shm
└── [ 4096] d83a3b95c276789ba9f5a249863b756a4d73b93d967722bc41552c87bfd65bcd
    ├── [ 4096] checkpoints
    ├── [ 6648] config.v2.json
    ├── [ 62566] d83a3b95c276789ba9f5a249863b756a4d73b93d967722bc41552c87bfd65bcd-json.log
    ├── [ 1509] hostconfig.json
    ├── [ 13] hostname
    ├── [ 174] hosts
    ├── [ 268] resolv.conf
    ├── [ 71] resolv.conf.hash
    └── [ 40] shm

6 directories, 14 files
and
root@integration-slave-docker-1043:~# docker ps -s
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES SIZE
e222eb7fdeec docker-registry.wikimedia.org/releng/quibble-stretch-php70:0.0.26-1 "/usr/local/bin/quib…" 3 hours ago Up 3 hours dreamy_mclean 528MB (virtual 1.43GB)
205e29bbeb54 docker-registry.wikimedia.org/releng/npm-test:0.3.0 "/run.sh" 18 hours ago Up 18 hours heuristic_chandrasekhar 233kB (virtual 403MB)
root@integration-slave-docker-1043:~# tree -s /var/lib/docker/containers/
/var/lib/docker/containers/
├── [ 4096] 205e29bbeb54cee8c0aed6a15ab8b53cc0f4e18df2bac6914a1cf34d9440305a
│   ├── [ 63880] 205e29bbeb54cee8c0aed6a15ab8b53cc0f4e18df2bac6914a1cf34d9440305a-json.log
│   ├── [ 4096] checkpoints
│   ├── [ 6321] config.v2.json
│   ├── [ 1453] hostconfig.json
│   ├── [ 13] hostname
│   ├── [ 174] hosts
│   ├── [ 268] resolv.conf
│   ├── [ 71] resolv.conf.hash
│   └── [ 40] shm
└── [ 4096] e222eb7fdeec11e308c541297c6872bd30cad6d2b0f29c12e63e40383a7b9b7f
    ├── [ 4096] checkpoints
    ├── [ 6678] config.v2.json
    ├── [36669485056] e222eb7fdeec11e308c541297c6872bd30cad6d2b0f29c12e63e40383a7b9b7f-json.log
    ├── [ 1488] hostconfig.json
    ├── [ 13] hostname
    ├── [ 174] hosts
    ├── [ 268] resolv.conf
    ├── [ 71] resolv.conf.hash
    └── [ 40] shm

6 directories, 14 files
.
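
For anyone repeating this diagnosis on another node, the manual `tree -s` inspection above can be approximated with a small script. This is only a sketch: the function names `largest_container_logs` and `human` are made up for illustration, and it assumes the default json-file layout seen in the listings (one `<container-id>-json.log` per container directory under /var/lib/docker/containers).

```python
import os


def largest_container_logs(root="/var/lib/docker/containers", top=5):
    """Return up to `top` (size_bytes, path) pairs for *-json.log files, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith("-json.log"):
                path = os.path.join(dirpath, name)
                try:
                    found.append((os.path.getsize(path), path))
                except OSError:
                    # The container may have been removed while we were scanning.
                    continue
    found.sort(reverse=True)
    return found[:top]


def human(size_bytes):
    """Format a byte count in GiB with one decimal place."""
    return f"{size_bytes / 2**30:.1f} GiB"


if __name__ == "__main__":
    # Reading /var/lib/docker normally requires root.
    for size, path in largest_container_logs():
        print(human(size), path)
```

Applied to the figure above, `human(36018552832)` gives "33.5 GiB" for the Quibble container's log on integration-slave-docker-1040.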

I'm not sure why the Quibble job is outputting so much, but regardless, we should probably limit the Docker log size to avoid these issues. I don't actually see a reason to have a large log file at all, given that output is being streamed to Jenkins.
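
The two options mentioned here (capping the log size, or keeping no log file at all) both map onto Docker daemon settings. The sketch below generates candidate daemon.json content for either case. The helper name `daemon_log_config` and the values `10m` and `3` are illustrative assumptions, not taken from this task or from the Puppet change; `log-driver`, `log-opts`, `max-size` and `max-file` are standard Docker daemon options.

```python
import json


def daemon_log_config(max_size="10m", max_file=3, disable=False):
    """Build the logging portion of a Docker daemon.json as a dict.

    disable=True selects the "none" log driver, so no container log is kept.
    Otherwise the json-file driver is kept, but with rotation limits.
    """
    if disable:
        return {"log-driver": "none"}
    return {
        "log-driver": "json-file",
        "log-opts": {
            # daemon.json expects log-opts values as strings.
            "max-size": max_size,
            "max-file": str(max_file),
        },
    }


if __name__ == "__main__":
    # Illustrative limits: at most 3 files of 10 MB each per container.
    print(json.dumps(daemon_log_config(), indent=2))
    # Alternative: keep no container logs at all.
    print(json.dumps(daemon_log_config(disable=True), indent=2))
```

The output would typically go into /etc/docker/daemon.json. Daemon-level logging defaults apply only to containers created after the daemon is restarted; existing containers keep the configuration they were started with.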

Mentioned in SAL (#wikimedia-releng) [2018-10-03T15:46:19Z] <marxarelli> bringing integration-slave-docker-1040/1043 nodes back online after killing long running docker job and freeing up /var/lib/docker space (T206134)

Change 464174 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[operations/puppet@production] ci: Disable Docker container logging

https://gerrit.wikimedia.org/r/464174

Change 464174 merged by Alexandros Kosiaris:
[operations/puppet@production] ci: Disable Docker container logging

https://gerrit.wikimedia.org/r/464174

hashar added a comment. Oct 5 2018, 1:30 PM

Would there be any way to retrieve/inspect the generated log? Maybe some process goes wild when the build is aborted (Jenkins sends a SIGTERM).

dduvall renamed this task from "Nodes taken offline due to full /var/lib/docker partition" to "Nodes taken offline after /var/lib/docker partition fills due to container logging". Oct 5 2018, 9:03 PM

Would there be any way to retrieve/inspect the generated log? Maybe some process goes wild when the build is aborted (Jenkins sends a SIGTERM).

The logs were only accessible while the containers were running, since we pass the --rm option to docker run and the log file is removed together with the container. Now that the patch has merged and the daemons have been restarted, container output logging is completely disabled.

See T198517 (Quibble docker instance running on CI instance for 6 hours) for the work on getting Jenkins jobs to properly ensure containers are stopped/killed.

dduvall closed this task as Resolved. Oct 5 2018, 10:15 PM