Icinga disk space alert when a Docker container is running on a host
Closed, Resolved, Public

Description

** PROBLEM alert - contint1001/Disk space is CRITICAL **
Notification Type: PROBLEM

Service: Disk space
Host: contint1001
Address: 208.80.154.17
State: CRITICAL

Date/Time: Tue Oct 17 19:47:44 UTC 2017

Additional Info:

DISK CRITICAL - /var/lib/docker/overlay2/35ca40c8e8fcc59fd40848e1a0c40275d7f2db69a5a57323328ae88010578006/merged is not accessible: Permission denied

Happens whenever running a container.

The check comes from the Puppet class base::monitoring::host. It defines a check_disk command which processes most mounted file systems.

From mount:

overlay on /var/lib/docker/overlay2/.../merged type overlay (rw,relatime,lowerdir=...,upperdir=.../diff,workdir=.../work)

df properly skips it

There is a similar issue with the Diamond collector DiskSpace: T177052

Event Timeline

For a running Docker container we have:

overlay on /var/lib/docker/overlay2/.../merged type overlay (rw,relatime,lowerdir=...,upperdir=.../diff,workdir=.../work)
shm on /var/lib/docker/containers/.../shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
nsfs on /run/docker/netns/1b32c1f94a2a type nsfs (rw)

I think part of the issue is that check_disk is being passed -A, which checks all mounts. They are then "manually" filtered out per disk/FS type, so whenever a new FS type or partition appears we have to add a filter for it. That was introduced in 1ed33150637f7b150c3fdc53a60d24612040a28a with the comment:

the -A -i ... part is a gross hack to workaround Varnish partitions that are purposefully at 99%. Better ideas are welcome.
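The denylist behaviour described above can be illustrated with a minimal sketch. It is not how check_disk is implemented; it only parses the three `mount` lines quoted in this task (elisions kept as-is) and shows which mount points survive a given set of excluded types. The helper names `parse_mounts` and `still_checked` are made up for illustration.

```python
import re

# Sample `mount` lines copied from this task; the "..." elisions are kept as-is.
MOUNT_OUTPUT = """\
overlay on /var/lib/docker/overlay2/.../merged type overlay (rw,relatime,lowerdir=...,upperdir=.../diff,workdir=.../work)
shm on /var/lib/docker/containers/.../shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
nsfs on /run/docker/netns/1b32c1f94a2a type nsfs (rw)
"""

# "<source> on <mountpoint> type <fstype> (<options>)"; mount points
# containing spaces are not handled by this simple pattern.
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)$")


def parse_mounts(text):
    """Return (mountpoint, fstype) pairs for every parsable line."""
    mounts = []
    for line in text.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match:
            mounts.append((match.group(2), match.group(3)))
    return mounts


def still_checked(mounts, excluded_types):
    """Mimic a type denylist: keep mounts whose type is not excluded."""
    return [mp for mp, fstype in mounts if fstype not in excluded_types]


mounts = parse_mounts(MOUNT_OUTPUT)
for excluded in ({"tracefs"}, {"tracefs", "overlay"}, {"tracefs", "overlay", "tmpfs"}):
    print(sorted(excluded), "->", still_checked(mounts, excluded))
```

Each added exclusion removes one mount and leaves the next one to trip the check, which is the pattern the manual runs in this task show.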

We can add --exclude-type=overlay, then check_disk chokes on the tmpfs mount:

$ /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs --exclude-type=overlay --verbose
DISK CRITICAL - /var/lib/docker/containers/.../shm is not accessible: Permission denied

And then on the nsfs one:

$ /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs --exclude-type=overlay --exclude-type=tmpfs
DISK CRITICAL - /run/docker/netns/1b32c1f94a2a is not accessible: Permission denied

cd4ee204504e8da0d09cc1e84670271f6b91a9db added support to override the check disk options via hiera. Maybe we can drop the -A -i hack from the default definition and add it solely for the varnish hosts?

This also depends on the storage driver used. So we could use:

/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs --exclude-type=overlay --exclude-type=tmpfs --exclude-type=nsfs --verbose

But then there are other storage drivers aside from overlayfs to consider: https://docs.docker.com/engine/userguide/storagedriver/selectadriver/#docker-ce
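As a rough sketch of the "exclude by type" approach, the command line could be assembled from a list of file system types so that a new type is a one-item change. The flags are only the ones already used in this task; `BASE_ARGS` and `build_check_disk` are hypothetical names, and the type list uses the names reported by `mount` above.

```python
import shlex

# Base options copied from the command quoted in this task.
BASE_ARGS = [
    "/usr/lib/nagios/plugins/check_disk",
    "-w", "6%", "-c", "3%", "-W", "6%", "-K", "3%",
    "-l", "-e", "-A", "-i", "/srv/sd[a-b][1-3]",
]


def build_check_disk(excluded_types):
    """Append one --exclude-type option per file system type."""
    args = list(BASE_ARGS)
    for fstype in excluded_types:
        args.append(f"--exclude-type={fstype}")
    return args


cmd = build_check_disk(["tracefs", "overlay", "tmpfs", "nsfs"])
# shlex.join (Python 3.8+) quotes arguments for safe pasting into a shell.
print(shlex.join(cmd))
```

One caveat with this approach: excluding tmpfs would presumably also stop checking host tmpfs mounts (typically /run and /dev/shm), not just the container ones.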

thcipriani triaged this task as Medium priority. Oct 19 2017, 3:17 PM

I disabled Icinga notifications for check_disk on contint1001 with a link to this ticket.

This comment should remind us to enable them again when the ticket is resolved.

Acknowledgements don't cut it because it's flapping but disabling notifications is easily forgotten and needs manual revert.

Change 393570 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Ignore docker containers in disk space checks

https://gerrit.wikimedia.org/r/393570

Change 393570 merged by Alexandros Kosiaris:
[operations/puppet@production] Ignore docker containers in disk space checks

https://gerrit.wikimedia.org/r/393570

Change 393572 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Ignore all of /var/lib/docker

https://gerrit.wikimedia.org/r/393572

Change 393572 merged by Alexandros Kosiaris:
[operations/puppet@production] Ignore all of /var/lib/docker

https://gerrit.wikimedia.org/r/393572

akosiaris claimed this task.
akosiaris added a subscriber: akosiaris.

I think I've solved this for now with pretty much the same approach as in the kubernetes clusters. I'll resolve; feel free to reopen and propose different/better approaches.

/var/lib/docker sounds good enough for now and I noticed you also exclude /run/docker/netns/*!

I have spawned a container on contint1001:

sudo docker run -it --rm --entrypoint=/bin/bash docker-registry.discovery.wmnet/releng/ci-jessie

And then I guess we can re-enable the Icinga notification for Disk space ( https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=contint1001&service=Disk+space ).

Mentioned in SAL (#wikimedia-operations) [2017-11-27T11:27:24Z] <hashar> contint1001 enable Icinga "Disks space" notification again. It is no more complaing about Docker partitions | ping mutante | T178454

Is https://gerrit.wikimedia.org/r/#/c/393215/ also this ticket? It links to T177052 which seems somewhat related but not about Icinga, just Grafana itself.

Yup, they have the same root cause (Docker ephemeral mounts being monitored). Though this ticket is about Icinga on production while the other ticket is about Graphite metrics :]

Sorry to say, but there is one of these in Icinga again:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lawrencium&service=Disk+space

DISK CRITICAL - /var/lib/docker/overlay2/c1dc9ea10b1d7fc55a9778b0abd9894ccd3eb7520b928bf68a1a626d9304fd16/merged is not accessible: Permission denied

@Dzahn that is on lawrencium. Can you check the content of /etc/nagios/nrpe.d/check_disk_space.cfg there please? On contint1001 that is:

# File generated by puppet. DO NOT edit by hand
command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -W 6% -K 3% -l -e -A -i /var/lib/docker/* -i /run/docker/netns/* --exclude-type=tracefs

If I spawn a container on contint1001 ( sudo docker run -it --rm --entrypoint=/bin/bash docker-registry.discovery.wmnet/releng/ci-jessie ), the check command is all happy:

/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -W 6% -K 3% -l -e -A -i /var/lib/docker/* -i /run/docker/netns/* --exclude-type=tracefs
DISK OK| /dev=0MB;9;9;0;10 /run=952MB;11580;12223;0;12867 /=28258MB;42095;44434;0;46773 /dev/shm=0MB;28950;30558;0;32167 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;28950;30558;0;32167 /srv=262587MB;801691;846229;0;890768
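For reading the performance data after the `|`, here is a small sketch that assumes the usual Nagios plugin layout `label=value;warn;crit;min;max`. The value appears to be used space in MB (for /srv, the warn figure is 90% of max, matching `-w 10%` free), but that reading is an inference from the numbers, not something stated in this task. `parse_perfdata` is a hypothetical helper and does not handle quoted labels.

```python
# Plugin output copied from the contint1001 run in this task.
SAMPLE = (
    "DISK OK| /dev=0MB;9;9;0;10 /run=952MB;11580;12223;0;12867 "
    "/=28258MB;42095;44434;0;46773 /dev/shm=0MB;28950;30558;0;32167 "
    "/run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;28950;30558;0;32167 "
    "/srv=262587MB;801691;846229;0;890768"
)


def parse_perfdata(plugin_output):
    """Split 'STATUS| label=value;warn;crit;min;max ...' into a dict."""
    _, _, perf = plugin_output.partition("|")
    metrics = {}
    for item in perf.split():
        label, _, fields = item.rpartition("=")
        value, warn, crit, minimum, maximum = fields.split(";")
        if value.endswith("MB"):
            value = value[:-2]  # strip the unit suffix
        metrics[label] = {
            "value_mb": int(value),
            "warn": int(warn),
            "crit": int(crit),
            "min": int(minimum),
            "max": int(maximum),
        }
    return metrics


metrics = parse_perfdata(SAMPLE)
for label, m in metrics.items():
    pct = 100.0 * m["value_mb"] / m["max"] if m["max"] else 0.0
    print(f"{label}: {m['value_mb']} MB of {m['max']} MB ({pct:.1f}%)")
```

None of the listed labels is under /var/lib/docker or /run/docker/netns, which is consistent with the two `-i` options filtering the container mounts out.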

Note that contint1001 is on Jessie. If lawrencium is on stretch, maybe the plugin is slightly different.

[lawrencium:~] $ cat /etc/nagios/nrpe.d/check_disk_space.cfg
# File generated by puppet. DO NOT edit by hand
command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs
[lawrencium:~] $ /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs
DISK CRITICAL - /var/lib/docker/overlay2/c1dc9ea10b1d7fc55a9778b0abd9894ccd3eb7520b928bf68a1a626d9304fd16/merged is not accessible: Permission denied

And yes, lawrencium is on stretch.
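To compare the two NRPE commands quoted above option by option, a quick sketch like the following can help. The set of options that take a value (`TAKES_VALUE`) is an assumption limited to the flags that appear in this task, and `options` is a hypothetical helper.

```python
import shlex

# Command lines copied from the two check_disk_space.cfg files in this task.
CONTINT1001 = (
    "/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -W 6% -K 3% -l -e -A "
    "-i /var/lib/docker/* -i /run/docker/netns/* --exclude-type=tracefs"
)
LAWRENCIUM = (
    "/usr/lib/nagios/plugins/check_disk -w 6% -c 3% -W 6% -K 3% -l -e -A "
    '-i "/srv/sd[a-b][1-3]" --exclude-type=tracefs'
)

# Assumption: only these short options consume the following token.
TAKES_VALUE = {"-w", "-c", "-W", "-K", "-i"}


def options(command):
    """Return the command's options as a set of tuples, pairing flags with values."""
    tokens = iter(shlex.split(command)[1:])  # skip the plugin path
    opts = set()
    for token in tokens:
        if token in TAKES_VALUE:
            opts.add((token, next(tokens)))
        else:
            opts.add((token,))
    return opts


only_contint = options(CONTINT1001) - options(LAWRENCIUM)
only_lawrencium = options(LAWRENCIUM) - options(CONTINT1001)
print("only on contint1001:", sorted(only_contint))
print("only on lawrencium:", sorted(only_lawrencium))
```

Besides the thresholds, the difference is the `-i` arguments: lawrencium still carries the default `/srv/sd[a-b][1-3]` pattern and has no Docker paths ignored.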

The difference between our commands is the "-i /var/lib/docker/*" and "-i /run/docker/netns/*" options (plus the -w/-c thresholds), and it works when using those:

[lawrencium:~] $ /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -W 6% -K 3% -l -e -A -i /var/lib/docker/* -i /run/docker/netns/* --exclude-type=tracefs
DISK OK| /dev=0MB;43478;45893;0;48309 /run=898MB;8697;9180;0;9664 /=10963MB;33559;35423;0;37288 /dev/shm=0MB;43488;45904;0;48321 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;43488;45904;0;48321 /srv=271MB;215703;227686;0;239670 /run/user/2075=0MB;8697;9180;0;9664
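A note on what the `-i` patterns match: as far as I can tell, check_disk treats the `-i` argument as an unanchored regular expression rather than a shell glob, so the trailing `/*` means "zero or more slashes". The sketch below uses Python's `re` as a stand-in for the plugin's POSIX regex matching (an approximation that should behave the same for these simple patterns) and assumes the unquoted pattern reaches the plugin without being expanded by the shell. `is_ignored` is a hypothetical helper.

```python
import re

# Patterns passed via -i in the working command from this task.
IGNORE_PATTERNS = ["/var/lib/docker/*", "/run/docker/netns/*"]


def is_ignored(mountpoint, patterns=IGNORE_PATTERNS):
    """True if any pattern matches anywhere in the mount point (unanchored)."""
    return any(re.search(pattern, mountpoint) for pattern in patterns)


MOUNTPOINTS = [
    "/var/lib/docker/overlay2/c1dc9ea10b1d7fc55a9778b0abd9894ccd3eb7520b928bf68a1a626d9304fd16/merged",
    "/run/docker/netns/1b32c1f94a2a",
    "/run",
    "/srv",
    # Hypothetical path, only to show the pattern is a prefix-style regex:
    "/var/lib/docker-data",
]

for mountpoint in MOUNTPOINTS:
    print(mountpoint, "->", "ignored" if is_ignored(mountpoint) else "checked")
```

Under that reading the patterns are slightly broader than they look (a sibling path such as the hypothetical /var/lib/docker-data would also be skipped), which seems harmless here but is worth knowing.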

Change 398172 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: fix check_disk options on lawrencium

https://gerrit.wikimedia.org/r/398172

Change 398172 merged by Dzahn:
[operations/puppet@production] icinga: fix check_disk options on lawrencium

https://gerrit.wikimedia.org/r/398172

@hashar fixed by adding the right check_disk options into Hiera, by host name in this case because it's just using role(test) and we don't want to change it for all hosts using test.

Current Status: OK
(for 0d 0h 1m 50s)
Status Information: DISK OK

A more interesting question would be why lawrencium has docker installed. It's because of T179968 and it's marked as temporary (started on Nov 7). That box should be reclaimed when performance is done with their tests; I am guessing not yet.