node exporter for cloud hosts is missing metrics for /var/lib/docker
Closed, Resolved, Public

Description

As part of T327435 we have been creating some grafana dashboards to show disk space usage on gitlab-runners.

This included using grafana-cloud to build such a dashboard for the gitlab-runners in WMCS.

https://grafana-cloud.wikimedia.org/d/FrErwP0Vk/gitlab-runner-overview?orgId=1&from=now-7d&to=no

While doing so, we noticed that the mount point that interests us most, /var/lib/docker, is missing from the exporter data.

For example, it is shown in the output of df -h:

dzahn@runner-1029:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             12G     0   12G   0% /dev
tmpfs           2.4G  620K  2.4G   1% /run
/dev/sdb1        20G  7.4G   12G  40% /
tmpfs            12G     0   12G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/sda         40G  7.1G   30G  20% /var/lib/docker
/dev/sdb15      124M   11M  114M   9% /boot/efi
tmpfs           2.4G     0  2.4G   0% /run/user/0
tmpfs           2.4G     0  2.4G   0% /run/user/2075

But when asking the exporter for metrics, it is notably missing, while the other mount points are there:

dzahn@runner-1029:~$ curl 172.16.1.135:9100/metrics | grep "node_filesystem_avail_bytes"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sdb1",fstype="ext4",mountpoint="/"} 1.2129357824e+10
node_filesystem_avail_bytes{device="/dev/sdb15",fstype="vfat",mountpoint="/boot/efi"} 1.18601728e+08
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 2.521239552e+09
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/lock"} 5.24288e+06
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/0"} 2.521870336e+09
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run/user/2075"} 2.521870336e+09
100  214k    0  214k    0     0  2211k      0 --:--:-- --:--:-- --:--:-- 2211k

Is this being excluded somewhere and can it be added, please?
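
One way to confirm which mount points the exporter omits is to compare the mountpoint labels in the metrics output against the mount points listed by df. The following is only a minimal Python sketch, not something used in this task: the function name `exported_mountpoints` is made up for illustration, and the inline sample data is a shortened copy of the output shown above.

```python
import re

# Matches a node_filesystem_avail_bytes sample line and captures its
# mountpoint label. Comment lines (# HELP, # TYPE) do not match.
SAMPLE_RE = re.compile(
    r'^node_filesystem_avail_bytes\{[^}]*mountpoint="([^"]*)"[^}]*\}\s'
)


def exported_mountpoints(metrics_text):
    """Return the set of mount points present in the exporter output."""
    found = set()
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if match:
            found.add(match.group(1))
    return found


# Shortened copy of the curl output from the description.
METRICS = '''\
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sdb1",fstype="ext4",mountpoint="/"} 1.2129357824e+10
node_filesystem_avail_bytes{device="/dev/sdb15",fstype="vfat",mountpoint="/boot/efi"} 1.18601728e+08
node_filesystem_avail_bytes{device="tmpfs",fstype="tmpfs",mountpoint="/run"} 2.521239552e+09
'''

# Subset of the mount points reported by df -h on the same host.
DF_MOUNTS = {"/", "/run", "/var/lib/docker", "/boot/efi"}

missing = sorted(DF_MOUNTS - exported_mountpoints(METRICS))
print(missing)  # ['/var/lib/docker']
```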

Event Timeline

So because something spammed the logs in 2018 due to permissions, asking about this in 2023 for a valid use case is declined within minutes?

I closed this task because as far as I can tell that previous task from a few years back was about this exact same issue and nothing in this task suggests that something has changed since this issue was last investigated. If you have something that suggests that something has changed, feel free to add that, but for now I don't see anything actionable here that justifies keeping the task open.

@thcipriani @dancy T327060 was needed for buildkit to make automatic builds work for production, which needed T327435, which needed this ticket, which was declined, as you can see above. This blocks efforts from your sprint week.

> nothing in this task suggests that something has changed since this issue was last investigated.

This ticket was asking to change it, which you declined with the reason that it has not changed.

> I don't see anything actionable here that justifies keeping the task open.

The action would be to stop filtering /var/lib/docker, which is why I created the ticket. It was declined within 5 minutes, without any hint of discussion.

nskaggs subscribed.

I'd like to invite more discussion on this topic. I trust we all would like nice gitlab-runners.

Given the age of the previous decision, and given that, as I understand from the ticket, it was implemented as a workaround in response to behavior exhibited by the prometheus-node-exporter, it's worth exploring whether the issue is still present. If so, we can discuss alternative options to enable this request without causing spam. How might we check whether the previous issue still occurs?

Is there a local puppetmaster in the gitlab-runners project where someone could try removing the ban on /var/lib/docker from https://gerrit.wikimedia.org/r/c/operations/puppet/+/479424/2/modules/prometheus/manifests/node_exporter.pp#35 to see what errors if any come out today?

Yes, there is a local puppetmaster: gitlab-runners-puppetmaster-01.

I just edited line 34 of /etc/puppet/modules/prometheus/manifests/node_exporter.pp to remove /var/lib/docker from $ignored_mount_points.

Thanks for the hint, @bd808!

Metrics for /var/lib/docker are available in devtools project now.

runner-1029:~$ curl 172.16.1.135:9100/metrics | grep "node_filesystem_avail_bytes"
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda",fstype="ext4",mountpoint="/var/lib/docker"} 3.0610907136e+10
...
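
For reading the value above in more familiar units: the gauge is a plain byte count in scientific notation. A minimal Python sketch of the conversion follows; the helper name `bytes_to_gib` is an illustrative assumption, and the input is the sample value from the output above.

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(value):
    """Convert a byte count, as printed by the exporter, to GiB."""
    return float(value) / GIB


avail = bytes_to_gib("3.0610907136e+10")
print(f"{avail:.2f} GiB")  # 28.51 GiB
```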

Also the dashboard for Shared Runners contains /var/lib/docker now:
https://grafana-cloud.wikimedia.org/d/FrErwP0Vk/gitlab-runner-overview?orgId=1&from=now-12h&to=now

(Attached screenshot: 2023-02-07-shared-runners-disk-space.png)

> Metrics for /var/lib/docker are available in devtools project now.

Someone should check back after a few full days of operation, but at the moment fgrep prometheus-node-exporter /var/log/*.log on runner-1029.gitlab-runners.eqiad1.wikimedia.cloud is not showing file system read permission spam as previously seen in T211810: tools-workers: prometheus-node-exporter `Error on statfs() system call for... permission denied`.
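
For the suggested re-check after a few days, the fgrep could be narrowed to just the error in question. This is a rough Python sketch under stated assumptions: the function name `count_statfs_errors` is made up, the sample lines are invented placeholders rather than real log output from runner-1029, and the matched phrases are taken only from the title of T211810, so the real log format may differ.

```python
def count_statfs_errors(lines):
    """Count exporter log lines that report a statfs permission error."""
    count = 0
    for line in lines:
        lowered = line.lower()
        if (
            "prometheus-node-exporter" in lowered
            and "statfs" in lowered
            and "permission denied" in lowered
        ):
            count += 1
    return count


# Hypothetical sample lines, invented for illustration only.
SAMPLE_LINES = [
    "prometheus-node-exporter: Error on statfs() system call for <path>: permission denied",
    "prometheus-node-exporter: some unrelated message",
    "sshd: permission denied for user example",
]

print(count_statfs_errors(SAMPLE_LINES))  # 1
```

On a real host the same function could be fed the lines of each file under /var/log, for example via `open(path, errors="replace")`.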

Wow, well, I expected either it would work OR it would spam again, but not both at the same time. So it gets "permission denied" but still does get the data we want.. is that odd?

> Wow, well, I expected either it would work OR it would spam again, but not both at the same time. So it gets "permission denied" but still does get the data we want.. is that odd?

I missed typing "not" in my comment which unfortunately inverted the intent of the update. :) This has now been corrected.

Oh :) that's good news! Cool, thank you.

Change 888009 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points

https://gerrit.wikimedia.org/r/888009

Change 888009 merged by Jelto:

[operations/puppet@production] prometheus::node_exporter: remove /var/lib/docker from ignored_mount_points

https://gerrit.wikimedia.org/r/888009

Jelto claimed this task.
Jelto triaged this task as Medium priority.

I added prometheus::node_exporter::ignored_mount_points: '^/(sys|proc|dev|var/lib/kubelet)($|/)' in the change above and removed the manual change to modules/prometheus/manifests/node_exporter.pp on the local puppetmaster in the gitlab-runners WMCS project.
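
To illustrate what that pattern does and does not exclude, here is a small Python sketch. It is an assumption that Python's `re` behaves the same as the exporter's own regular expression engine for this pattern (it is simple enough that it should), and the helper name `is_ignored` is made up for illustration.

```python
import re

# The ignored_mount_points value from the hiera change above.
IGNORED = re.compile(r"^/(sys|proc|dev|var/lib/kubelet)($|/)")


def is_ignored(mountpoint):
    """Return True if the mount point matches the ignore pattern."""
    return IGNORED.search(mountpoint) is not None


for mountpoint in ["/", "/dev", "/dev/shm", "/run",
                   "/var/lib/docker", "/var/lib/kubelet/pods"]:
    print(mountpoint, is_ignored(mountpoint))
# / False
# /dev True
# /dev/shm True
# /run False
# /var/lib/docker False
# /var/lib/kubelet/pods True
```

The trailing `($|/)` group means that only the listed directories and paths beneath them match, so a path such as /device would not be ignored.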

I'm closing this task, as metrics are available now (also with the new ignored_mount_points list in hiera).