
alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up
Closed, ResolvedPublic

Assigned To
Authored By
aborrero
Jan 31 2025, 10:11 AM

Description

There was this alert today:

FIRING: MaintainKubeusersDown: maintain-kubeusers is down - https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/MaintainKubeusersDown  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DMaintainKubeusersDown

image.png (337×533 px, 53 KB)

However the daemon is up and running:

image.png (545×1 px, 148 KB)

And Prometheus metrics show it as continuously UP for the last two weeks.

image.png (990×1 px, 88 KB)

The same thing happens for tools-redis-7, which is reported as down but is actually up.
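The two-week history mentioned above can also be double-checked with a range query instead of eyeballing Grafana. A minimal sketch using the standard Prometheus `/api/v1/query_range` endpoint (the base URL and port below are placeholders, not confirmed values):

```python
import json
import time
import urllib.parse
import urllib.request

def range_samples(base_url: str, promql: str, days: int = 14, step: str = "15m") -> list:
    """Fetch `days` of history via /api/v1/query_range and flatten
    all samples from every returned series into a list of string values."""
    end = time.time()
    params = urllib.parse.urlencode({
        "query": promql,
        "start": end - days * 86400,
        "end": end,
        "step": step,
    })
    with urllib.request.urlopen(base_url + "/api/v1/query_range?" + params) as resp:
        data = json.load(resp)
    return [v for series in data["data"]["result"] for _, v in series["values"]]

def always_up(samples: list) -> bool:
    """Pure helper: true when there is data and every sample equals '1'."""
    return bool(samples) and all(v == "1" for v in samples)

# Hypothetical usage (host/port assumed):
#   always_up(range_samples("http://localhost:9902/tools",
#                           'up{job="k8s-maintain-kubeusers"}'))
```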

Event Timeline

There is some inconsistency in the metrics data: depending on the time window I select in Grafana, I get different values for the same time (check for example 4:00 UTC in the charts below).

These values are from a different metric than the one reported by @aborrero above, but they seem to show a similar issue.

Screenshot 2025-01-31 at 11.22.22.png (1×3 px, 368 KB)

Screenshot 2025-01-31 at 11.22.30.png (1×3 px, 369 KB)

The MaintainKubeusersDown alert is NOT firing if I look at https://prometheus.svc.toolforge.org/tools/alerts?search=maintain but it IS firing if I look at https://alerts.wikimedia.org/?q=team%3Dwmcs&q=alertname%3DMaintainKubeusersDown

I suspect the two Prometheus servers we have for wmcloud are out of sync and reporting different things.

Yes, the two servers are not in sync:

root@tools-prometheus-6:~# promtool query instant http://localhost:9902/tools 'up{job="k8s-maintain-kubeusers"}'
up{instance="k8s.tools.eqiad1.wikimedia.cloud:6443", job="k8s-maintain-kubeusers", pod_label_app="maintain-kubeusers", pod_label_pod_template_hash="6d78c5d7c", pod_name="maintain-kubeusers-6d78c5d7c-v6hf9"} => 1 @[1738321018.58]
root@tools-prometheus-7:~#  promtool query instant http://localhost:9902/tools 'up{job="k8s-maintain-kubeusers"}'
up{instance="k8s.tools.eqiad1.wikimedia.cloud:6443", job="k8s-maintain-kubeusers", pod_label_app="maintain-kubeusers", pod_label_pod_template_hash="6d78c5d7c", pod_name="maintain-kubeusers-6d78c5d7c-v6hf9"} => 0 @[1738321028.234]
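Comparing one metric at a time with promtool gets tedious; the same check can be done programmatically for any query by diffing the two servers' answers. A sketch using the standard Prometheus HTTP API (the hostnames and ports in the usage comment are assumptions taken from the commands above):

```python
import json
import urllib.parse
import urllib.request

def query_instant(base_url: str, promql: str) -> dict:
    """Run an instant query via the standard /api/v1/query endpoint
    and return {sorted-labelset tuple: sample value}."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return {
        tuple(sorted(r["metric"].items())): r["value"][1]
        for r in data["data"]["result"]
    }

def diff_series(a: dict, b: dict) -> list:
    """Pure helper: labelsets whose values differ between the two
    servers, including series present on only one side."""
    return [k for k in set(a) | set(b) if a.get(k) != b.get(k)]

# Hypothetical usage, mirroring the promtool commands above:
#   q = 'up{job="k8s-maintain-kubeusers"}'
#   diff_series(query_instant("http://tools-prometheus-6:9902/tools", q),
#               query_instant("http://tools-prometheus-7:9902/tools", q))
```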

Mentioned in SAL (#wikimedia-cloud) [2025-01-31T11:04:43Z] <dhinus> systemctl restart prometheus@tools on tools-prometheus-7 T385262

Mentioned in SAL (#wikimedia-cloud) [2025-01-31T11:16:55Z] <dhinus> systemctl restart prometheus@cloud on metricsinfra-prometheus-3 T385262

fnegri renamed this task from toolforge: alertmanager reports maintain-kubeusers as down, but it isn't to alertmanager reports maintain-kubeusers and tools-redis-7 as down, but they are up.Jan 31 2025, 11:20 AM
fnegri updated the task description.

The restart fixed the k8s-maintain-kubeusers metric, which is now reporting the same value on both servers:

fnegri@tools-prometheus-6:~$ promtool query instant http://localhost:9902/tools 'up{job="k8s-maintain-kubeusers"}'
up{instance="k8s.tools.eqiad1.wikimedia.cloud:6443", job="k8s-maintain-kubeusers", pod_label_app="maintain-kubeusers", pod_label_pod_template_hash="6d78c5d7c", pod_name="maintain-kubeusers-6d78c5d7c-v6hf9"} => 1 @[1738322501.644]
fnegri@tools-prometheus-7:~$ promtool query instant http://localhost:9902/tools 'up{job="k8s-maintain-kubeusers"}'
up{instance="k8s.tools.eqiad1.wikimedia.cloud:6443", job="k8s-maintain-kubeusers", pod_label_app="maintain-kubeusers", pod_label_pod_template_hash="6d78c5d7c", pod_name="maintain-kubeusers-6d78c5d7c-v6hf9"} => 1 @[1738322503.301]

The tools-redis-7 metric is still out of sync:

fnegri@metricsinfra-prometheus-2:~$ promtool query instant http://localhost:9900 'up{job="node",instance="tools-redis-7"}'
up{instance="tools-redis-7", job="node", project="tools"} => 1 @[1738322586.712]
root@metricsinfra-prometheus-3:~# promtool query instant http://localhost:9900 'up{job="node",instance="tools-redis-7"}'
up{instance="tools-redis-7", job="node", project="tools"} => 0 @[1738322587.915]

Mentioned in SAL (#wikimedia-cloud) [2025-01-31T11:30:25Z] <dhinus> systemctl restart prometheus@cloud on metricsinfra-prometheus-2 T385262

Mentioned in SAL (#wikimedia-cloud) [2025-01-31T11:38:39Z] <dhinus> rebooting VM metricsinfra-prometheus-3 T385262

Even after a VM reboot, metricsinfra-prometheus-3 is still showing the wrong value:

root@metricsinfra-prometheus-3:~# promtool query instant http://localhost:9900 'up{job="node",instance="tools-redis-7"}'
up{instance="tools-redis-7", job="node", project="tools"} => 0 @[1738323554.851]
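When a single server keeps reporting a target as down, the targets API can say why: every active target carries a health field and the last scrape error. A hedged sketch (the `/api/v1/targets` endpoint is standard; the host/port in the usage comment are assumed from the commands above):

```python
import json
import urllib.request

def failing_targets(base_url: str) -> list:
    """Fetch /api/v1/targets and return the unhealthy targets."""
    with urllib.request.urlopen(base_url + "/api/v1/targets") as resp:
        return extract_failures(json.load(resp))

def extract_failures(payload: dict) -> list:
    """Pure helper: (scrapeUrl, lastError) for every active target
    whose health is not 'up'."""
    return [
        (t["scrapeUrl"], t["lastError"])
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]

# Hypothetical usage on the affected server:
#   failing_targets("http://localhost:9900")
```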
fnegri changed the task status from Open to In Progress.Jan 31 2025, 11:42 AM
fnegri claimed this task.

Apparently it's a networking issue, as I cannot ping tools-redis-7 from metricsinfra-prometheus-3:

fnegri@metricsinfra-prometheus-3:~$ ping 172.16.2.46
PING 172.16.2.46 (172.16.2.46) 56(84) bytes of data.
^C
--- 172.16.2.46 ping statistics ---
35 packets transmitted, 0 received, 100% packet loss, time 34654ms

But I can from metricsinfra-prometheus-2:

fnegri@metricsinfra-prometheus-2:~$ ping 172.16.2.46
PING 172.16.2.46 (172.16.2.46) 56(84) bytes of data.
64 bytes from 172.16.2.46: icmp_seq=1 ttl=64 time=2.58 ms
64 bytes from 172.16.2.46: icmp_seq=2 ttl=64 time=0.688 ms
64 bytes from 172.16.2.46: icmp_seq=3 ttl=64 time=0.499 ms
64 bytes from 172.16.2.46: icmp_seq=4 ttl=64 time=0.359 ms
^C
--- 172.16.2.46 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3037ms
rtt min/avg/max/mdev = 0.359/1.032/2.584/0.903 ms

I can now ping the same IP successfully, and the alert is gone.

Possibly related: tools-redis-7 was rebooted today at 13:13 UTC.