Resource usage from Ganeti needs to be collected to do capacity management and alerting.
Data must be collected by Prometheus, to allow visualization in Grafana and alerting via AlertManager.
Event Timeline
Change 804276 had a related patch set uploaded (by Slyngshede; author: Slyngshede):
[operations/debs/prometheus-ganeti-exporter@master] Ganeti Prometheus exporter, initial checkin
Change 804276 merged by Slyngshede:
[operations/debs/prometheus-ganeti-exporter@master] Ganeti Prometheus exporter, initial checkin
Change 809160 had a related patch set uploaded (by Slyngshede; author: Slyngshede):
[operations/puppet@production] C:ganeti enable ganeti prometheus exporter.
Change 809160 merged by Slyngshede:
[operations/puppet@production] C:ganeti enable ganeti prometheus exporter.
Change 809533 had a related patch set uploaded (by Slyngshede; author: Slyngshede):
[operations/puppet@production] profile::prometheus::ops add ganeti cluster targets
Change 809533 merged by Slyngshede:
[operations/puppet@production] profile::prometheus::ops add ganeti cluster targets
Change 819061 had a related patch set uploaded (by Slyngshede; author: Slyngshede):
[operations/debs/prometheus-ganeti-exporter@master] Bump version number to 0.2
Change 819061 merged by Slyngshede:
[operations/debs/prometheus-ganeti-exporter@master] Bump version number to 0.2
Ganeti exporter has been unavailable since 20:17:30:
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1
While debugging, I saw some issues:
- The warning "Certificate for ganeti01.svc.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now" is logged very frequently and dominates the log entries (this is an old issue, unrelated to the current unavailability).
- There seems to be a fatal error (an exception), yet the prometheus service still appears to be up. This may not be a problem in itself, since the exception might only occur in a request thread. But if it is indeed a fatal error, the daemon should probably fail visibly, which would make monitoring the monitoring easier.
- The main issue, which causes the crash and the monitoring unavailability: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]: Exception happened during processing of request from ('10.64.16.62', 37348)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]: Traceback (most recent call last):
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/socketserver.py", line 650, in process_request_thread
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.finish_request(request, client_address)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/socketserver.py", line 360, in finish_request
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.RequestHandlerClass(request, client_address, self)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/socketserver.py", line 720, in __init__
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.handle()
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/http/server.py", line 426, in handle
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.handle_one_request()
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/http/server.py", line 414, in handle_one_request
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     method()
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     output = encoder(registry)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     for metric in registry.collect():
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/registry.py", line 75, in collect
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     for metric in collector.collect():
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "<decorator-gen-1>", line 2, in collect
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/context_managers.py", line 66, in wrapped
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     return func(*args, **kwargs)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/bin/prometheus-ganeti-exporter", line 229, in collect
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     metrics.extend(self.collect_vcpu_allocation(nodes, instances))
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/bin/prometheus-ganeti-exporter", line 194, in collect_vcpu_allocation
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     metrics.append(self.cpu_allocation_per_node(node, instances, primary=False))
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/bin/prometheus-ganeti-exporter", line 183, in cpu_allocation_per_node
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     sum([instance['oper_vcpus'] for instance in allocated]))
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
One of the hosts does in fact report "None" for oper_vcpus, rather than 0.
instances[30] {'disk_usage': 51328, 'oper_vcpus': None, 'nic.uuids': ['8c19a789-81a8-4780-8...56ca80b30f'], 'serial_no': 2, 'ctime': 1663013732.788695, 'hvparams': {'spice_password_file': '', 'nic_type': 'paravirtual', 'use_localtime': False, 'root_path': '/dev/vda1', 'spice_use_tls': False, 'vnc_x509_path': '', 'vnc_bind_address': '', 'vnc_password_file': '', 'cdrom2_image_path': '', ...}, 'oper_state': False, 'disk_template': 'drbd', 'disk.spindles': [None], 'mtime': 1663013732.884941, 'nic.modes': ['bridged'], 'oper_ram': None, 'nic.networks.names': [None], 'pnode': 'ganeti1026.eqiad.wmnet', ...}
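For reference, the failure can be reproduced in isolation. This is only a minimal sketch: the two instance records and the helper name `total_vcpus_naive` are made up for illustration, and only the `sum([...])` expression mirrors the line shown in the traceback.

```python
# Hypothetical, trimmed-down instance records; only 'oper_vcpus' matters here.
instances = [
    {"name": "running-vm", "oper_vcpus": 2},
    {"name": "not-yet-installed-vm", "oper_vcpus": None},
]


def total_vcpus_naive(allocated):
    # Same shape as the failing expression in the traceback.
    return sum([instance["oper_vcpus"] for instance in allocated])


try:
    total_vcpus_naive(instances)
    error_message = None
except TypeError as exc:
    # sum() starts at the int 0 and cannot add None to it.
    error_message = str(exc)

print(error_message)
```

A single None anywhere in the list is enough to make the whole scrape fail, which matches what the exporter logged.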
Unlike the other data centers, the eqiad cluster is still running Ganeti 2, which is written in Python 2. That sounds like some underlying type conversion issue which has bubbled up via the API. I'd say we can simply work around it in the exporter.
The problem is this host: dispatch-be1001.eqiad.wmnet which is configured to be down. It does in fact have no vCPUs allocated.
sudo gnt-instance info dispatch-be1001.eqiad.wmnet
- Instance name: dispatch-be1001.eqiad.wmnet
  UUID: d7e7d86b-4772-4093-b9a6-e67dd992ff1f
  Serial number: 2
  Creation time: 2022-09-12 20:15:32
  Modification time: 2022-09-12 20:15:32
  State: configured to be down, actual state is down
  Nodes:
    - primary: ganeti1026.eqiad.wmnet
      group: A (UUID c5933098-a0d9-4844-a86b-70f160d2de63)
    - secondaries: ganeti1006.eqiad.wmnet (group A, group UUID c5933098-a0d9-4844-a86b-70f160d2de63)
Ah, indeed. That explains it: this was created yesterday by Keith, but isn't installed yet. We'll always have such WIP VMs, so in this case (i.e. the VM is down and reporting None CPUs) let's simply treat that as 0.
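A rough sketch of what "treat None as 0" could look like, assuming each instance is a dict with an `oper_vcpus` key as in the dump above. The function name `total_vcpus` is hypothetical and this is not taken from the exporter's code.

```python
def total_vcpus(allocated):
    """Sum vCPUs over instances, counting a missing or None value as 0."""
    # 'or 0' maps None to 0, so VMs that are down or not yet installed
    # contribute nothing instead of raising a TypeError.
    return sum(instance.get("oper_vcpus") or 0 for instance in allocated)


# Hypothetical sample data for illustration only.
sample = [
    {"oper_vcpus": 2},
    {"oper_vcpus": None},
    {"oper_vcpus": 4},
]
print(total_vcpus(sample))
```

The trade-off is that any None is silently swallowed, including one reported by an instance that is supposed to be running.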
Change 831871 had a related patch set uploaded (by Slyngshede; author: Slyngshede):
[operations/debs/prometheus-ganeti-exporter@master] Downed VMs will report None as vCPU allocation.
The patch uses the oper_state of the instances, rather than just assuming that None should be 0. The result is much the same, but it feels more correct.
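To illustrate the idea (this is a sketch, not the merged patch itself): filter on `oper_state` first and only sum the vCPUs of instances that are actually running. The field names come from the instance dump earlier in this task; the function name `running_vcpus` and the sample data are made up.

```python
def running_vcpus(allocated):
    """Sum vCPUs over instances whose oper_state says they are running."""
    return sum(
        instance["oper_vcpus"]
        for instance in allocated
        # Down instances (oper_state False) are skipped entirely,
        # so their None oper_vcpus value is never added.
        if instance.get("oper_state")
    )


# Hypothetical sample data for illustration only.
sample = [
    {"oper_state": True, "oper_vcpus": 2},
    {"oper_state": False, "oper_vcpus": None},
    {"oper_state": True, "oper_vcpus": 8},
]
print(running_vcpus(sample))
```

One difference from blindly mapping None to 0: with this approach a running instance that reports None would still raise, which surfaces a genuinely unexpected state instead of hiding it.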
Change 831871 merged by Slyngshede:
[operations/debs/prometheus-ganeti-exporter@master] Downed VMs will report None as vCPU allocation.