Implement Prometheus exporter for Ganeti capacity data
Closed, Resolved (Public)

Description

Resource usage from Ganeti needs to be collected to do capacity management and alerting.
Data must be collected by Prometheus, to allow visualization in Grafana and alerting via AlertManager.

Event Timeline

SLyngshede-WMF changed the task status from Open to In Progress. Jun 24 2022, 8:41 AM
SLyngshede-WMF triaged this task as Low priority.

Change 804276 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/debs/prometheus-ganeti-exporter@master] Ganeti Prometheus exporter, initial checkin

https://gerrit.wikimedia.org/r/804276

Change 804276 merged by Slyngshede:

[operations/debs/prometheus-ganeti-exporter@master] Ganeti Prometheus exporter, initial checkin

https://gerrit.wikimedia.org/r/804276

Change 809160 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] C:ganeti enable ganeti prometheus exporter.

https://gerrit.wikimedia.org/r/809160

Change 809160 merged by Slyngshede:

[operations/puppet@production] C:ganeti enable ganeti prometheus exporter.

https://gerrit.wikimedia.org/r/809160

Change 809533 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] profile::prometheus::ops add ganeti cluster targets

https://gerrit.wikimedia.org/r/809533

Change 809533 merged by Slyngshede:

[operations/puppet@production] profile::prometheus::ops add ganeti cluster targets

https://gerrit.wikimedia.org/r/809533

Change 819061 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/debs/prometheus-ganeti-exporter@master] Bump version number to 0.2

https://gerrit.wikimedia.org/r/819061

Change 819061 merged by Slyngshede:

[operations/debs/prometheus-ganeti-exporter@master] Bump version number to 0.2

https://gerrit.wikimedia.org/r/819061

Ganeti exporter has been unavailable since 20:17:30:
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1

While debugging, I saw some issues:

  • Certificate for ganeti01.svc.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now: a warning that occurs very frequently and dominates the log entries (this is an old issue, unrelated to the current unavailability).
  • There seems to be a fatal error (an exception), yet the prometheus service appears to be up. I'm not sure this is an issue, since the exception may only happen on a thread, but if it is indeed a fatal error, the daemon should probably fail loudly, to make monitoring of the monitoring easier.
  • The main issue causing the crash and the monitoring unavailability: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]: Exception happened during processing of request from ('10.64.16.62', 37348)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]: Traceback (most recent call last):
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/socketserver.py", line 650, in process_request_thread
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.finish_request(request, client_address)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/socketserver.py", line 360, in finish_request
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.RequestHandlerClass(request, client_address, self)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/socketserver.py", line 720, in __init__
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.handle()
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/http/server.py", line 426, in handle
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     self.handle_one_request()
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3.7/http/server.py", line 414, in handle_one_request
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     method()
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/exposition.py", line 151, in do_GET
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     output = encoder(registry)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/openmetrics/exposition.py", line 14, in generate_latest
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     for metric in registry.collect():
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/registry.py", line 75, in collect
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     for metric in collector.collect():
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "<decorator-gen-1>", line 2, in collect
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/lib/python3/dist-packages/prometheus_client/context_managers.py", line 66, in wrapped
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     return func(*args, **kwargs)
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/bin/prometheus-ganeti-exporter", line 229, in collect
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     metrics.extend(self.collect_vcpu_allocation(nodes, instances))
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/bin/prometheus-ganeti-exporter", line 194, in collect_vcpu_allocation
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     metrics.append(self.cpu_allocation_per_node(node, instances, primary=False))
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:   File "/usr/bin/prometheus-ganeti-exporter", line 183, in cpu_allocation_per_node
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]:     sum([instance['oper_vcpus'] for instance in allocated]))
Sep 12 20:16:49 ganeti1027 prometheus-ganeti-exporter[10229]: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
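
On the question of why the service still looks up: the "Exception happened during processing of request" line is what Python's socketserver prints from its per-request error handler, so the exception is confined to the thread serving that one scrape. A minimal standard-library sketch of that behaviour follows; the handler, the /boom and /ok paths, and the helper names are all made up for illustration and are not taken from the exporter.

```python
import http.client
import http.server
import threading


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/boom":
            # Stand-in for a collector raising in the middle of a scrape.
            raise TypeError("simulated failure while collecting metrics")
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the access log for successful requests


def fetch(port, path):
    """Return the HTTP status, or None if the server dropped the connection."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    except (http.client.HTTPException, OSError):
        return None
    finally:
        conn.close()


def demo():
    # Port 0 lets the OS pick a free port for the demo server.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        failed = fetch(port, "/boom")  # traceback goes to stderr, no response
        healthy = fetch(port, "/ok")   # the server keeps answering afterwards
    finally:
        server.shutdown()
        server.server_close()
    return failed, healthy
```

If this matches what the exporter does, the process never exits, so a check on the systemd unit alone would not notice the failure; only the failed scrapes show it.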

One of the hosts actually does report having "None" oper_vcpus, rather than 0.

instances[30]
{'disk_usage': 51328, 'oper_vcpus': None, 'nic.uuids': ['8c19a789-81a8-4780-8...56ca80b30f'], 'serial_no': 2, 'ctime': 1663013732.788695, 'hvparams': {'spice_password_file': '', 'nic_type': 'paravirtual', 'use_localtime': False, 'root_path': '/dev/vda1', 'spice_use_tls': False, 'vnc_x509_path': '', 'vnc_bind_address': '', 'vnc_password_file': '', 'cdrom2_image_path': '', ...}, 'oper_state': False, 'disk_template': 'drbd', 'disk.spindles': [None], 'mtime': 1663013732.884941, 'nic.modes': ['bridged'], 'oper_ram': None, 'nic.networks.names': [None], 'pnode': 'ganeti1026.eqiad.wmnet', ...}
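
The failure can be reproduced outside the exporter with a couple of hand-written dictionaries. This is only a sketch: the instance names and the total_vcpus helper are invented for the example, and only the oper_vcpus key and the sum() expression mirror the traceback.

```python
# Hypothetical sample data; only the 'oper_vcpus' key mirrors the API output.
instances = [
    {"name": "vm-a", "oper_vcpus": 2},
    {"name": "vm-b", "oper_vcpus": None},  # a VM that is not running
]


def total_vcpus(allocated):
    # Same shape as the failing expression in cpu_allocation_per_node.
    return sum([instance["oper_vcpus"] for instance in allocated])


def reproduce():
    """Return the error message raised by summing over a None value."""
    try:
        total_vcpus(instances)
    except TypeError as exc:
        return str(exc)
    return None
```

Calling reproduce() returns the same message as the log: sum() adds each element to a running integer total, and int + None is not defined.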

Unlike the other data centers, the eqiad cluster is still running Ganeti 2, which is written in Python 2. That sounds like an underlying type conversion issue which has bubbled up via the API; I'd say we can simply work around it in the exporter.

The problem is this host: dispatch-be1001.eqiad.wmnet, which is configured to be down. It does in fact have no vCPUs allocated.

sudo gnt-instance info dispatch-be1001.eqiad.wmnet
- Instance name: dispatch-be1001.eqiad.wmnet
  UUID: d7e7d86b-4772-4093-b9a6-e67dd992ff1f
  Serial number: 2
  Creation time: 2022-09-12 20:15:32
  Modification time: 2022-09-12 20:15:32
  State: configured to be down, actual state is down
  Nodes:
    - primary: ganeti1026.eqiad.wmnet
      group: A (UUID c5933098-a0d9-4844-a86b-70f160d2de63)
    - secondaries: ganeti1006.eqiad.wmnet (group A, group UUID c5933098-a0d9-4844-a86b-70f160d2de63)

Ah, indeed. That explains it: this was created yesterday by Keith, but isn't installed yet. We'll always have such WIP VMs, so in this case (i.e. the VM is down and reporting None CPUs) let's simply treat that as 0.
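
Treating None as 0 could look roughly like the sketch below. The helper names are hypothetical and this is not code from the exporter; only the oper_vcpus key comes from the API output above.

```python
def vcpus_or_zero(instance):
    """Return the instance's vCPU count, counting a missing value as 0."""
    vcpus = instance.get("oper_vcpus")
    return 0 if vcpus is None else vcpus


def total_vcpus(allocated):
    # Summing the normalised values can no longer hit int + None.
    return sum(vcpus_or_zero(instance) for instance in allocated)
```

For example, total_vcpus([{"oper_vcpus": 2}, {"oper_vcpus": None}]) gives 2. The drawback is that a None coming from a running VM would be silently counted as 0 as well.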

Change 831871 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/debs/prometheus-ganeti-exporter@master] Downed VMs will report None as vCPU allocation.

https://gerrit.wikimedia.org/r/831871

The patch uses the oper_state of the instances, rather than just assuming that None should be 0. It's much the same result, but it feels more correct.
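
The idea, as opposed to the actual change in Gerrit, can be sketched like this. The function name is made up; oper_state and oper_vcpus are the keys shown in the API output above.

```python
def running_vcpus(allocated):
    """Sum vCPUs over instances whose oper_state says they are running."""
    return sum(
        instance["oper_vcpus"]
        for instance in allocated
        if instance["oper_state"]  # False for VMs that are down
    )
```

With this approach a downed VM is skipped before its oper_vcpus is ever read, while a running VM that unexpectedly reports None would still raise, which keeps genuinely odd data visible instead of hiding it as 0.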

Change 831871 merged by Slyngshede:

[operations/debs/prometheus-ganeti-exporter@master] Downed VMs will report None as vCPU allocation.

https://gerrit.wikimedia.org/r/831871