
[openstack] prometheus exporter broken in bookworm
Closed, ResolvedPublic

Description

The first OpenStack server in Eqiad that was reimaged to Bookworm in T345811 is cloudcontrol1007.

prometheus-openstack-exporter is failing to start:

Oct 31 14:54:01 cloudcontrol1007 systemd[1]: Started prometheus-openstack-exporter.service - prometheus openstack exporter.
Oct 31 14:54:01 cloudcontrol1007 sudo[144819]:     root : PWD=/ ; USER=prometheus ; COMMAND=/usr/bin/prometheus-openstack-exporter --web.listen-address=:12345 --os->
Oct 31 14:54:01 cloudcontrol1007 sudo[144819]: pam_unix(sudo:session): session opened for user prometheus(uid=106) by (uid=0)
Oct 31 14:54:01 cloudcontrol1007 sudo[144819]: pam_unix(sudo:session): session closed for user prometheus
Oct 31 14:54:01 cloudcontrol1007 systemd[1]: prometheus-openstack-exporter.service: Main process exited, code=exited, status=1/FAILURE
Oct 31 14:54:01 cloudcontrol1007 systemd[1]: prometheus-openstack-exporter.service: Failed with result 'exit-code'.

We didn't hit this issue when we upgraded the codfw servers to Bookworm because we only run the exporter in the Eqiad cluster: there is no Prometheus instance in codfw (T350010: Deploy 'cloud' Prometheus instance to codfw).

We could still run the exporter in codfw even if there's no Prometheus polling it, just to be able to identify issues such as this one when testing in codfw.

Event Timeline

The systemd unit calls /usr/local/sbin/prometheus-openstack-exporter-wrapper, which in turn calls /usr/bin/prometheus-openstack-exporter. The latter fails with:

/usr/bin/prometheus-openstack-exporter
Traceback (most recent call last):
  File "/usr/bin/prometheus-openstack-exporter", line 31, in <module>
    import urlparse
ModuleNotFoundError: No module named 'urlparse'

I think the exporter version we're supposed to be using is written in Go, so a Python traceback seems very wrong.
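The traceback points at a Python 2 vs. Python 3 mismatch: `ModuleNotFoundError` is a Python 3 exception, but the old script still imports `urlparse`, a Python 2 module whose functions moved to `urllib.parse` in Python 3. A minimal sketch showing the difference in any Python 3 interpreter (the `has_module` helper is hypothetical, just for illustration):

```python
import importlib.util
from urllib.parse import urlparse  # Python 3 home of the old urlparse functions


def has_module(name: str) -> bool:
    """Return True if `name` is importable in the running interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # On Python 3 this reports False: the Python 2 module "urlparse" is gone,
    # which is what the ModuleNotFoundError above says.
    print("urlparse available:", has_module("urlparse"))
    # The equivalent functionality lives in urllib.parse.
    print("urllib.parse available:", has_module("urllib.parse"))
    print(urlparse("http://apt.wikimedia.org/wikimedia").netloc)
```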

A very old version of that package is installed on cloudcontrol1007:

ii  prometheus-openstack-exporter 0.1.4-2.2    all          Prometheus exporter for Openstack

While on cloudcontrol1006:

ii  prometheus-openstack-exporter 1.5.0-1      amd64        openstack exporter for prometheus

Looks like we pull the new version from a local apt.wm.o component:

taavi@cloudcontrol1006 ~ $ apt-cache policy prometheus-openstack-exporter
prometheus-openstack-exporter:
  Installed: 1.5.0-1
  Candidate: 1.5.0-1
  Version table:
 *** 1.5.0-1 500
        500 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/component/prometheus-openstack-exporter amd64 Packages
        100 /var/lib/dpkg/status
     0.1.4-2.2 500
        500 http://mirrors.wikimedia.org/debian bookworm/main amd64 Packages

I guess the package was built and uploaded only to bullseye-wikimedia; we need the same package under bookworm-wikimedia as well.
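To see at a glance which source offers each version, the version table can be parsed mechanically. This is a minimal sketch with a hypothetical helper (`versions_by_suite` is not an existing tool); it assumes the `apt-cache policy` layout shown above, where version lines are `[***] <version> <priority>` and origin lines start with a numeric priority followed by a URL and `suite/component`, or by the local dpkg status path.

```python
def versions_by_suite(policy_output: str) -> dict[str, list[str]]:
    """Map each package version to the sources that offer it.

    Hypothetical helper; assumes the `apt-cache policy <pkg>` layout shown
    above. Returns {version: [suite/component or local path, ...]}.
    """
    result: dict[str, list[str]] = {}
    in_table = False
    current = None
    for line in policy_output.splitlines():
        if line.strip() == "Version table:":
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        tokens = line.split()
        if tokens[0] == "***":  # "***" marks the installed version
            tokens = tokens[1:]
        if tokens[0].isdigit():
            # Origin line: "<priority> <url> <suite/component> <arch> Packages"
            # or "<priority> /var/lib/dpkg/status" for the installed copy.
            source = tokens[2] if len(tokens) > 2 else tokens[1]
            if current is not None:
                result[current].append(source)
        else:
            # Version line: "<version> <priority>"
            current = tokens[0]
            result.setdefault(current, [])
    return result
```

Applied to the output above, it maps 1.5.0-1 to `bullseye-wikimedia/component/prometheus-openstack-exporter` (plus `/var/lib/dpkg/status`) and 0.1.4-2.2 to `bookworm/main`, which matches the diagnosis: no bookworm-wikimedia source offers the newer package.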

Now if only I could find a guide on how to do that :D

Yes, but you need to define the component in the reprepro config file first.

Aha, here's why the reprepro command was failing!

Change 970430 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Add component/prometheus-openstack-exporter to bookworm

https://gerrit.wikimedia.org/r/970430

Change 970430 merged by FNegri:

[operations/puppet@production] Add component/prometheus-openstack-exporter to bookworm

https://gerrit.wikimedia.org/r/970430

This seems to have worked:

root@apt1001:~# reprepro -C component/prometheus-openstack-exporter copy bookworm-wikimedia bullseye-wikimedia prometheus-openstack-exporter
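`reprepro copy` takes the destination codename before the source codename (`copy <destination> <source> <packages>...`), which is easy to get backwards, and as noted above the component must already be defined in the reprepro config. A small sketch with a hypothetical wrapper (`reprepro_copy_argv` is illustrative, not part of any existing tooling) that names each argument so the order is explicit:

```python
import shlex


def reprepro_copy_argv(component: str, dest: str, src: str, packages: list[str]) -> list[str]:
    """Build a `reprepro copy` command line (hypothetical helper).

    reprepro's syntax is `copy <destination-codename> <source-codename> <package>...`,
    so the destination suite comes first.
    """
    if not packages:
        raise ValueError("at least one package name is required")
    return ["reprepro", "-C", component, "copy", dest, src, *packages]


if __name__ == "__main__":
    argv = reprepro_copy_argv(
        component="component/prometheus-openstack-exporter",
        dest="bookworm-wikimedia",
        src="bullseye-wikimedia",
        packages=["prometheus-openstack-exporter"],
    )
    # Print rather than execute, so the command can be reviewed first;
    # passing argv to subprocess.run(argv, check=True) would actually run it.
    print(shlex.join(argv))
```

Printed, this reproduces the command that was run on apt1001.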

root@cloudcontrol1007:~# apt update
root@cloudcontrol1007:~# apt install prometheus-openstack-exporter
[...]
Unpacking prometheus-openstack-exporter (1.5.0-1) over (0.1.4-2.2) ...

Re-ran Puppet on cloudcontrol1007 and it's looking good:

root@cloudcontrol1007:~# systemctl status prometheus-openstack-exporter.service
● prometheus-openstack-exporter.service - prometheus openstack exporter
     Loaded: loaded (/lib/systemd/system/prometheus-openstack-exporter.service; enabled; preset: enabled)
     Active: active (running) since Tue 2023-10-31 19:08:53 UTC; 24s ago
   Main PID: 258776 (prometheus-open)
      Tasks: 8 (limit: 617449)
     Memory: 5.9M
        CPU: 49ms
     CGroup: /system.slice/prometheus-openstack-exporter.service
             ├─258776 /bin/bash /usr/local/sbin/prometheus-openstack-exporter-wrapper --web.listen-address=:12345 --os-client-config=/etc/prometheus-openstack-expor>
             ├─258785 sudo -E -u prometheus /usr/bin/prometheus-openstack-exporter --web.listen-address=:12345 --os-client-config=/etc/prometheus-openstack-exporter>
             └─258786 /usr/bin/prometheus-openstack-exporter --web.listen-address=:12345 --os-client-config=/etc/prometheus-openstack-exporter.yaml --disable-slow-m>

Oct 31 19:08:53 cloudcontrol1007 systemd[1]: Started prometheus-openstack-exporter.service - prometheus openstack exporter.
Oct 31 19:08:53 cloudcontrol1007 sudo[258785]:     root : PWD=/ ; USER=prometheus ; COMMAND=/usr/bin/prometheus-openstack-exporter --web.listen-address=:12345 --os->
Oct 31 19:08:53 cloudcontrol1007 sudo[258785]: pam_unix(sudo:session): session opened for user prometheus(uid=106) by (uid=0)
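Besides `systemctl status`, a quick way to confirm the exporter is actually serving data is to fetch its metrics endpoint on the configured `--web.listen-address=:12345`. A minimal sketch, assuming the default Prometheus `/metrics` path and access from localhost (both assumptions, not taken from this task; `count_samples` and `exporter_healthy` are hypothetical helpers):

```python
import urllib.request


def count_samples(exposition: str) -> int:
    """Count sample lines in Prometheus text exposition format.

    Lines starting with '#' are HELP/TYPE comments; blank lines are skipped.
    """
    count = 0
    for line in exposition.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def exporter_healthy(url: str = "http://localhost:12345/metrics", timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers HTTP 200 with at least one sample."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:  # connection refused, timeouts, and HTTP errors (URLError subclasses OSError)
        return False
    return count_samples(body) > 0


if __name__ == "__main__":
    print("exporter healthy:", exporter_healthy())
```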

fnegri reopened this task as Open.

Reopening because I want to enable the exporter in codfw as well, so we will catch similar issues in the future when testing in codfw.

fnegri changed the task status from Open to In Progress. Oct 31 2023, 7:28 PM
fnegri triaged this task as High priority.

Change 971491 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] P:openstack:codfw1dev enable prom exporter

https://gerrit.wikimedia.org/r/971491

Change 971491 merged by FNegri:

[operations/puppet@production] P:openstack:codfw1dev enable prom exporter

https://gerrit.wikimedia.org/r/971491

Change 971979 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] P:openstack:codfw1dev fix wrong hostname

https://gerrit.wikimedia.org/r/971979

Change 971979 merged by FNegri:

[operations/puppet@production] P:openstack:codfw1dev fix wrong hostname

https://gerrit.wikimedia.org/r/971979

The prometheus exporter is now running in codfw on cloudcontrol2005-dev:

root@cloudcontrol2005-dev:~# systemctl status prometheus-openstack-exporter
● prometheus-openstack-exporter.service - prometheus openstack exporter
     Loaded: loaded (/lib/systemd/system/prometheus-openstack-exporter.service; enabled; preset: enabled)
     Active: active (running) since Mon 2023-11-06 15:53:00 UTC; 2min 9s ago