Prometheus black-box probes for all puppetmaster hosts are failing
Closed, Resolved, Public

Description

Prometheus black-box probes for all puppetmaster hosts are failing with:

Get "https://<IP>:<port>/puppet/v3": x509: certificate relies on legacy Common Name field, use SANs instead

Example: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1001_eqiad_wmnet_backend_https_ip6).
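For anyone wanting to reproduce the error outside of the probes, here is a minimal Go sketch. It assumes a recent Go toolchain (1.17 or later) and uses a throwaway self-signed certificate; the hostname `puppetmaster.example.org` and the helpers `selfSigned` and `checkHost` are made up for illustration and are not taken from the blackbox exporter. A certificate that only carries the name in its Common Name should fail hostname verification with the message above, while the same name in a SAN should pass.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned builds a throwaway self-signed certificate (hypothetical helper).
// The subject CN is always set; sans may be nil to mimic a CN-only certificate.
func selfSigned(cn string, sans []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     sans, // no SAN extension is emitted when this is empty
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// checkHost returns the hostname verification error, if any, for a
// certificate with the given CN and SANs (hypothetical helper).
func checkHost(cn string, sans []string, host string) error {
	cert, err := selfSigned(cn, sans)
	if err != nil {
		return err
	}
	return cert.VerifyHostname(host)
}

func main() {
	host := "puppetmaster.example.org" // made-up name for illustration
	fmt.Println("CN only: ", checkHost(host, nil, host))
	fmt.Println("with SAN:", checkHost(host, []string{host}, host))
}
```

This only exercises hostname matching, not chain validation, so it says nothing about whether the Puppet CA itself is trusted by the prober.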

This appears to have started 3 days ago, and presents like a recurrence of T373369: Service puppetmaster1001:8141 has failed probes (http_puppetmaster1003_eqiad_wmnet_backend_https_ip4). However, I'm having a hard time sorting out from that task what the resolution was.

Edit: The "3 days ago" reported by the alerts dashboard may be spurious, as I can easily find identical failures in the Network probes logstash dashboard prior to that. In fact, I'm having difficulty finding a time when these were not failing.

Event Timeline

@fgiunchedi - I'm having a hard time sorting out what the outcome w.r.t. these probe failures was from T373369 and / or T326657. Was there a long-term silence that might have recently expired?

IIRC there was. To fix those we'd need to run the blackbox exporter (and IIUC hence Prometheus) with something that avoids "x509: certificate relies on legacy Common Name field, use SANs instead", like GODEBUG="x509ignoreCN=0". I acked those alerts (mistakenly, I wanted to re-add a longer silence), but from the IF point of view we can avoid massive work on the Prometheus side, since the Puppet Masters will hopefully be decommed soonish (~40 hosts to go with Puppet 5).

Indeed, now I have put the pieces back together and I think what happened is the acks expired when Prometheus was down on Mon (T393365) and thus the alerts came back.

Indeed, I'd rather have the acks in place (thank you!) than run blackbox-exporter with GODEBUG="x509ignoreCN=0".

From my POV we can resolve the task. What do you think?

Thank you both for the follow-up! If acks were the solution before and they're now back in place, then by all means let's resolve this :)

fgiunchedi claimed this task.

SGTM!