
"FAIL: debmonitor-client" Email Alerts for db2202.codfw.wmnet
Closed, Resolved · Public · Security

Description

We have been receiving email alerts since March 27th for the host db2202.codfw.wmnet. The alerts are related to a failure in the debmonitor-client. The subject of the emails is "FAIL: debmonitor-client", and the body of the email contains the following message:

Systemd timer ran the following command:

    /usr/bin/debmonitor-client

Its return value was 1 and emitted the following output:

INFO:debmonitor:Found 574 installed binary packages
INFO:debmonitor:Found 5 upgradable binary packages (including new dependencies)
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2546)'))': /hosts/db2202.codfw.wmnet/update
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2546)'))': /hosts/db2202.codfw.wmnet/update
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2546)'))': /hosts/db2202.codfw.wmnet/update
ERROR:debmonitor:Failed to execute DebMonitor CLI: HTTPSConnectionPool(host='debmonitor.discovery.wmnet', port=443): Max retries exceeded with url: /hosts/db2202.codfw.wmnet/update (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2546)')))

It appears that there is an SSL certificate expiration issue affecting the debmonitor-client's ability to update. Can someone please investigate this and take the necessary steps to resolve the issue?
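For triaging similar alert emails, the reason token that Python's ssl module prints inside "[SSL: ...]" can be pulled out of the output with a small script. This is only a rough sketch: the helper names ssl_reasons and is_expired_cert_failure are made up for illustration and are not part of debmonitor-client.

```python
import re

# Matches the reason token inside "[SSL: REASON]" as it appears in the
# urllib3 / debmonitor log lines quoted in the alert email.
SSL_REASON_RE = re.compile(r"\[SSL: ([A-Z0-9_]+)\]")


def ssl_reasons(output: str) -> set:
    """Return the distinct SSL reason tokens found in the command output."""
    return set(SSL_REASON_RE.findall(output))


def is_expired_cert_failure(output: str) -> bool:
    """True if any SSL reason in the output reports an expired certificate."""
    return any(reason.endswith("CERTIFICATE_EXPIRED") for reason in ssl_reasons(output))
```

Run against the email body above, this would report the single reason SSLV3_ALERT_CERTIFICATE_EXPIRED, which separates an expired certificate from other TLS failures such as a verification error.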

Event Timeline

Marostegui triaged this task as High priority.

@ABran-WMF please reimage this host ASAP. Puppet has been stopped for a long time (40302 minutes ago), and the cert looks like it has expired.
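To put those two observations into numbers, here is a minimal sketch using only the Python standard library. It converts the reported Puppet downtime into days and computes how far a certificate is from its notAfter date. The notAfter string and the "now" value in the usage note are invented placeholders, not values from db2202, and days_until_expiry is a hypothetical helper name.

```python
import ssl
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until the certificate's notAfter time; negative if expired.

    `not_after` uses the "Mon DD HH:MM:SS YYYY GMT" form, as printed for
    example by `openssl x509 -enddate -noout` after the "notAfter=" prefix.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


# The 40302 minutes quoted in the comment above, expressed as a duration.
puppet_stopped_for = timedelta(minutes=40302)
print(puppet_stopped_for)  # 27 days, 23:42:00
```

As a usage example with placeholder values, days_until_expiry("Jan 15 00:00:00 2030 GMT", datetime(2030, 1, 22, tzinfo=timezone.utc)) gives -7.0, meaning that certificate expired a week before the check.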

taavi changed the visibility from "Custom Policy" to "Public (No Login Required)".
taavi changed the edit policy from "Custom Policy" to "All Users".

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2202.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2202.codfw.wmnet with OS bookworm completed:

  • db2202 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404030732_arnaudb_418792_db2202.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

host has been reimaged