
Toolforge Puppet CA expired
Closed, Resolved, Public

Description

root@tools-sgebastion-11:~# openssl x509 -in /var/lib/puppet/ssl/certs/ca.pem -noout -dates
notBefore=Jun 27 01:36:58 2017 GMT
notAfter=Jun 27 01:36:58 2022 GMT
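For anyone who wants to script this check, here is a minimal sketch that parses the `-dates` output above and reports whether the certificate has expired. The function names (`parse_openssl_dates`, `is_expired`) are made up for illustration and are not part of any Toolforge tooling. It assumes the timestamps are printed in GMT, as they are here, and that month names are in English.

```python
from datetime import datetime, timezone


def parse_openssl_dates(output):
    """Parse `openssl x509 -noout -dates` output into timezone-aware datetimes.

    Hypothetical helper: returns a dict with the keys found in the output
    ("notBefore" and/or "notAfter").
    """
    dates = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or key not in ("notBefore", "notAfter"):
            continue
        stamp = value.strip()
        # Assumption: openssl prints these timestamps in GMT.
        if stamp.endswith(" GMT"):
            stamp = stamp[: -len(" GMT")]
        parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
        dates[key] = parsed.replace(tzinfo=timezone.utc)
    return dates


def is_expired(dates, now):
    """Return True if `now` is at or past the certificate's notAfter time."""
    return now >= dates["notAfter"]
```

Feeding it the two lines above together with any time later than 01:36:58 GMT on 2022-06-27 makes `is_expired` return True. In practice `now` would be `datetime.now(timezone.utc)`.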

So far this doesn't seem to have caused any user-facing impact, but I fear the k8s cluster might see some issues, since we use Puppet certs both for etcd internal communication and for apiserver->etcd traffic.

Event Timeline

taavi triaged this task as Unbreak Now! priority. Jun 27 2022, 1:37 PM
taavi created this task.

This might be fixable with sre.puppet.renew-cert if it works for our use-case

Updated to add: nope, that's for the wrong certs

Mentioned in SAL (#wikimedia-cloud) [2022-06-27T14:50:15Z] <taavi> backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it T311412

This might be the reason why this alert triggered:
https://alerts.wikimedia.org/?q=team%3Dwmcs&q=alertname%3Dtoolschecker%3A%20All%20k8s%20etcd%20nodes%20are%20healthy

summary: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string 'OK' not found on 'http://checker.tools.wmflabs.org:80/etcd/k8s' - 443 bytes in 0.069 second response time
4 hours ago
instance: checker.tools.wmflabs.org
source: icinga
team: wmcs

Going to the checker URL ends up with:

Caught exception: HTTPSConnectionPool(host='tools-k8s-etcd-13.tools.eqiad1.wikimedia.cloud', port=2379): Max retries exceeded with url: /health (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_read_bytes', 'sslv3 alert bad certificate')])")))
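To illustrate how a TLS failure like this turns into the 503 in the alert, here is a rough sketch of a health probe. This is not the actual toolschecker code. `fetch_health` and `check_endpoint` are hypothetical names, and the sketch assumes etcd's `/health` endpoint returns JSON of the form `{"health": "true"}`. The CA bundle and the optional client certificate paths are placeholders the caller would supply.

```python
import json
import ssl
import urllib.request


def fetch_health(url, cafile, certfile=None, keyfile=None, timeout=5):
    """Fetch a health URL over HTTPS, verifying the server against `cafile`.

    Hypothetical helper; certfile/keyfile are only needed if the server
    requires a client certificate.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def check_endpoint(url, fetch):
    """Return (status, body) the way a simple checker page might.

    `fetch` is any callable taking a URL and returning the response body,
    so the TLS details stay separate from the reporting logic.
    """
    try:
        body = fetch(url)
        # Assumption: a healthy etcd answers with {"health": "true"}.
        healthy = json.loads(body).get("health") == "true"
    except Exception as exc:  # TLS errors, timeouts, bad JSON, ...
        return 503, "Caught exception: %s" % exc
    if healthy:
        return 200, "OK"
    return 503, "etcd reported unhealthy: %s" % body
```

With a handshake failure such as the one above, `fetch` raises, `check_endpoint` returns a 503 whose body lacks the string 'OK', and that is what the Icinga check is matching on.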

Mentioned in SAL (#wikimedia-cloud) [2022-06-27T14:58:54Z] <taavi> renew puppet ca cert and certificate for tools-puppetmaster-02 T311412

Mentioned in SAL (#wikimedia-cloud) [2022-06-27T17:15:37Z] <taavi> T311412 updating ca used by k8s-apiserver->etcd communication, breakage may happen

This might be the reason why this alert triggered:

Indeed, that was it.

Christianjade1 changed the task status from Resolved to Invalid. Jun 29 2022, 3:35 AM
JJMC89 changed the task status from Invalid to Resolved. Jun 29 2022, 3:38 AM