
Certificate expiration monitoring
Closed, Resolved, Public

Description

In T307382, we only noticed that the etcd tlsproxy certificate in eqiad had expired when paged for conf2005/Etcd replication lag. AFAICT, there was no warning that the certificate was near expiring.

Event Timeline

Change 788435 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: add etcd tlsproxy certificate monitoring

https://gerrit.wikimedia.org/r/788435

Change 788435 merged by Dzahn:

[operations/puppet@production] profile: add etcd tlsproxy certificate monitoring

https://gerrit.wikimedia.org/r/788435

Change 789270 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etcd::tlsproxy: add monitoring for TLS cert expiration

https://gerrit.wikimedia.org/r/789270

Change 789270 abandoned by Dzahn:

[operations/puppet@production] etcd::tlsproxy: add monitoring for TLS cert expiration

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/789176 was already merged instead

https://gerrit.wikimedia.org/r/789270

Monitoring has been added in Icinga and works now:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=etcd+tlsproxy

The only slight issue I see is that we will get 6 alerts at once when the cert gets close to expiry, in 1821 minus 60 days.

But on the other hand, it checks each individual host for other (non-cert, webserver-level) issues and would detect if we forget to add a hostname to the cert.

So I guess we can call it resolved.
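For readers following along, the per-host check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Icinga check that was actually deployed: the function names and the port argument are hypothetical, the 60-day threshold is taken from the comment above, and it assumes the issuing CA is in the local trust store. With the default context, the handshake also fails on an already expired cert or a hostname missing from the cert, which is the "forgot to add a hostname" case.

```python
import socket
import ssl
import time
from typing import Optional

WARN_DAYS = 60  # warning window taken from the comment above


def fetch_not_after(host: str, port: int, timeout: float = 5.0) -> str:
    """Return the notAfter string of the certificate served by host:port.

    The default context verifies the chain and the hostname, so a missing
    SAN entry or an expired certificate raises ssl.SSLCertVerificationError.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def is_expiring(not_after: str, now: Optional[float] = None,
                warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) < warn_days
```

Running `is_expiring(fetch_not_after(host, port))` once per host gives one result per host, which is why all hosts sharing a cert would alert at the same time.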

colewhite claimed this task.

> only slight issue I see is we will get 6 alerts at once when the cert gets close to expiry in 1821 minus 60 days.

> But on the other hand it checks each individual host for having other (non-cert but webserver) issues and would detect if we forget to add a hostname to the cert.

Cert changes do not notify nginx for a reload. After we left for the evening, two of the hosts still served the old certificate until the reload was performed on the secondary hosts the following morning.

Ideally, we'll move to a more unified certificate monitoring approach. I think this arrangement will be ok until we can adopt that unified solution.
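One way to catch the missed-reload case described above would be to compare the certificate each host actually serves against the one that was deployed. The following is only a sketch under assumptions: the helper names are hypothetical, and nothing in this task says such a check exists. It relies on `ssl.get_server_certificate`, which fetches the served PEM without validating it.

```python
import hashlib
import ssl
from typing import Dict, List


def fingerprint(pem: str) -> str:
    """SHA-256 fingerprint of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def served_pem(host: str, port: int) -> str:
    """Fetch the certificate currently served by host:port (no validation)."""
    return ssl.get_server_certificate((host, port))


def stale_hosts(expected_pem: str, served: Dict[str, str]) -> List[str]:
    """Return hosts whose served certificate differs from the expected one.

    `served` maps a hostname to the PEM it currently serves, for example
    collected with served_pem() for each host.
    """
    want = fingerprint(expected_pem)
    return sorted(h for h, pem in served.items() if fingerprint(pem) != want)
```

A non-empty result right after a cert change would point at hosts that still need an nginx reload.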

Good points, especially about the reload notify! Alright, yep.

Change 790656 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:etcd::tlsproxy: add documentation and fix minor lint issues

https://gerrit.wikimedia.org/r/790656

Change 790657 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:etcd::tlsproxy: move to cfssl pki

https://gerrit.wikimedia.org/r/790657

Change 790656 merged by Jbond:

[operations/puppet@production] P:etcd::tlsproxy: add documentation and fix minor lint issues

https://gerrit.wikimedia.org/r/790656