
Monitor certificate validity for Cloud VPS
Closed, Resolved (Public)

Description

From T282102 and this thread, it appears that there was an outage related to expired TLS certificates that was detected manually. Would it be possible to set up automatic monitoring, so that future certificate expirations get caught automatically and alerts get sent out when this happens again? Apologies if this is already happening, but from the tickets/threads it’s not clear whether or not there’s automatic monitoring in place today.

Event Timeline

The last couple outages for this were caused by a need to restart acme-chief due to a known issue and more or less "wontfix". While a paging-type alert is not a bad idea, a simple systemd-timer that checks if the cert is coming due and restarts acme-chief if it isn't doing its job would shore that up, no?

Then we cut out the need for a person restarting the service.
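A minimal sketch of what such a timer-driven check could look like, in Python. The 14-day margin, the probed host, and the `acme-chief` unit name passed to `systemctl` are all assumptions for illustration, not the actual WMCS setup; the script only shows the "is the cert coming due, and if so restart" decision.

```python
import socket
import ssl
import subprocess
import time

# Assumed safety margin; tune it to how early renewal is expected to happen.
RENEW_THRESHOLD_SECONDS = 14 * 24 * 3600


def seconds_left(not_after, now=None):
    """Seconds until a certificate's notAfter time (negative once expired)."""
    if now is None:
        now = time.time()
    return ssl.cert_time_to_seconds(not_after) - now


def should_restart(not_after, now=None, threshold=RENEW_THRESHOLD_SECONDS):
    """True when the certificate expires sooner than the threshold allows."""
    return seconds_left(not_after, now) < threshold


def fetch_not_after(host, port=443, timeout=10):
    """Return the notAfter string of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def main(host="tools.wmflabs.org", unit="acme-chief"):
    """Entry point a systemd timer could call; host and unit are assumptions."""
    try:
        stale = should_restart(fetch_not_after(host))
    except ssl.SSLCertVerificationError:
        # Verification fails for an already expired cert (and for other
        # problems such as a hostname mismatch); treat it as stale here.
        stale = True
    if stale:
        subprocess.run(["systemctl", "restart", unit], check=True)
```

A timer unit would call `main()` once or twice a day. Restarting blindly on any verification error is a design shortcut; a real deployment would probably want to log the reason as well.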

+1 to monitoring/alerting. Besides the known acme-chief issue, there are other reasons getting a cert from LE could fail, so having an alert in case we have, say, less than a week left would be good.

The last couple outages for this were caused by a need to restart acme-chief due to a known issue and more or less "wontfix". While a paging-type alert is not a bad idea, a simple systemd-timer that checks if the cert is coming due and restarts acme-chief if it isn't doing its job would shore that up, no?

I'm not sure what the actual issue is, but we could just restart acme-chief on a regular basis, right? It should be stateless?

If it helps, feel free to adopt https://certmon.toolforge.org/ which was quickly thrown together in an attempt to help Wikimedia to improve its monitoring. See source code and the metrics endpoint for Prometheus monitoring. Feel free to fork, send pull requests, whatever. Please do tell if you end up using it, I’m quite curious. If it’s useful, my personal preference would be that you’d clone the repo into a better place (perhaps a Phabricator project) and run it yourself, so the Wikimedia SRE team could change things without me getting involved.

@Sascha That's pretty cool! ...and we might even use that in the end.
I did however realize this might be a problem that we already sort of have solved.

# TODO: remove this, now using LE automatic cert renew
# *.wmflabs.org (labs wildcard cert, testing tools.wmflabs.org)
monitoring::service { 'https_wmflabs':
    ensure        => 'absent',
    description   => 'HTTPS-wmflabs',
    check_command => 'check_ssl_http!tools.wmflabs.org',
    host          => 'tools.wmflabs.org',
    notes_url     => 'https://phabricator.wikimedia.org/tag/toolforge/',
}

I think we had too much faith in acme-chief and disabled our monitor. We may be able to simply re-enable that monitor, depending on how much time it allows.

Change 690055 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] toolforge: re-enable toolforge certificate monitor

https://gerrit.wikimedia.org/r/690055

Cool, glad it’s useful! When you set up Prometheus rules, consider alerting when certmon_tls_certificate_expiration_timestamp - time() drops below ~2 weeks or so for a domain; see Prometheus recommendations for timestamps. Then the SRE team would get plenty of advance notice for expiring TLS certificates, allowing problems to be fixed long before they become user-visible outages. (Apologies if I’m stating the obvious here, you’ll know more about this than me).
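To make the arithmetic concrete, here is a rough Python sketch that applies the same "expiration timestamp minus current time" comparison to the text a Prometheus metrics endpoint returns. Only the metric name comes from the comment above; the `domain` label used in the usage example and the simple line parser are assumptions, and a real setup would express this as a Prometheus alerting rule rather than a script.

```python
import re
import time

METRIC = "certmon_tls_certificate_expiration_timestamp"
# Matches 'metric{labels} value' or 'metric value'. It assumes label values
# contain no '}' character, which is enough for a sketch.
SAMPLE_RE = re.compile(r"^" + METRIC + r"(\{[^}]*\})?\s+(\S+)")


def expiring_soon(metrics_text, now=None, threshold_days=14):
    """Return (labels, seconds_remaining) for samples under the threshold."""
    if now is None:
        now = time.time()
    limit = threshold_days * 86400
    hits = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # comments (# HELP, # TYPE) and other metrics
        labels = match.group(1) or ""
        remaining = float(match.group(2)) - now
        if remaining < limit:
            hits.append((labels, remaining))
    return hits
```

For example, `expiring_soon('certmon_tls_certificate_expiration_timestamp{domain="a.example"} 1000000', now=395200)` reports that sample with 604800 seconds (7 days) remaining, since that is below the 14-day threshold.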

Change 690055 merged by Bstorm:

[operations/puppet@production] toolforge: re-enable toolforge certificate monitor

https://gerrit.wikimedia.org/r/690055

Change 692434 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloud-vps: enable the cert monitor for acme-chief

https://gerrit.wikimedia.org/r/692434

Change 692434 merged by Bstorm:

[operations/puppet@production] cloud-vps: enable the cert monitor for acme-chief

https://gerrit.wikimedia.org/r/692434

At this point, the wmcloud.org cert (which is the same cert as for wmflabs.org, iirc) and the toolforge.org cert are monitored at the paging level (if one is nearing expiration, the WMCS team gets paged).

I think that's good enough to close this task and carry over such ideas as automated restarts to the other open tasks about this.

I just realized I haven't submitted one for PAWS yet, and that has an independent acme-chief system as well, I believe.

Change 692448 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] paws: monitor the frontend certs maintained by acme-chief

https://gerrit.wikimedia.org/r/692448

Change 692448 merged by Bstorm:

[operations/puppet@production] paws: monitor the frontend certs maintained by acme-chief

https://gerrit.wikimedia.org/r/692448