acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP
Open, Medium, Public

Description

It seems we have some expired certs around our domain, for example:

https://accounts.wmflabs.org/

Not After
2/5/2021, 10:01:00 AM (Central European Standard Time)

In some scenarios acme-chief appears to ignore the SIGHUP signal, for reasons not yet understood, and therefore does not refresh certificates.

This issue was resolved by restarting acme-chief by hand and then running run-puppet-agent on the affected servers (openstack project-proxy VMs).

Event Timeline

dcaro triaged this task as High priority.Feb 5 2021, 9:18 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T09:19:29Z] <dcaro> Some certs around the infra are expired (T273956)

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T10:21:28Z] <dcaro> This was affecting maps and several others, maps and project-proxy have been fixed (T273956)

What was wrong with acme-chief?

aborrero lowered the priority of this task from High to Medium.Feb 5 2021, 10:34 AM
aborrero renamed this task from Expired certificates in cloud urls to acme-chief didn't refresh certificates for cloud front proxies.Feb 5 2021, 10:40 AM
aborrero updated the task description.

The certs expired because the acme-chief service was stuck and not reloading
its configuration on the following acme-chief hosts:

  • paws-acme-chief-01.paws.eqiad.wmflabs
  • cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
  • project-proxy-acme-chief-01.project-proxy.eqiad.wmflabs

Restarting the service forced a refresh of the certs, and then running puppet
on the proxies (for each project) distributed the new certs.

There was also an issue with maps.wmflabs.org, which was serving the certificate
for tiles.maps.wmflabs.org. That was a hiera misconfiguration:

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/d0d2250d15402129c6f51b4bc1960d42577f032f%5E%21/

Then a puppet run configured nginx properly (the certs were already there).

aborrero renamed this task from acme-chief didn't refresh certificates for cloud front proxies to acme-chief sometimes don't refresh certificates because it ignores SIGHUP.Feb 5 2021, 10:41 AM
aborrero renamed this task from acme-chief sometimes don't refresh certificates because it ignores SIGHUP to acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP.
aborrero added subscribers: Krenair, Vgutierrez.

I think I've seen acme-chief not responding to SIGHUP as expected before in deployment-prep, I worry this could happen in prod too.

We have had some occurrences of this issue in prod, but the monitoring lets us detect it in time, before anything nasty happens.

At this point, we have suggested something like a regular restart for the service via systemd timer. That should be an easy enough fix. It probably only needs to be done weekly or something, depending on how acme-chief works.

You could take the script that Icinga _would_ use but use it yourself without all the Icinga around it.

So take modules/nagios_common/files/check_commands/check_ssl, a Perl script, and run it with the command line we currently use to check LE certs, which is:

check_ssl --warning 7 --critical 3 -H $HOSTADDRESS$ -p 443 --cn $ARG1$

Then you can run a timer every hour: "if this expires soon, then and only then, do a restart".
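
As a rough illustration of that "restart only if the cert expires soon" idea, here is a minimal Python sketch that does not use check_ssl itself. It reads the served certificate's notAfter field with the standard ssl module and restarts a systemd unit when fewer than 7 days remain (mirroring --warning 7 above). The unit name "acme-chief", the function names, and the choice to treat any verification failure as "stale" are assumptions made for this example, not something taken from the acme-chief or puppet code.

```python
import socket
import ssl
import subprocess
import time

WARN_DAYS = 7  # mirrors "--warning 7" from the check_ssl command line


def days_remaining(not_after: str, now: float) -> float:
    """Days until the certificate's notAfter time (negative if already expired)."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_restart(not_after: str, now: float, threshold_days: float = WARN_DAYS) -> bool:
    """True when the certificate expires in less than threshold_days."""
    return days_remaining(not_after, now) < threshold_days


def fetch_not_after(host: str, port: int = 443) -> str:
    """Return the notAfter string of the certificate served for host."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def main(host: str, unit: str = "acme-chief") -> None:
    """Restart the (assumed) unit only if the served certificate is stale."""
    try:
        stale = needs_restart(fetch_not_after(host), time.time())
    except ssl.SSLCertVerificationError:
        # Assumption: a certificate that fails verification (for example
        # because it has already expired) is treated as stale.
        stale = True
    if stale:
        subprocess.run(["systemctl", "restart", unit], check=True)
```

A wrapper run from an hourly systemd timer on the acme-chief host could call main() with the public hostname to check. Note that, as described above, the proxies still need a puppet run to pick up the refreshed certs.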

dcaro removed dcaro as the assignee of this task.Aug 10 2021, 5:03 PM

@dcaro I've implemented systemd's watchdog support in acme-chief. This is already running on the production instances and it should prevent acme-chief from hanging indefinitely. You could try enabling it on your instances as well; it should be as easy as updating to acme-chief 0.34 and setting profile::acme_chief::watchdog_sec like we did in https://gerrit.wikimedia.org/r/c/operations/puppet/+/731335
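
For readers unfamiliar with the mechanism, this is a generic sketch of how a Python daemon can talk to systemd's watchdog; it is not acme-chief's actual code. With WatchdogSec= set on the unit, systemd exports NOTIFY_SOCKET and WATCHDOG_USEC to the service, and restarts it (subject to the unit's Restart= policy) if no "WATCHDOG=1" datagram arrives within the configured interval. The function names are made up for the example.

```python
import os
import socket
from typing import Optional


def sd_notify(message: str) -> bool:
    """Send one notification datagram to systemd; False if no socket is set."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # not running under systemd with notifications enabled
    addr = path.encode()
    if addr.startswith(b"@"):
        addr = b"\0" + addr[1:]  # Linux abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True


def watchdog_interval_seconds() -> Optional[float]:
    """Half of the configured watchdog timeout, or None if the watchdog is off."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2
```

A service would call sd_notify("WATCHDOG=1") from its main loop at roughly watchdog_interval_seconds(), so a loop that hangs stops sending keep-alives and systemd steps in.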

\o/ thanks a lot @Vgutierrez, will try it soon(ish)