acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP
Closed, ResolvedPublic

Description

It seems we have some expired certs around our domain, for example:

https://accounts.wmflabs.org/

Not After
2/5/2021, 10:01:00 AM (Central European Standard Time)

Apparently, in some scenarios acme-chief ignores the SIGHUP signal for a reason that is not yet understood, and therefore does not refresh certificates.

This issue was solved by restarting acme-chief by hand and then running run-puppet-agent on the affected servers (openstack project-proxy VMs).

Event Timeline

dcaro triaged this task as High priority. Feb 5 2021, 9:18 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T09:19:29Z] <dcaro> Some certs around the infra are expired (T273956)

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T10:21:28Z] <dcaro> This was affecting maps and several others, maps and project-proxy have been fixed (T273956)

aborrero lowered the priority of this task from High to Medium. Feb 5 2021, 10:34 AM
aborrero renamed this task from Expired certificates in cloud urls to acme-chief didn't refresh certificates for cloud front proxies. Feb 5 2021, 10:40 AM
aborrero updated the task description.

The certs expired because the acme-chief service on the following acme-chief
hosts was stuck and not reloading its configuration:

  • paws-acme-chief-01.paws.eqiad.wmflabs
  • cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
  • project-proxy-acme-chief-01.project-proxy.eqiad.wmflabs

Restarting the service forced a refresh of the certs, and then running puppet
on the proxies (for each project) distributed the new certs.

There was also an issue with maps.wmflabs.org, which was serving the certificate
for tiles.maps.wmflabs.org because of a hiera misconfiguration:

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/d0d2250d15402129c6f51b4bc1960d42577f032f%5E%21/

Then a puppet run configured nginx properly (the certs were already there).

aborrero renamed this task from acme-chief didn't refresh certificates for cloud front proxies to acme-chief sometimes don't refresh certificates because it ignores SIGHUP. Feb 5 2021, 10:41 AM
aborrero renamed this task from acme-chief sometimes don't refresh certificates because it ignores SIGHUP to acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP.
aborrero added subscribers: Krenair, Vgutierrez.

I think I've seen acme-chief not responding to SIGHUP as expected before in deployment-prep, I worry this could happen in prod too.

We had some occurrences of this issue in prod, but the monitoring allows us to detect it in time, before anything nasty happens.

At this point, we have suggested something like a regular restart for the service via systemd timer. That should be an easy enough fix. It probably only needs to be done weekly or something, depending on how acme-chief works.

You could take the script that Icinga _would_ use but use it yourself without all the Icinga around it.

So take modules/nagios_common/files/check_commands/check_ssl, a Perl script, and run it with the command line we currently use to check LE certs, which is:

check_ssl --warning 7 --critical 3 -H $HOSTADDRESS$ -p 443 --cn $ARG1$

Then you can run a timer every hour with the logic "if this expires soon, then and only then, do a restart".
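To make the "restart only if the cert expires soon" idea concrete, here is a minimal Python sketch. It is not the check_ssl script and not anything acme-chief ships; the function names, the use of systemctl, and treating a failed verification as "needs restart" are all assumptions for illustration. The 3-day default mirrors the --critical 3 threshold from the command line above.

```python
import socket
import ssl
import subprocess
import time

SECONDS_PER_DAY = 86400


def days_left(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a cert's notAfter string."""
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def needs_restart(not_after: str, now: float, critical_days: float = 3) -> bool:
    """True when the cert expires within `critical_days` (or already has)."""
    return days_left(not_after, now) <= critical_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10) -> str:
    """Return the notAfter field of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_and_restart(host: str, unit: str = "acme-chief") -> bool:
    """Hypothetical hourly job: restart `unit` only if the cert is near expiry."""
    try:
        restart = needs_restart(fetch_not_after(host), time.time())
    except ssl.SSLCertVerificationError:
        # Assumption: an already expired (or otherwise invalid) cert
        # is treated the same as "expires soon".
        restart = True
    if restart:
        subprocess.run(["systemctl", "restart", unit], check=True)
    return restart
```

A systemd timer or cron entry could call check_and_restart("accounts.wmflabs.org") once an hour; which hostname to probe for each acme-chief instance is left open here.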

dcaro removed dcaro as the assignee of this task. Aug 10 2021, 5:03 PM

@dcaro I've implemented systemd's watchdog support on acme-chief. This is already running on the production instances and it should prevent acme-chief from hanging indefinitely. You could try enabling it on your instances as well; it should be as easy as updating to acme-chief 0.34 and setting profile::acme_chief::watchdog_sec like we did on https://gerrit.wikimedia.org/r/c/operations/puppet/+/731335

\o/ thanks a lot @Vgutierrez, will try it soon(ish)

@Majavah made this happen everywhere with https://gerrit.wikimedia.org/r/c/operations/puppet/+/759439 by making the default value for the profile::acme_chief::watchdog_sec hiera setting 600.
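For readers unfamiliar with the systemd watchdog mechanism mentioned above, the following is a rough Python sketch of the protocol from the service's side. It is not acme-chief's actual code; the helper names are made up for illustration. It assumes systemd exports NOTIFY_SOCKET and WATCHDOG_USEC to the service (which it does when WatchdogSec= is set on the unit and notifications are allowed), and that the service pings at half the configured interval.

```python
import os
import socket


def watchdog_interval(environ=os.environ):
    """Seconds between pings: half of WATCHDOG_USEC, or None if unset."""
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def sd_notify(message: str, environ=os.environ) -> bool:
    """Send a notification datagram to systemd; False if no socket is set."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    addr = path.encode()
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = b"\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
    return True
```

A service main loop would call sd_notify("WATCHDOG=1") at least once every watchdog_interval() seconds, and only while it is actually making progress. If the pings stop, systemd marks the unit as failed, and the unit's Restart= policy can then bring it back. With watchdog_sec set to 600, the interval computed here would be 300 seconds.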

Toolforge admins got a notice today from Let's Encrypt that the certificates for *.toolforge.org, *.tools.wmflabs.org, mail.tools.wmcloud.org, mail.tools.wmflabs.org, toolforge.org, and tools.wmflabs.org were stale and expiring in 11 days. I checked on tools-acme-chief-01.tools.eqiad.wmflabs and found that the acme-chief service had been running with an uptime of 1 month 11 days (since Thu 2022-02-03). Issuing a service acme-chief restart followed by service acme-chief status showed "Counter({'NEEDS_RENEWAL': 6, 'VALID': 2})" and then the renewals starting to be processed.

Per my earlier investigation in T273956#7786238 I assumed that the watchdog process should be in place here, making this manual restart unnecessary. I eventually figured out that acme-chief 0.34-1 is needed to get the watchdog functionality and that 0.29-1 was installed on tools-acme-chief-01. apt update; apt install acme-chief was used to upgrade the package, and the same was repeated on tools-acme-chief-02.

We probably need to do a similar forced update on the rest of the acme-chief servers used as WMCS infrastructure.

bd808 claimed this task.

Hosts to check for/update to acme-chief 0.34-1 from https://openstack-browser.toolforge.org/puppetclass/role::acme_chief::cloud:

  • cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
  • deployment-acme-chief03.deployment-prep.eqiad1.wikimedia.cloud
  • deployment-acme-chief04.deployment-prep.eqiad1.wikimedia.cloud
  • paws-acme-chief-01.paws.eqiad1.wikimedia.cloud
  • project-proxy-acme-chief-01.project-proxy.eqiad1.wikimedia.cloud
  • tools-acme-chief-01.tools.eqiad1.wikimedia.cloud
  • tools-acme-chief-02.tools.eqiad1.wikimedia.cloud
  • toolsbeta-acme-chief-01.toolsbeta.eqiad1.wikimedia.cloud

The tools and project-proxy instances all needed the update. All other instances were already running the latest version. Hopefully this error will be eliminated now.
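The per-host version check described above could be sketched roughly as follows. This is a deliberately simplified comparison that only handles versions shaped like 0.29-1 or 0.34-1 (dotted numeric upstream part, numeric Debian revision); it is not a full Debian version comparator, and the function names are hypothetical.

```python
MIN_VERSION = "0.34-1"  # first version with watchdog support, per this task


def parse_version(version: str):
    """Split 'X.Y-R' into ((X, Y), R) for tuple comparison."""
    upstream, _, revision = version.partition("-")
    return tuple(int(part) for part in upstream.split(".")), int(revision or 0)


def needs_upgrade(installed: str, minimum: str = MIN_VERSION) -> bool:
    """True when the installed version predates watchdog support."""
    return parse_version(installed) < parse_version(minimum)
```

On a real host the installed version can be read with dpkg-query -W -f='${Version}' acme-chief, and dpkg --compare-versions handles the general case that this sketch does not.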