acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP
Closed, Resolved
Public

Assigned To

Authored By

	dcaro
	Feb 5 2021, 9:18 AM

Description

It seems we have some expired certs across our domains, for example:

https://accounts.wmflabs.org/

Not After
2/5/2021, 10:01:00 AM (Central European Standard Time)

Apparently, in some scenarios acme-chief ignores the SIGHUP signal for an unknown reason and therefore does not refresh certificates.

This issue was solved by restarting acme-chief by hand and then running run-puppet-agent on the affected servers (OpenStack project-proxy VMs).

Related Objects

Status	Subtype	Assigned	Task
Resolved		bd808	T273956 acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP
Resolved		Vgutierrez	T292619 Implement a watchdog mechanism on acme-chief
Resolved		taavi	T298353 problem with let's encrypt cert for star.tools.wmflabs.org
Resolved		Vgutierrez	T293585 [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy
Resolved	BUG REPORT	Urbanecm	T293070 upload.wikimedia.beta.wmflabs.org certificate expired (October 2021)
Resolved		Vgutierrez	T271808 The certificate for upload.beta.wmflabs.org expired on January 12, 2021.
Resolved		Krenair	T267858 The certificate for upload.beta.wmflabs.org expired on November 13, 2020.
Duplicate	BUG REPORT	None	T293251 The certificate for upload.wikimedia.beta.wmflabs.org expired on October 9, 2021.
Resolved		None	T262816 The certificate for en.wikipedia.beta.wmflabs.org expired on 2020-09-14
Resolved	BUG REPORT	Krenair	T257968 Certificate for *.beta.wmflabs.org has expired (July 2020)
Resolved		Vgutierrez	T259338 do not generate metadata for parts that aren't allowed
Duplicate		None	T262806 Beta cluster certificates have expired (September 2020)
Resolved	BUG REPORT	None	T296000 *.beta.wmflabs.org Certificate has expired (November 2021 edition)
Resolved	BUG REPORT	MatthewVernon	T301995 The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022.
Resolved	BUG REPORT	None	T306492 *.beta.wmflabs.org Certificate has expired (April 2022 edition)
Resolved		ori	T310957 *.beta.wmflabs.org Certificate has expired (June 2022 edition)
Resolved	BUG REPORT	AlexisJazz	T337642 upload.wikimedia.beta.wmflabs.org certificate expired (May 2023)

Event Timeline

dcaro triaged this task as High priority.Feb 5 2021, 9:18 AM

dcaro created this task.

Restricted Application added a subscriber: Aklapper. Feb 5 2021, 9:19 AM

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T09:19:29Z] <dcaro> Some certs around the infra are expired (T273956)

Enterprisey subscribed.Feb 5 2021, 9:34 AM

Also with https://wsexport.wmflabs.org/

Balajijagadesh subscribed.Feb 5 2021, 9:40 AM

Marostegui subscribed.Feb 5 2021, 9:45 AM

taavi subscribed.Feb 5 2021, 10:01 AM

Mentioned in SAL (#wikimedia-cloud) [2021-02-05T10:21:28Z] <dcaro> This was affecting maps and several others, maps and project-proxy have been fixed (T273956)

aborrero mentioned this in T273959: cloud: monitor/alert on health of TLS certs used on shared front proxy setup.Feb 5 2021, 10:29 AM

What was wrong with acme-chief?

aborrero lowered the priority of this task from High to Medium.Feb 5 2021, 10:34 AM

aborrero renamed this task from Expired certificates in cloud urls to acme-chief didn't refresh certificates for cloud front proxies.Feb 5 2021, 10:40 AM

aborrero updated the task description. (Show Details)

The certs expired because the acme-chief service on the following
acme-chief hosts:

paws-acme-chief-01.paws.eqiad.wmflabs
cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
project-proxy-acme-chief-01.project-proxy.eqiad.wmflabs

was stuck and not reloading its configuration. Restarting the service forced a
refresh of the certs, and then running puppet on the proxies (for each project)
distributed the new certs.

There was also an issue with maps.wmflabs.org, which was serving the certificate
for tiles.maps.wmflabs.org due to a hiera misconfiguration:

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/d0d2250d15402129c6f51b4bc1960d42577f032f%5E%21/

Then a puppet run configured nginx properly (the certs were already there).

aborrero renamed this task from acme-chief didn't refresh certificates for cloud front proxies to acme-chief sometimes don't refresh certificates because it ignores SIGHUP.Feb 5 2021, 10:41 AM

aborrero renamed this task from acme-chief sometimes don't refresh certificates because it ignores SIGHUP to acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP.

aborrero added subscribers: Krenair, Vgutierrez.

Ladsgroup subscribed.Feb 5 2021, 12:54 PM

AmandaNP added a subscriber: stwalkerster.Feb 5 2021, 1:54 PM

AmandaNP subscribed.

I think I've seen acme-chief not responding to SIGHUP as expected before in deployment-prep, I worry this could happen in prod too.

taavi added a project: Acme-chief.Feb 6 2021, 10:04 AM

RhinosF1 subscribed.Feb 6 2021, 12:55 PM

In T273956#6808348, @Krenair wrote:

I think I've seen acme-chief not responding to SIGHUP as expected before in deployment-prep, I worry this could happen in prod too.

We have had some occurrences of this issue in prod, but the monitoring lets us detect it in time, before anything nasty happens.

Bstorm mentioned this in T282264: Monitor certificate validity for Cloud VPS.May 7 2021, 6:33 PM

At this point, we have suggested something like a regular restart for the service via systemd timer. That should be an easy enough fix. It probably only needs to be done weekly or something, depending on how acme-chief works.

You could take the script that Icinga _would_ use but use it yourself without all the Icinga around it.

So take modules/nagios_common/files/check_commands/check_ssl, a Perl script, and run it with the command line we currently use to check LE certs, which is:

check_ssl --warning 7 --critical 3 -H $HOSTADDRESS$ -p 443 --cn $ARG1$

Then you can run a timer every hour with the logic "if this expires soon, then and only then, do a restart".
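The "restart only when the certificate is about to expire" idea could be sketched in Python roughly as below. This is only an illustration and not the check_ssl script: the function names are hypothetical, the 7 and 3 day thresholds are borrowed from the --warning and --critical values in the command line above, and the sketch assumes the timer runs somewhere that can both reach the served certificate and restart acme-chief.

```python
import socket
import ssl
import subprocess
import time

WARNING_DAYS = 7   # borrowed from "--warning 7" in the check_ssl command line
CRITICAL_DAYS = 3  # borrowed from "--critical 3"


def seconds_until_expiry(host, port=443, timeout=10.0):
    """Return the seconds left before the served certificate expires.

    Returns a negative value when the certificate fails verification,
    which includes the case where it has already expired.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError:
        return -1.0
    return ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()


def needs_restart(seconds_left, threshold_days=WARNING_DAYS):
    """True when the certificate expires within threshold_days (or already has)."""
    return seconds_left < threshold_days * 86400


def restart_if_expiring(host, port=443):
    """Hypothetical hourly entry point: restart acme-chief only when needed."""
    if needs_restart(seconds_until_expiry(host, port)):
        # Assumption: the service is named acme-chief on this host.
        subprocess.run(["service", "acme-chief", "restart"], check=True)
        return True
    return False
```

A systemd timer or cron job would call restart_if_expiring() with one of the affected host names. A restart alone only refreshes the certs on the acme-chief host; a puppet run on the proxies is still what distributes them.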

dcaro removed dcaro as the assignee of this task.Aug 10 2021, 5:03 PM

Vgutierrez mentioned this in T292619: Implement a watchdog mechanism on acme-chief.Oct 6 2021, 9:29 AM

Vgutierrez closed subtask T292619: Implement a watchdog mechanism on acme-chief as Resolved.Oct 18 2021, 3:39 PM

@dcaro I've implemented systemd watchdog support in acme-chief. It is already running on the production instances and should keep acme-chief from hanging indefinitely. You could try enabling it on your instances as well; it should be as easy as updating to acme-chief 0.34 and setting profile::acme_chief::watchdog_sec, like we did in https://gerrit.wikimedia.org/r/c/operations/puppet/+/731335
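For readers unfamiliar with the systemd watchdog, the general mechanism can be sketched as follows. This is not acme-chief's actual code: the helper names are hypothetical, and the sketch assumes the unit is started with a WatchdogSec= value so that systemd exports NOTIFY_SOCKET and WATCHDOG_USEC to the process. The service must keep sending WATCHDOG=1; if it hangs and stops doing so, systemd treats it as failed and can restart it.

```python
import os
import socket


def sd_notify(message, address=None):
    """Send a state string such as "WATCHDOG=1" to systemd.

    Returns False when no notification socket is available, for example
    when the process was not started by systemd.
    """
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading "@" denotes an abstract namespace socket (NUL byte prefix).
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), address)
    return True


def watchdog_interval(default=None):
    """Return the suggested ping interval in seconds, or default if unset.

    systemd recommends sending keep-alives at half of the configured
    watchdog timeout, which it exposes in microseconds as WATCHDOG_USEC.
    """
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return default
    return int(usec) / 1_000_000 / 2
```

A daemon's main loop would call sd_notify("WATCHDOG=1") at least once per watchdog_interval() seconds, and only from a code path that proves the loop is still making progress.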

\o/ thanks a lot @Vgutierrez, will try it soon(ish)

dcaro added a project: User-dcaro.Oct 18 2021, 4:21 PM

bd808 mentioned this in T298353: problem with let's encrypt cert for star.tools.wmflabs.org.Jan 1 2022, 1:06 AM

aborrero added a subtask: T298353: problem with let's encrypt cert for star.tools.wmflabs.org.Feb 7 2022, 12:41 PM

aborrero mentioned this in T301117: toolsbeta acme-chief certtificate has expired.

aborrero added a subtask: T301117: toolsbeta acme-chief certtificate has expired.

aborrero removed a subtask: T301117: toolsbeta acme-chief certtificate has expired.Feb 7 2022, 12:46 PM

Enterprisey unsubscribed.Feb 8 2022, 12:43 AM

taavi closed subtask T298353: problem with let's encrypt cert for star.tools.wmflabs.org as Resolved.Feb 16 2022, 3:18 PM

In T273956#7437046, @Vgutierrez wrote:

@dcaro I've implemented systemd watchdog support in acme-chief. It is already running on the production instances and should keep acme-chief from hanging indefinitely. You could try enabling it on your instances as well; it should be as easy as updating to acme-chief 0.34 and setting profile::acme_chief::watchdog_sec, like we did in https://gerrit.wikimedia.org/r/c/operations/puppet/+/731335

In T273956#7437259, @dcaro wrote:

\o/ thanks a lot @Vgutierrez, will try it soon(ish)

@Majavah made this happen everywhere with https://gerrit.wikimedia.org/r/c/operations/puppet/+/759439 by making the default value for the profile::acme_chief::watchdog_sec hiera setting 600.

Toolforge admins got a notice today from Let's Encrypt that the certificates for *.toolforge.org, *.tools.wmflabs.org, mail.tools.wmcloud.org, mail.tools.wmflabs.org, toolforge.org, and tools.wmflabs.org were stale and expiring in 11 days. I checked on tools-acme-chief-01.tools.eqiad.wmflabs and found that acme-chief had been running with an uptime of 1 month 11 days (since Thu 2022-02-03). Issuing a service acme-chief restart followed by service acme-chief status showed "Counter({'NEEDS_RENEWAL': 6, 'VALID': 2})", and then the renewals started being processed.

Per my earlier investigation in T273956#7786238, I assumed that the watchdog process should be in place here, making this manual restart unnecessary. I eventually figured out that acme-chief 0.34-1 is needed to get the watchdog functionality and that 0.29-1 was installed on tools-acme-chief-01. I upgraded the package with apt update; apt install acme-chief and repeated that on tools-acme-chief-02.

We probably need to do a similar forced update on the rest of the acme-chief servers used as WMCS infrastructure.

Hosts to check for/update to acme-chief 0.34-1 from https://openstack-browser.toolforge.org/puppetclass/role::acme_chief::cloud:

cloudinfra-acme-chief-01.cloudinfra.eqiad1.wikimedia.cloud
deployment-acme-chief03.deployment-prep.eqiad1.wikimedia.cloud
deployment-acme-chief04.deployment-prep.eqiad1.wikimedia.cloud
paws-acme-chief-01.paws.eqiad1.wikimedia.cloud
project-proxy-acme-chief-01.project-proxy.eqiad1.wikimedia.cloud
tools-acme-chief-01.tools.eqiad1.wikimedia.cloud
tools-acme-chief-02.tools.eqiad1.wikimedia.cloud
toolsbeta-acme-chief-01.toolsbeta.eqiad1.wikimedia.cloud

The tools and project-proxy instances all needed the update. All other instances were already running the latest version. Hopefully this error will be eliminated now.
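The audit described above (find hosts still running a package older than 0.34-1) could be sketched like this. The function names are hypothetical, the version parsing is a deliberate simplification that handles only numeric "upstream-revision" strings such as 0.29-1 and not full Debian version ordering, and the installed versions would have to be collected from each host separately.

```python
REQUIRED = "0.34-1"  # first version with watchdog support, per the comment above


def parse_version(version):
    """Turn a string like "0.34-1" into a comparable tuple such as (0, 34, 1)."""
    upstream, _, revision = version.partition("-")
    parts = tuple(int(part) for part in upstream.split("."))
    return parts + (int(revision or 0),)


def hosts_needing_upgrade(installed, required=REQUIRED):
    """Return the sorted host names whose installed version is below required."""
    minimum = parse_version(required)
    return sorted(
        host for host, version in installed.items()
        if parse_version(version) < minimum
    )
```

For example, passing {"tools-acme-chief-01": "0.29-1"} returns ["tools-acme-chief-01"], while a host already on 0.34-1 is left out of the result.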

Restricted Application added a project: User-bd808. Mar 17 2022, 5:24 PM

aborrero awarded a token.Mar 17 2022, 5:50 PM

bd808 mentioned this in T306492: *.beta.wmflabs.org Certificate has expired (April 2022 edition).Apr 20 2022, 2:17 PM

bd808 added a subtask: T293585: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy.

dcaro mentioned this in T307333: [tools] toolserver.org cert is expiring in 2 days.May 2 2022, 8:57 AM

Vgutierrez closed subtask T293585: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy as Resolved.Oct 17 2022, 2:42 PM

AlexisJazz reopened subtask T293585: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy as Open.May 28 2023, 6:20 AM

AlexisJazz closed subtask T293585: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy as Resolved.Jun 13 2023, 9:26 AM