problem with let's encrypt cert for star.tools.wmflabs.org
Closed, Resolved (Public)

Description

I received this email:

Hello,

Your certificate (or certificates) for the names listed below will expire in 11 days (on 08 Jan 22 10:00 +0000). Please make sure to renew your certificate before then, or visitors to your web site will encounter errors.

We recommend renewing certificates automatically when they have a third of their total lifetime left. For Let's Encrypt's current 90-day certificates, that means renewing 30 days before expiration. See https://letsencrypt.org/docs/integration-guide/ for details.

*.tools.wmflabs.org
tools.wmflabs.org
[...]

Why wasn't this cert renewed automatically, as it always has been?

I suspect a bug or other problem in the tools acme-chief server.
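
For reference, the one-third rule quoted in the email can be checked with a few lines of Python. This is only a sketch: the function name renewal_due is made up for illustration, and the issue date is an assumption (exactly 90 days before the expiry stated in the email), since the email does not give it.

```python
from datetime import datetime, timedelta, timezone


def renewal_due(not_before, not_after, now):
    """Return True once a third or less of the certificate lifetime remains."""
    lifetime = not_after - not_before
    remaining = not_after - now
    # Compare without floats: remaining <= lifetime / 3
    return remaining * 3 <= lifetime


# Expiry taken from the email; the issue date is ASSUMED to be 90 days earlier.
not_after = datetime(2022, 1, 8, 10, 0, tzinfo=timezone.utc)
not_before = not_after - timedelta(days=90)

# Roughly the time the task was being debugged (from the log timestamps).
now = datetime(2021, 12, 28, 20, 28, tzinfo=timezone.utc)
print(renewal_due(not_before, not_after, now))
```

Under these assumptions the certificate was well inside the 30-day renewal window when the warning arrived, which is consistent with automatic renewal having silently failed.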

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2021-12-28T20:31:01Z] <taavi> restarting acme-chief to debug T298353

I just went to have a look and it appears the cert in
/var/lib/acme-chief/certs/tools-legacy/live/rsa-2048.crt just got renewed
like a minute ago. Majavah I see you're logged in, did you do some magic?

The only thing I did was sudo systemctl restart acme-chief.service. As far as I can see in the logs, the renewal failed:

Dec 28 20:28:33 tools-acme-chief-01 acme-chief-backend[17337]: Handling pushed challenges event for tools-legacy / ec-prime256v1
Dec 28 20:28:34 tools-acme-chief-01 acme-chief-backend[17337]: Handling order finalized event for tools-legacy / ec-prime256v1
Dec 28 20:28:35 tools-acme-chief-01 acme-chief-backend[17337]: Enforcing staging_time for tools-legacy / ec-prime256v1
Dec 28 20:28:35 tools-acme-chief-01 acme-chief-backend[17337]: Pushing the new certificate for tools-legacy
Dec 28 20:28:35 tools-acme-chief-01 acme-chief-backend[17337]: Waiting till tools-legacy / rsa-2048 is generated to be able to push the new certificate
Dec 28 20:28:35 tools-acme-chief-01 acme-chief-backend[17337]: Handling pushed challenges event for tools-legacy / rsa-2048
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: ACME directory has returned a generic finalization error for order https://acme-v02.api.letsencrypt.org/acme/order/69142910/50878449700
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: Problem getting certificate for certificate tools-legacy / rsa-2048
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: Traceback (most recent call last):
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme_chief/acme_requests.py", line 464, in finalize_order
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     self.acme_client.only_finalize_order(polled_order)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme_chief/acme_requests.py", line 218, in only_finalize_order
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     super().finalize_order(orderr, deadline=datetime.fromtimestamp(0))
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme/client.py", line 757, in finalize_order
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     self._post(orderr.body.finalize, wrapped_csr)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme/client.py", line 97, in _post
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     return self.net.post(*args, **kwargs)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme/client.py", line 1228, in post
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     return self._post_once(*args, **kwargs)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme/client.py", line 1242, in _post_once
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     response = self._check_response(response, content_type=content_type)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme/client.py", line 1097, in _check_response
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     raise messages.Error.from_json(jobj)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: acme.messages.Error: urn:ietf:params:acme:error:orderNotReady :: Order's status ("valid") is not acceptable for finalization
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: The above exception was the direct cause of the following exception:
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: Traceback (most recent call last):
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme_chief/acme_chief.py", line 709, in _handle_pushed_challenges
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     session.finalize_order(csr_id)
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:   File "/usr/lib/python3/dist-packages/acme_chief/acme_requests.py", line 468, in finalize_order
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]:     raise ACMEError('Unable to get certificate') from finalize_error
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: acme_chief.acme_requests.ACMEError: Unable to get certificate
Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: Handling pushed CSR event for tools_mail / ec-prime256v1

although it seems to have worked a bit after that:

Dec 28 20:28:42 tools-acme-chief-01 acme-chief-backend[17337]: Pushing the new certificate for tools-legacy
Dec 28 20:28:42 tools-acme-chief-01 acme-chief-backend[17337]: Waiting till tools-legacy / rsa-2048 is generated to be able to push the new certificate
Dec 28 20:28:42 tools-acme-chief-01 acme-chief-backend[17337]: Refreshing new OCSP response for certificate tools-legacy / ec-prime256v1
Dec 28 20:28:42 tools-acme-chief-01 acme-chief-backend[17337]: new OCSP response refreshed successfully for tools-legacy / ec-prime256v1
Dec 28 20:28:42 tools-acme-chief-01 acme-chief-backend[17337]: Handling new certificate event for tools-legacy / rsa-2048
Dec 28 20:28:43 tools-acme-chief-01 acme-chief-backend[17337]: Skipping challenge validation for certificate tools-legacy / rsa-2048
Dec 28 20:29:00 tools-acme-chief-01 acme-chief-backend[17337]: Refreshing live OCSP response for certificate tools-legacy / rsa-2048
Dec 28 20:29:00 tools-acme-chief-01 acme-chief-backend[17337]: live OCSP response refreshed successfully for tools-legacy / rsa-2048
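
To spot this kind of failure without reading the whole journal, the "Problem getting certificate" lines can be picked out with a small script. This is a rough sketch that assumes the message format stays exactly as in the excerpt above; the function name failed_renewals is hypothetical and not part of acme-chief.

```python
import re

# Matches the failure message seen in the journal excerpt above.
FAILURE_RE = re.compile(
    r"Problem getting certificate for certificate (?P<cert>\S+) / (?P<key>\S+)"
)


def failed_renewals(lines):
    """Return (certificate, key type) pairs for every logged failure."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            failures.append((match.group("cert"), match.group("key")))
    return failures


sample = [
    "Dec 28 20:28:35 tools-acme-chief-01 acme-chief-backend[17337]: "
    "Handling pushed challenges event for tools-legacy / rsa-2048",
    "Dec 28 20:28:36 tools-acme-chief-01 acme-chief-backend[17337]: "
    "Problem getting certificate for certificate tools-legacy / rsa-2048",
    "Dec 28 20:28:42 tools-acme-chief-01 acme-chief-backend[17337]: "
    "Handling new certificate event for tools-legacy / rsa-2048",
]
print(failed_renewals(sample))
```

In practice the lines would come from something like the service's journal output piped into the script rather than a hard-coded list.
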

aborrero lowered the priority of this task from High to Medium (Dec 28 2021, 8:51 PM).

This sounds like more of T273956: acme-chief sometimes doesn't refresh certificates because it ignores SIGHUP. We should try setting the profile::acme_chief::watchdog_sec hiera key for all Cloud VPS deployments similar to what was done for prod with https://gerrit.wikimedia.org/r/c/operations/puppet/+/731335.
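
For context, here is a rough sketch of how a daemon usually cooperates with a systemd watchdog, assuming the watchdog_sec key ends up as WatchdogSec= in the service unit (I have not verified that here). It is not acme-chief's actual code; the helper names are made up. systemd exports NOTIFY_SOCKET and WATCHDOG_USEC to the service, and the service is expected to send WATCHDOG=1 periodically, otherwise systemd treats it as hung and restarts it.

```python
import os
import socket


def sd_notify(message):
    """Send a datagram to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset (not running under systemd).
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def watchdog_ping_interval():
    """Return half the watchdog timeout in seconds, or None if no watchdog is set."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


# Inside the daemon's main loop one would call, at least every
# watchdog_ping_interval() seconds:
#     sd_notify("WATCHDOG=1")
```

The point of the watchdog is that a stuck main loop stops sending pings, so systemd restarts the service, which is effectively what the manual restart did here.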

taavi claimed this task.