*.beta.wmflabs.org Certificate has expired (November 2021 edition)
Open, Needs Triage · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

Also causing MinervaNeue tests to fail

What should have happened instead?:

  • Certificate should be renewed

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2021-11-18T18:39:56Z] <urbanecm> deployment-prep root@deployment-acme-chief03:/var/lib/acme-chief/certs/mx# rm new && mv dbe71be4db0b4e58a3da4fc410d322bd dbe71be4db0b4e58a3da4fc410d322bd-bak && ln -s 92b8ed4bf5494405a75a0b3fb1d59422 new # T296000

Mentioned in SAL (#wikimedia-releng) [2021-11-18T18:43:18Z] <urbanecm> deployment-prep remove wikifunctions-related from ACME chief to attempt to at least workaround T296000

I logged in to acme-chief and checked the cert there. root@deployment-acme-chief03:/var/lib/acme-chief/certs/unified/live# openssl x509 -in rsa-2048.crt -text -noout showed that the cert is no longer valid. Running acme-chief-backend manually:

root@deployment-acme-chief03:/var/log# /usr/bin/acme-chief-backend
SIGHUP received
Missing/invalid DNS zone updater CMD timeout, using the default one: 60.00
Certificate unified type ec-prime256v1 expired on 2021-11-18 15:26:11
Certificate unified type rsa-2048 expired on 2021-11-18 15:25:41
Certificate wikibase type ec-prime256v1 expired on 2021-10-09 01:01:31
Certificate wikibase type rsa-2048 expired on 2021-10-09 01:00:55
Number of certificates per status: Counter({'EXPIRED': 4, 'VALID': 1, 'NEEDS_RENEWAL': 1})
Starting main loop...
Traceback (most recent call last):
  File "/usr/bin/acme-chief-backend", line 11, in <module>
    load_entry_point('acme-chief==0.34', 'console_scripts', 'acme-chief-backend')()
  File "/usr/lib/python3/dist-packages/acme_chief/acme_chief.py", line 981, in main
    ACMEChief().run()
  File "/usr/lib/python3/dist-packages/acme_chief/acme_chief.py", line 408, in run
    self.certificate_management()
  File "/usr/lib/python3/dist-packages/acme_chief/acme_chief.py", line 940, in certificate_management
    self._fetch_ocsp_response(cert_id, key_type_id)
  File "/usr/lib/python3/dist-packages/acme_chief/acme_chief.py", line 874, in _fetch_ocsp_response
    file_type='cert', kind=kind, cert_type='full_chain'))
  File "/usr/lib/python3/dist-packages/acme_chief/x509.py", line 318, in load
    with open(path, 'rb') as pem_file:
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/acme-chief/certs/mx/new/ec-prime256v1.chained.crt'
root@deployment-acme-chief03:/var/log#
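For reference, the expiry check itself can be scripted. This is only a minimal sketch, assuming the end date is taken from `openssl x509 -enddate -noout -in rsa-2048.crt`, which prints a single `notAfter=...` line. The helper names `parse_not_after` and `is_expired` are made up for illustration, and the sample value is reconstructed from the rsa-2048 expiry in the log above on the assumption that the logged time is UTC.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(enddate_line: str) -> datetime:
    """Parse a line such as 'notAfter=Nov 18 15:25:41 2021 GMT' into a UTC datetime."""
    value = enddate_line.strip().split("=", 1)[-1]
    # ssl.cert_time_to_seconds understands the "Mon DD HH:MM:SS YYYY GMT" format.
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def is_expired(enddate_line: str, now: datetime) -> bool:
    """Return True if the certificate's notAfter time is at or before `now`."""
    return parse_not_after(enddate_line) <= now


# Assumed sample: the rsa-2048 expiry from the log above, taken to be UTC.
line = "notAfter=Nov 18 15:25:41 2021 GMT"
check_time = datetime(2021, 11, 18, 18, 39, tzinfo=timezone.utc)
print(is_expired(line, check_time))  # True
```

Passing `now` explicitly keeps the check reproducible; in practice it would be `datetime.now(timezone.utc)`.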

After working around that issue (by copying old cert versions into the nearly empty new directory), it complained about something else:

acme.messages.Error: urn:ietf:params:acme:error:malformed :: The request message was malformed :: Error creating new order :: Domain name "m.wikifunctions.beta.wmflabs.org" is redundant with a wildcard domain in the same request. Remove one or the other from the certificate request.

This led me to temporarily disable the wikifunctions certificates (cc @Jdforrester-WMF) and the mx certs, to at least unbreak most of beta.
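The copy workaround mentioned above (filling the nearly empty new directory from an older cert version) could look roughly like the sketch below. It is not what was actually run. The function name `fill_missing_files` is hypothetical, and it assumes both arguments are plain directories containing the certificate files, so a symlink such as `new` would need to be resolved first. Files that already exist in the target are left untouched.

```python
import shutil
from pathlib import Path


def fill_missing_files(old_dir: Path, new_dir: Path) -> list[str]:
    """Copy regular files present in old_dir but absent from new_dir.

    Returns the names of the files that were copied, in sorted order.
    """
    copied = []
    for src in sorted(old_dir.iterdir()):
        dst = new_dir / src.name
        # Never overwrite anything that is already in the new directory.
        if src.is_file() and not dst.exists():
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied
```

Since this reuses old (possibly expired) material only to get past the FileNotFoundError, it is a stopgap rather than a fix.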

Re-running acme-chief updated the certs at acme-chief, and running puppet (and manually reloading trafficserver-tls) at the cache layer fixed things.

TODO steps

Fix the mx cert issue and wikifunctions cert issue in a non-destructive way.

@Urbanecm: can you check the upload hosts? I believe something needs restarting for them.

RhinosF1 renamed this task from 'en.wikipedia.beta.wmflabs.org' Certificate has expired to *.beta.wmflabs.org Certificate has expired. (Fri, Nov 19, 9:56 PM)
RhinosF1 renamed this task from *.beta.wmflabs.org Certificate has expired to *.beta.wmflabs.org Certificate has expired (November 2021 edition).

Mentioned in SAL (#wikimedia-releng) [2021-11-19T21:58:33Z] <urbanecm> urbanecm@deployment-cache-upload06:~$ sudo systemctl reload trafficserver-tls.service # T296000

@Urbanecm: can you check the upload hosts? I believe something needs restarting for them.

Reloaded trafficserver-tls at cache-upload06, and it works for me now. Acmechief already distributed the right cert, so a reload was the last missing thing.

Works for me as well. Note that the upload.wikimedia.beta.wmflabs.org certificate is usually valid for 90 days, but this time it expired after just a month. See T293070, T296113.

After working around that issue (by copying old cert versions into the nearly empty new directory), it complained about something else:

acme.messages.Error: urn:ietf:params:acme:error:malformed :: The request message was malformed :: Error creating new order :: Domain name "m.wikifunctions.beta.wmflabs.org" is redundant with a wildcard domain in the same request. Remove one or the other from the certificate request.

This led me to disable wikifunctions certificates (cc @Jdforrester-WMF)

WF config should be the same as for Wikidata on the Beta Cluster; it sounds like there's a wildcard line for it instead?
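To check a SAN list for this kind of overlap before an order is submitted, something like the following could help. It is a sketch only: the function name `redundant_names` is made up, the sample list is hypothetical (only the m. name appears in the error message; the wildcard entry is an assumption), and it relies on the usual rule that a wildcard covers exactly one leading label.

```python
def redundant_names(names: list[str]) -> list[str]:
    """Return the names that a wildcard entry in the same list already covers."""
    # "*.example.org" covers "a.example.org" but not "a.b.example.org"
    # and not "example.org" itself.
    wildcard_parents = {n[2:].lower() for n in names if n.startswith("*.")}
    redundant = []
    for name in names:
        if name.startswith("*."):
            continue
        # Drop the first label and see whether a wildcard exists for the rest.
        _, _, parent = name.lower().partition(".")
        if parent in wildcard_parents:
            redundant.append(name)
    return redundant


# Hypothetical SAN list; the wildcard line is assumed, not taken from the config.
sans = [
    "wikifunctions.beta.wmflabs.org",
    "*.wikifunctions.beta.wmflabs.org",
    "m.wikifunctions.beta.wmflabs.org",
]
print(redundant_names(sans))  # ['m.wikifunctions.beta.wmflabs.org']
```

If the real config turns out to have such a wildcard line, dropping either the wildcard or the explicit m. name should satisfy the ACME server, as the error message suggests.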