Page MenuHomePhabricator

acme-chief failing in puppet with "Cannot open input file"
Open, NormalPublic

Description

Starting around 08:00 UTC today, puppet in cp-eqsin failed on acme-chief related activities with:

x509: Cannot open input file /etc/acmecerts/unified/new/ec-prime256v1.crt, No such file or directory

Full puppetfail log on P8808

Event Timeline

ema created this task.Fri, Jul 26, 8:49 AM
Restricted Application added a project: Operations. · View Herald TranscriptFri, Jul 26, 8:49 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Normal priority.Fri, Jul 26, 8:49 AM

Change 525778 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp-eqsin: do not deploy acme-chief unified certs

https://gerrit.wikimedia.org/r/525778

Change 525778 merged by Ema:
[operations/puppet@production] cp-eqsin: do not deploy acme-chief unified certs

https://gerrit.wikimedia.org/r/525778

This happens at the same time that the unified cert is being renewed:

Jul 26 08:00:02 acmechief1001 acme-chief-backend[8198]: Number of certificates per status: Counter({'VALID': 42, 'NEEDS_RENEWAL': 2})
Jul 26 08:00:02 acmechief1001 acme-chief-backend[8198]: Handling new certificate event for unified / ec-prime256v1
Jul 26 08:00:02 acmechief1001 acme-chief-backend[8198]: Creating new certificate version a40ba19e20ff4516bb7906d154cf5539 for unified
Jul 26 08:00:15 acmechief1001 acme-chief-backend[8198]: Triggering DNS zone update...
Jul 26 08:00:15 acmechief1001 acme-chief-backend[8198]: Running subprocess ['/usr/local/bin/acme-chief-gdnsd-sync.py', '--remote-servers', 'authdns1001.wikimedia.org', 'authdns2001.wikimedia.org', 'multat
Jul 26 08:00:17 acmechief1001 acme-chief-backend[8198]: Handling pushed CSR event for unified / ec-prime256v1
Jul 26 08:00:27 acmechief1001 acme-chief-backend[8198]: Handling validated challenges event for unified / ec-prime256v1
Jul 26 08:00:30 acmechief1001 acme-chief-backend[8198]: Handling pushed challenges event for unified / ec-prime256v1
Jul 26 08:00:34 acmechief1001 acme-chief-backend[8198]: Handling order finalized event for unified / ec-prime256v1
Jul 26 08:00:35 acmechief1001 acme-chief-backend[8198]: Enforcing staging_time for unified / ec-prime256v1
Jul 26 08:00:35 acmechief1001 acme-chief-backend[8198]: Staging_time will be enforced for unified / ec-prime256v1 till 2019-08-02 07:00:33
Jul 26 08:00:35 acmechief1001 acme-chief-backend[8198]: Handling new certificate event for unified / rsa-2048
Jul 26 08:00:39 acmechief1001 acme-chief-backend[8198]: Skipping challenge validation for certificate unified / rsa-2048
Jul 26 08:00:44 acmechief1001 acme-chief-backend[8198]: Handling pushed challenges event for unified / rsa-2048
Jul 26 08:00:48 acmechief1001 acme-chief-backend[8198]: Handling order finalized event for unified / rsa-2048
Jul 26 08:00:49 acmechief1001 acme-chief-backend[8198]: Enforcing staging_time for unified / rsa-2048
Jul 26 08:00:49 acmechief1001 acme-chief-backend[8198]: Staging_time will be enforced for unified / rsa-2048 till 2019-08-02 07:00:47
Jul 26 09:00:01 acmechief1001 systemd[1]: Reloading acme-chief Service.
Jul 26 09:00:01 acmechief1001 acme-chief-backend[8198]: SIGHUP received
Jul 26 09:00:01 acmechief1001 systemd[1]: Reloaded acme-chief Service.
Jul 26 09:00:01 acmechief1001 acme-chief-backend[8198]: Missing/invalid DNS zone updater CMD timeout, using the default one: 60.00
Jul 26 09:00:01 acmechief1001 acme-chief-backend[8198]: Number of certificates per status: Counter({'VALID': 42, 'CERTIFICATE_STAGED': 2})
Jul 26 09:00:02 acmechief1001 acme-chief-backend[8198]: Enforcing staging_time for unified / ec-prime256v1
Jul 26 09:00:02 acmechief1001 acme-chief-backend[8198]: Staging_time will be enforced for unified / ec-prime256v1 till 2019-08-02 07:00:33
Jul 26 09:00:02 acmechief1001 acme-chief-backend[8198]: Enforcing staging_time for unified / rsa-2048
Jul 26 09:00:02 acmechief1001 acme-chief-backend[8198]: Staging_time will be enforced for unified / rsa-2048 till 2019-08-02 07:00:47

as expected, acme-chief stages the new cert for a whole week as configured:

unified:
        CN: '*.wikipedia.org'
        SNI:
        - '*.wikipedia.org'
        - 'wikimedia.org'
        - 'mediawiki.org'
        - 'wikibooks.org'
        - 'wikidata.org'
        - 'wikinews.org'
        - 'wikiquote.org'
        - 'wikisource.org'
        - 'wikiversity.org'
        - 'wikivoyage.org'
        - 'wiktionary.org'
        - 'wikimediafoundation.org'
        - 'w.wiki'
        - 'wmfusercontent.org'
        - '*.m.wikipedia.org'
        - '*.wikimedia.org'
        - '*.m.wikimedia.org'
        - '*.planet.wikimedia.org'
        - '*.mediawiki.org'
        - '*.m.mediawiki.org'
        - '*.wikibooks.org'
        - '*.m.wikibooks.org'
        - '*.wikidata.org'
        - '*.m.wikidata.org'
        - '*.wikinews.org'
        - '*.m.wikinews.org'
        - '*.wikiquote.org'
        - '*.m.wikiquote.org'
        - '*.wikisource.org'
        - '*.m.wikisource.org'
        - '*.wikiversity.org'
        - '*.m.wikiversity.org'
        - '*.wikivoyage.org'
        - '*.m.wikivoyage.org'
        - '*.wiktionary.org'
        - '*.m.wiktionary.org'
        - '*.wikimediafoundation.org'
        - '*.wmfusercontent.org'
        - 'wikipedia.org'
        staging_time: 604800
        challenge: dns-01
        authorized_regexes:
        - '^cp50[0-1][0-9]\.eqsin\.wmnet$'

update-ocsp is configured to use the certificate only version to perform the OCSP stapling:

`
vgutierrez@cp5001:/etc/update-ocsp.d$ cat unified-new-ec-prime256v1.conf
[Options]
Proxy=webproxy.eqsin.wmnet:8080
Certificates=/etc/acmecerts/unified/new/ec-prime256v1.crt
Output=/etc/acmecerts/unified/new/ec-prime256v1.client.ocsp

But acme-chief only generates that version of the certificate when the cert is moved from new to live and in the meanwhile it only keeps the chained version available:

vgutierrez@cp5001:/etc/update-ocsp.d$ sudo ls -alh /etc/acmecerts/unified/new/
total 56K
drwxr-x--- 8 root root 4.0K Jul 26 09:15 .
drwxr-x--- 6 root root 4.0K Jul 26 08:16 ..
-rw-r----- 1 root root 4.2K Jul 26 08:16 ec-prime256v1.chained.crt
-rw-r----- 1 root root  227 Jul 26 08:16 ec-prime256v1.key
-rw-r----- 1 root root 4.4K Jul 26 08:16 rsa-2048.chained.crt
-rw-r----- 1 root root 1.7K Jul 26 08:16 rsa-2048.key
Vgutierrez raised the priority of this task from Normal to High.Fri, Jul 26, 9:50 AM

This is a big issue, cause right now due to the invalid state of update-ocsp/acme-chief, nginx cannot be restarted in the cp nodes in eqsin

Vgutierrez lowered the priority of this task from High to Normal.Fri, Jul 26, 10:37 AM

So, I've manually generated the missing versions on acmechief1001:

>>> cert = Certificate.load('/var/lib/acme-chief/certs/unified/new/rsa-2048.chained.crt')
>>> cert.save('/var/lib/acme-chief/certs/unified/new/rsa-2048.crt', mode=CERTIFICATE_TYPES['cert_only']['save_mode'])
>>> cert.save('/var/lib/acme-chief/certs/unified/new/rsa-2048.chain.crt', mode=CERTIFICATE_TYPES['chain_only']['save_mode'])
>>> cert = Certificate.load('/var/lib/acme-chief/certs/unified/new/ec-prime256v1.chained.crt')
>>> cert.save('/var/lib/acme-chief/certs/unified/new/ec-prime256v1.crt', mode=CERTIFICATE_TYPES['cert_only']['save_mode'])
>>> cert.save('/var/lib/acme-chief/certs/unified/new/ec-prime256v1.chain.crt', mode=CERTIFICATE_TYPES['chain_only']['save_mode'])

After reverting commit 11c0a2c884d5a80f9f244ccc4dcfad846b03bdce in https://gerrit.wikimedia.org/r/c/operations/puppet/+/525787, we can see how puppet on cp5001 is happy now and the new version of the unified cert gets OCSP stapled as expected:

Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/File[/etc/acmecerts/unified/a40ba19e20ff4516bb7906d154cf5539/ec-prime256v1.chain.crt]/ensure: defined content as '{md5}6a18b35b02d0b252d9342098265599c5'
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/File[/etc/acmecerts/unified/a40ba19e20ff4516bb7906d154cf5539/ec-prime256v1.crt]/ensure: defined content as '{md5}9a61f9c6d243a648f47133346800d4da'
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/File[/etc/acmecerts/unified/a40ba19e20ff4516bb7906d154cf5539/rsa-2048.chain.crt]/ensure: defined content as '{md5}6a18b35b02d0b252d9342098265599c5'
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/File[/etc/acmecerts/unified/a40ba19e20ff4516bb7906d154cf5539/rsa-2048.crt]/ensure: defined content as '{md5}4418b3939aefa2603828e5093359dfbe'
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/Exec[unified-new-ec-prime256v1-create-ocsp]/returns: executed successfully
Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Acme_chief::Cert[unified]/Exec[unified-new-rsa-2048-create-ocsp]/returns: executed successfully

This effectively solves the issue commented in my previous comment

Mentioned in SAL (#wikimedia-operations) [2019-07-26T14:10:50Z] <ema> disable puppet on cache nodes T229091

Mentioned in SAL (#wikimedia-operations) [2019-07-26T14:23:20Z] <ema> re-enable puppet on cache nodes T229091

As it's been done with unified, wikibase required the same patch:

>>> from acme_chief.x509 import Certificate
>>> from acme_chief.acme_chief import CERTIFICATE_TYPES
>>> cert = Certificate.load('/var/lib/acme-chief/certs/wikibase/new/rsa-2048.chained.crt')
>>> cert.save('/var/lib/acme-chief/certs/wikibase/new/rsa-2048.crt', mode=CERTIFICATE_TYPES['cert_only']['save_mode'])
>>> cert.save('/var/lib/acme-chief/certs/wikibase/new/rsa-2048.chain.crt', mode=CERTIFICATE_TYPES['chain_only']['save_mode'])
>>> cert = Certificate.load('/var/lib/acme-chief/certs/wikibase/new/ec-prime256v1.chained.crt')
>>> cert.save('/var/lib/acme-chief/certs/wikibase/new/ec-prime256v1.crt', mode=CERTIFICATE_TYPES['cert_only']['save_mode'])
>>> cert.save('/var/lib/acme-chief/certs/wikibase/new/ec-prime256v1.chain.crt', mode=CERTIFICATE_TYPES['chain_only']['save_mode'])
ema moved this task from Triage to TLS on the Traffic board.Tue, Jul 30, 10:38 AM