SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left
Closed, Resolved (Public)

Description

There are alerts on all ncredir* servers regarding the OCSP staple validity for www.wikipedia.bg.

Additionally, we got alerts of the form "PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend". Looking into it, we determined that a lingering grep acme-chief-backend process was still running, which made the check fail.

Event Timeline

I can see that the challenges get set on the DNS hosts: running e.g. dig @208.80.154.238 -t txt _acme-challenge.wiki-pedia.org a little past the hour returns appropriate responses for the TXT record.

(Side note: Why do they never change? Also, why are there 4 lines with two values, i.e.

;; ANSWER SECTION:
_acme-challenge.wiki-pedia.org.	0	IN	TXT	"challenge-string-A-here"
_acme-challenge.wiki-pedia.org.	0	IN	TXT	"challenge-string-B-here"
_acme-challenge.wiki-pedia.org.	0	IN	TXT	"challenge-string-A-here-again"
_acme-challenge.wiki-pedia.org.	0	IN	TXT	"challenge-string-B-here-again"

A question for another time.)

Anyway, according to the logs on acmechief1001, validation of _acme-challenge.wiki-pedia.org fails against all DNS servers (I checked, and the failing addresses cover all the IPs of dns[1-5]00[1-2].wikimedia.org).
Sample failure line from the log:

Jan 30 00:00:35 acmechief1001 acme-chief-backend[22912]: DNS server 2620:0:863:1:198:35:26:7 (ACMEChallengeValidation.INVALID) failed to validate challenge Challenge type: ACMEChallengeType.DNS01. _acme-challenge.wiki-pedia.org TXT <stuff-here>

even though that <stuff-here> string is one of the two challenge strings I see when asking for the TXT record myself.
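
The manual comparison described above (is the string from the log among the TXT values that dig returns?) can be sketched in Python. This is only an illustration, not what acme-chief does internally: the helper names `txt_values` and `challenge_published` are hypothetical, the input is pasted dig ANSWER-section text, and the challenge strings are placeholders.

```python
import shlex


def txt_values(answer_section, name):
    """Collect the TXT values for `name` from pasted dig ANSWER-section text."""
    wanted = name.rstrip(".").lower()
    values = set()
    for line in answer_section.splitlines():
        line = line.strip()
        # Skip blank lines and dig comment lines such as ";; ANSWER SECTION:".
        if not line or line.startswith(";"):
            continue
        # shlex handles the quoted TXT strings; layout: owner TTL class type data...
        fields = shlex.split(line)
        if len(fields) < 5:
            continue
        owner, rtype = fields[0], fields[3]
        if owner.rstrip(".").lower() == wanted and rtype.upper() == "TXT":
            # A TXT record may carry several strings; join them into one value.
            values.add("".join(fields[4:]))
    return values


def challenge_published(answer_section, name, expected):
    """True if `expected` is one of the TXT values served for `name`."""
    return expected in txt_values(answer_section, name)


# Placeholder data modelled on the dig output quoted in this task.
SAMPLE = """
;; ANSWER SECTION:
_acme-challenge.wiki-pedia.org. 0 IN TXT "placeholder-A"
_acme-challenge.wiki-pedia.org. 0 IN TXT "placeholder-B"
"""

print(challenge_published(SAMPLE, "_acme-challenge.wiki-pedia.org", "placeholder-A"))
```

If this check passes for every authoritative server while the backend still logs INVALID, the mismatch is more likely in how or from where the backend queries than in the published records.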

So is it something in the routes, in ports, ...? I take note of T240614, but there have also been various network tweaks recently. Pinging @BBlack, who hopefully will have some insight.

We have several bugs here:

  1. acme-chief should refresh the OCSP stapling response even if it is unable to renew the certificate
  2. acme-chief should issue the certificate, skipping wiki-pedia.org, because skip_invalid_snis: true is set for non-canonical-redirect-3
  3. acme-chief is failing to prevalidate _acme-challenge.wiki-pedia.org
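
The behaviour asked for in items 1 and 2 can be sketched as two independent decisions. This is a minimal illustration under assumptions, not acme-chief's code: the function names, the two-day threshold and the six-hour remaining validity are hypothetical; only the hostnames and the skip_invalid_snis flag come from this task.

```python
from datetime import datetime, timedelta, timezone


def snis_to_request(snis, failed, skip_invalid_snis):
    """Return the SNIs to request, dropping failed ones only when allowed."""
    failed = set(failed)
    if failed and not skip_invalid_snis:
        # Without the flag, one failing SNI blocks the whole certificate.
        raise ValueError("validation failed for: " + ", ".join(sorted(failed)))
    return [sni for sni in snis if sni not in failed]


def ocsp_needs_refresh(next_update, now, min_remaining=timedelta(days=2)):
    """Decide from the staple's own expiry, never from the renewal outcome."""
    return next_update - now <= min_remaining


# Item 2: the certificate is still requested, minus the failing SNI.
print(snis_to_request(["www.wikipedia.bg", "wiki-pedia.org"], {"wiki-pedia.org"}, True))

# Item 1: a staple close to expiry is refreshed even if renewal keeps failing.
now = datetime(2020, 1, 30, 16, 0, tzinfo=timezone.utc)
print(ocsp_needs_refresh(now + timedelta(hours=6), now))
```

Keeping the two checks separate is the point of the sketch: a stuck renewal should not be able to starve the OCSP refresh.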

If I can add a 4th and 5th with lower priority (feel free to disagree): the "Ensure acme-chief-backend is running only in the active node" check should not use the -a parameter, but match only the first one or two arguments. Also, maybe some extra monitoring for repeated failures (I believe someone saw errors on execution). These are minor compared to avoiding an outage here, which is the top priority for now.
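
To illustrate the process-check suggestion, here is a small Python sketch comparing a substring match over the whole command line with a match anchored to the leading arguments. It is not the monitoring plugin's code: the command lines are made-up examples, and it assumes the backend runs as a Python interpreter followed by a script named acme-chief-backend.

```python
import os

NAME = "acme-chief-backend"


def loose_match(argv):
    """Substring match over the whole command line."""
    return NAME in " ".join(argv)


def strict_match(argv):
    """Match only the executable, or a script run by a Python interpreter."""
    if not argv:
        return False
    exe = os.path.basename(argv[0])
    if exe == NAME:
        return True
    # Checking only the first two arguments is not enough on its own, because
    # "grep acme-chief-backend" has the name as its second argument as well.
    return (
        len(argv) > 1
        and exe.startswith("python")
        and os.path.basename(argv[1]) == NAME
    )


# Hypothetical process table: the real backend plus a lingering grep.
PROCESSES = [
    ["/usr/bin/python3", "/usr/bin/acme-chief-backend"],
    ["grep", "acme-chief-backend"],
]

print(sum(loose_match(p) for p in PROCESSES))   # both processes are counted
print(sum(strict_match(p) for p in PROCESSES))  # only the backend is counted
```

The loose variant reproduces the "2 processes" alert seen here, while the anchored variant counts only the backend.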

Hmm, actually I'm wrong: the prevalidation works as expected for wiki-pedia.org. It's the actual DNS challenge validation that fails on the acme-chief side (so it never signals Let's Encrypt to perform the validation on their side).

Mentioned in SAL (#wikimedia-operations) [2020-01-30T16:26:56Z] <vgutierrez> manually refreshing OCSP stapling response for non-canonical-redirects-3 - T243948

I've run a manual OCSP refresh for non-canonical-redirect-3 by running:

sudo http_proxy=http://webproxy.eqiad.wmnet:8080 python3 ~vgutierrez/ocsp.py non-canonical-redirect-3
sudo cp rsa-2048.ocsp ec-prime256v1.ocsp /var/lib/acme-chief/certs/non-canonical-redirect-3/live/

After that, from a cumin master:

sudo -i cumin 'A:ncredir' "run-puppet-agent -q"

jijiki triaged this task as Medium priority. Feb 1 2020, 1:30 PM

Mentioned in SAL (#wikimedia-operations) [2020-02-04T09:08:19Z] <vgutierrez> manually refreshing OCSP stapling response for non-canonical-redirects-3 - T243948

Vgutierrez claimed this task.

After solving T240614, acme-chief has been able to renew non-canonical-redirect-3, so the OCSP stapling refresh is fixed as well. I'm closing this alert task, but T244232 will track the OCSP response fetch fix.

BCornwall closed this task as Resolved.