Page MenuHomePhabricator

cronspam from cpXXXX hosts due to update-ocsp-all and zero_fetch
Closed, ResolvedPublic

Description

Command openssl ocsp -nonce -resp_text -respout /var/cache/ocsp/update-ocsp-PAZBNZ.tmp/unified.ocsp -issuer /etc/ssl/certs/b85455c4.0 -path http://ocsp2.globalsign.com/gsorganizationvalsha2g2 -host webproxy.codfw.wmnet:8080 -cert /etc/ssl/localcerts/ecc-uni.wikimedia.org.crt -cert /etc/ssl/localcerts/uni.wikimedia.org.crt failed with exit code 1, stderr:
Error querying OCSP responder
139853469415056:error:27076072:OCSP routines:PARSE_HTTP_LINE1:server response error:ocsp_ht.c:314:Code=504,Reason=Gateway Time-out
Traceback (most recent call last):
  File "/usr/local/sbin/update-ocsp", line 285, in <module>
    main()
  File "/usr/local/sbin/update-ocsp", line 278, in main
    certs_fetch_ocsp(out_tempfile, args)
  File "/usr/local/sbin/update-ocsp", line 212, in certs_fetch_ocsp
    (ocsp_err))
Exception: Did not find verification OK in stderr:
Error querying OCSP responder
139853469415056:error:27076072:OCSP routines:PARSE_HTTP_LINE1:server response error:ocsp_ht.c:314:Code=504,Reason=Gateway Time-out

OCSP update failed for /etc/update-ocsp.d/unified.conf
Traceback (most recent call last):
  File "/usr/share/varnish/zerofetch.py", line 141, in <module>
    main()
  File "/usr/share/varnish/zerofetch.py", line 108, in main
    json_data[ztype] = fetcher.zeroconf(ztype)
  File "/usr/share/varnish/zerofetch.py", line 74, in zeroconf
    return self._apiJSON('get', 'zeroportal', {'type': ztype})
  File "/usr/share/varnish/zerofetch.py", line 27, in _apiJSON
    resp = getattr(self.sess, method)(self.baseurl, params=params)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 469, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 457, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 569, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 407, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

Can we change the scripts to avoid the output in case of Connection Errors or are those valuable in root@?

Thanks!

Luca

Event Timeline

elukey created this task.Apr 16 2016, 6:50 AM
elukey renamed this task from cronspam from cpXXXX hosts due to /usr/local/sbin/update-ocsp-all to cronspam from cpXXXX hosts due to update-ocsp-all and zero_fetch.Apr 16 2016, 6:52 AM
elukey updated the task description. (Show Details)
BBlack added a subscriber: BBlack.Apr 16 2016, 11:23 AM

OCSP:

We have an icinga alert for when things go really-wrong (multiple serial failures), and the commentary in modules/tlsproxy/manifests/ocsp.pp talks about the cron timing vs alert timing and all that:

# Generate icinga alert if OCSP files falling out of date due to errors
#
# Note this makes no provision for un-configured stapling at this time, so
# it will generate warnings if you don't clean up old /var/cache/ocsp/
# entries after removing a tlsproxy::ocsp_stapler cert from a server!
#
# The cron above attempts to get fresh data every hour, and a good fresh
# fetch of data has a 12H lifetime with the windows we're seeing from
# GlobalSign today.
#
# The crit/warn values of 29100 and 14700 correspond are "8h5m" and
# "4h5m", so those are basically warning if 4 updates in a row failed
# for a given cert, and critical if 8 updates in a row failed (at which
# point we have 4h left to fix the situation before the validity window
# expires).

While we're here, it might be wise to make those alert levels a little more-sensitive than they are today for now. I think I was worried about spamming icinga alerts too much during initial implementation, but 4h isn't much notice to fix things on CRIT, either. We should perhaps warn at 3 updates in a row, and crit at 5 updates in a row, leaving us a ~7h window to fix the CRIT? Will upload a trivial patch for that.

The only other reason (aside from alerting) to have the cronspam on failure is that we don't otherwise have a log to look back through on why/how each failure happened if we're debugging icinga alerts about the files. We could perhaps add some trivial logging to the current cron (for all output of update-ocsp-all, appending to e.g. /var/log/update-ocsp-all, and with a logrotate config to keep only a week?)

Note also T93927 about general improvements to our OCSP infra. Not worked on in a long time because it hasn't been a pragmatic issue in a long time.

Zerofetcher:

Similar in nature to OCSP, in that it's a cron script that updates files on disk periodically. The notable differences are these:

  • The source of the updates is in-house, so there should be far fewer failures and we can actually debug the failure causes better.
  • Actual changes to the data are ultimately triggered by human administrative action (someone on the Zero team edits some JSON in an internal wiki, zerofetcher syncs down that JSON to all the caches).
  • We have no icinga alert on the freshness (in mtime terms) of the output files, because by design the actual output files are untouched if there was no upstream data change (because the varnish vmod that re-loads the data is mtime-sensitive, so not touching the files on no-op avoids pointless reloads). We could perhaps fix this by having the script touch a different file on every successful run and icinga-alerting on the freshness of that. The baseline timing is every 15 minutes and it's not quite so critical if this goes stale (in operational terms!), so perhaps warn if it goes stale for 4 hours, CRIT if it goes stale for a day?
  • And then once again, we have no built-in logging for this, so the cron output is also effectively a log channel. Adding rotated local logging to it would address that.

TL;DR - TODO List:

  1. Make OCSP icinga alert more aggressive (trivial patch inc right after this reply)
  2. Modify update-ocsp-all script/cron to log all output to /var/log/update-ocsp-all.log (preventing cronspam output to stdio), and add logrotate script for it.
  3. Modify the zerofetch.py script to touch some file on any successful run, perhaps /var/cache/ocsp/.update-success.
  4. Add an icinga alert on stale-ness of that file (OCSP uses an NRPE script that would work, if a little inefficiently)
  5. As with OCSP, send the cron output to some local logfile and rotate it.

Change 283767 had a related patch set uploaded (by BBlack):
OCSP Stapling: make icinga alerts more aggressive

https://gerrit.wikimedia.org/r/283767

Change 283767 merged by BBlack:
OCSP Stapling: make icinga alerts more aggressive

https://gerrit.wikimedia.org/r/283767

Change 291752 had a related patch set uploaded (by Ema):
update-ocsp-all: write output to logfile

https://gerrit.wikimedia.org/r/291752

Change 291752 merged by Ema:
update-ocsp-all: write output to logfile

https://gerrit.wikimedia.org/r/291752

ema moved this task from Triage to In Progress on the Traffic board.Jun 13 2016, 2:18 PM

Change 294063 had a related patch set uploaded (by Ema):
zerofetch.py: track successful executions

https://gerrit.wikimedia.org/r/294063

Change 294072 had a related patch set uploaded (by Ema):
zerofetch icinga check

https://gerrit.wikimedia.org/r/294072

Change 294077 had a related patch set uploaded (by Ema):
zerofetch: write output to logfile

https://gerrit.wikimedia.org/r/294077

Change 294063 merged by Ema:
zerofetch.py: track successful executions

https://gerrit.wikimedia.org/r/294063

Change 294072 merged by Ema:
zerofetch icinga check

https://gerrit.wikimedia.org/r/294072

Change 294077 merged by Ema:
zerofetch: write output to logfile

https://gerrit.wikimedia.org/r/294077

ema closed this task as Resolved.Jun 14 2016, 4:26 PM
ema claimed this task.

Both update-ocsp-all and zerofetch are now logging to syslog instead of cronspamming.