Description

The MediaWiki canary check performed by scap ensures that the error rate hasn't increased significantly (read: 10x) since the last deployment. The problem is that when a successful deployment (i.e., one that hits all of production) causes a significant increase in the error rate, subsequent deployments' error rates are judged against a bad baseline.
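Roughly speaking, the check compares the current canary error rate against a baseline taken before the deploy. A minimal sketch of that comparison (illustrative only, not scap's actual code; the names and numbers are made up) shows how an inflated baseline defeats the check:

```python
# Illustrative sketch only -- not scap's actual implementation.
# The check passes as long as the current error rate stays below
# some multiple (here 10x) of the baseline rate.

THRESHOLD = 10  # "significant" increase factor


def canary_ok(baseline_rate: float, current_rate: float) -> bool:
    """Return True if the error rate has not increased ~10x over baseline."""
    return current_rate < baseline_rate * THRESHOLD


# Healthy baseline: a real regression is caught.
assert not canary_ok(baseline_rate=2.0, current_rate=40.0)

# Bad baseline: a previous deploy already pushed errors to 40/min.
# The next deploy is judged against that inflated number, so an
# error rate that is still far too high sails through the check.
assert canary_ok(baseline_rate=40.0, current_rate=60.0)
```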
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Scap canary: cache last good deploy time | operations/puppet | production | +133 -4
Related Objects
Mentioned In
- rMSCAa70af88cc17b: Move logstash checker code into scap
- T212147: Allow scap sync to deploy gradually
- T121597: Implement MediaWiki pre-promote checks
- T183952: Investigate deployment that caused high error-rate and was prevented from going past canaries by Scap

Mentioned Here
- T121597: Implement MediaWiki pre-promote checks
Event Timeline
@mmodell A dynamic query for 24h ago seems somewhat fragile given that 1) traffic volume has strong seasonality (varying up to 4x on a regular basis), and 2) 24h ago could also coincidentally match a bad deploy?
Alternative idea: Default to the same behaviour as now, but conditionally override to a specific timestamp if the last deploy was a bad deploy. E.g. store a timestamp somewhere when a deploy goes bad, and if that value is set, compare against logs from before that time instead. If it isn't set (or has been cleared by a subsequent good deploy to all of prod), the check would use the recent window as usual.
I don't know if that is feasible, given we'd need somewhere to store that value, but maybe we've already got a suitable place?
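As a rough sketch of the idea (the marker file path and helper names below are made up for illustration; scap would need its own place to persist this):

```python
# Illustrative sketch of the "override baseline on bad deploy" idea.
# The marker path and helpers are hypothetical, not part of scap.
import os
import time

BAD_DEPLOY_MARKER = "/var/lib/scap/last_bad_deploy_timestamp"  # hypothetical


def record_bad_deploy() -> None:
    """Remember when a bad full-production deploy happened."""
    with open(BAD_DEPLOY_MARKER, "w") as f:
        f.write(str(int(time.time())))


def clear_bad_deploy() -> None:
    """A subsequent good deploy to all of prod clears the override."""
    if os.path.exists(BAD_DEPLOY_MARKER):
        os.remove(BAD_DEPLOY_MARKER)


def baseline_timestamp(default_window_seconds: int = 3600) -> int:
    """Pick the baseline timestamp for the canary comparison.

    If a bad deploy is on record, compare against logs from just before
    it; otherwise fall back to the recent window, as the check does now.
    """
    if os.path.exists(BAD_DEPLOY_MARKER):
        with open(BAD_DEPLOY_MARKER) as f:
            return int(f.read().strip())
    return int(time.time()) - default_window_seconds
```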
Change 403574 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Scap canary: cache last good deploy time
@thcipriani @greg In light of yesterday's incident, gentle reminder for T121597.
The canary/logstash check can catch a wide range of errors from a wide range of sources, and it has been impressive for that. I imagine that its way of measuring relative change will help a lot with ensuring we keep lowering our logspam from MediaWiki and don't regress. And we're doing pretty well there.
On the other hand, even after a dozen fatal incidents all leading to incremental improvements of the canary/logstash checker, it still doesn't reliably catch the big fatals, which don't require a wide net to catch. A simple version of the pre-promote check would go a long way toward preventing incidents. Such checks are easy to implement, and easy to get right.
I've updated T121597 with the latest details. I'm happy to help if needed (plan, implementation, code review, etc.). Let me know :)
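For what it's worth, a bare-bones version could be as small as fetching a known page from each canary and failing the deploy on anything other than a healthy response. The hostname and fatal-error marker below are placeholders, not the actual T121597 design:

```python
# Placeholder sketch of a minimal pre-promote check; the hostname and
# the error marker are illustrative, not the T121597 implementation.
import urllib.request

CANARIES = ["mwdebug1001.example.org"]  # hypothetical canary hosts


def canary_serves_ok(host: str, path: str = "/wiki/Main_Page") -> bool:
    """Fail fast if a canary returns a non-200 or a fatal-error page."""
    try:
        with urllib.request.urlopen(f"https://{host}{path}", timeout=5) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Fatal exception" not in body
    except Exception:
        return False


def pre_promote_check() -> bool:
    """Promote only if every canary serves the test page cleanly."""
    return all(canary_serves_ok(host) for host in CANARIES)
```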
Change 403574 abandoned by Thcipriani:
Scap canary: cache last good deploy time
Reason:
Moving most of this logic into scap; will add a command-line flag for a baseline timestamp to logstash_checker
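A rough sketch of what such a flag could look like (the option name, help text, and default window are guesses, not the merged change):

```python
# Hypothetical sketch of a baseline-timestamp flag for logstash_checker;
# the option name and defaults are illustrative, not the actual patch.
import argparse
import time

parser = argparse.ArgumentParser(description="Check canary error rates in logstash")
parser.add_argument(
    "--baseline-timestamp",
    type=int,
    default=None,
    help="Unix timestamp to use as the error-rate baseline instead of "
         "the window immediately before this deploy",
)
args = parser.parse_args()

# Fall back to a recent window if no override was supplied.
baseline = args.baseline_timestamp or int(time.time()) - 3600
```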
dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/358
Move logstash checker code into scap
thcipriani merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/358
Move logstash checker code into scap
As of scap 4.89.0, the canary error rate threshold is set by canary_threshold, which defaults to 10. Based on analysis of the last 90 days of canary error rates, we expect that value to be suitable for production. You can run scap analyze-logstash to get a fresh recommendation.