logstash_checker.py is invoked by scap after a deployment to the canaries to query logstash for 'too many' errors/exceptions.
However, this check doesn't save you from a bad deploy if logstash itself is having a stomachache about indexing, or is backlogged because of a flood of spam on its input.
We've deployed while logstash-blind in the past: at least once unknowingly and once knowingly but with precautions.
Ideally scap would be aware of when this is occurring and warn the user. There are a few possible signals that could be used:
- Query the kafka_burrow_partition_lag{exported_cluster="logging-eqiad"} metric from Prometheus/ops (grafana link to an interval with issues). The unit here is number of events; more than ~1000 is cause for concern. A rough query sketch follows this list.
  - Pros: fast, low-noise
  - Cons: requires encoding some 'implementation details' of our logstash deployment
- Issue a second query to logstash for the total count of events, error/exception or not, over the last N minutes, and make sure it is above a certain absolute threshold. A sketch of this check also follows this list.
  - Pros: more end-to-end & more general
  - Cons: likely more difficult to set a 'good' threshold
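
For the first signal, a minimal sketch of what scap could check, assuming a Prometheus/ops instance reachable over the standard HTTP query API (the hostname below is a placeholder, not the real one) and using the ~1000-event threshold mentioned above:

```python
"""Sketch: warn if consumer lag on the logging Kafka cluster looks unhealthy."""
import requests

# Placeholder endpoint; the real Prometheus/ops query URL would go here.
PROMETHEUS_URL = 'http://prometheus.example.org/ops/api/v1/query'
# Per the note above, lag beyond ~1000 events is a cause for concern.
LAG_THRESHOLD = 1000


def logstash_kafka_lag_ok(timeout=10):
    """Return False if any partition of the logging cluster is lagging badly."""
    query = 'kafka_burrow_partition_lag{exported_cluster="logging-eqiad"}'
    resp = requests.get(PROMETHEUS_URL, params={'query': query}, timeout=timeout)
    resp.raise_for_status()
    results = resp.json()['data']['result']
    # Each result's value is a [timestamp, "<lag as string>"] pair.
    worst = max((float(r['value'][1]) for r in results), default=0.0)
    return worst <= LAG_THRESHOLD
```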
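For the second signal, a similar sketch. It assumes the logstash Elasticsearch backend is reachable at a placeholder URL, uses the conventional logstash-* index pattern, and that we can pick an absolute minimum event count (the hard part, per the cons above; the value below is purely illustrative):

```python
"""Sketch: sanity-check that logstash is still ingesting at a plausible rate."""
import requests

# Placeholder host; the real logstash/Elasticsearch endpoint would go here.
LOGSTASH_URL = 'http://logstash.example.org:9200'
WINDOW_MINUTES = 10
# Illustrative threshold only; choosing a 'good' value is the open question.
MIN_EXPECTED_EVENTS = 10000


def logstash_ingest_ok(timeout=10):
    """Return False if fewer events than expected were indexed recently."""
    query = {'query': {'range': {'@timestamp': {'gte': 'now-%dm' % WINDOW_MINUTES}}}}
    resp = requests.post(
        '%s/logstash-*/_count' % LOGSTASH_URL, json=query, timeout=timeout)
    resp.raise_for_status()
    return resp.json()['count'] >= MIN_EXPECTED_EVENTS
```

Either check could run alongside the existing error/exception query in logstash_checker.py, so a failing (or suspiciously quiet) logstash turns into a warning rather than a silent green light.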