
scap's logstash_checker.py is blissfully unaware of any logstash indexing latency
Open, Medium, Public

Description

logstash_checker.py is invoked by scap to run a query against logstash looking for 'too many' errors/exceptions after a deployment has been made to canaries.

However, this doesn't save you from a bad deploy if logstash itself is having a stomachache about indexing, or is backlogged because of a flood of spam on its input.

We've deployed while logstash-blind in the past: at least once unknowingly and once knowingly but with precautions.

Ideally scap would be aware of when this is occurring and warn the user. There are a few possible signals that could be used:

  1. Query the kafka_burrow_partition_lag{exported_cluster="logging-eqiad"} metrics from Prometheus/ops (grafana link to an interval with issues). The unit here is number of events; a lag of more than ~1000 is a cause for concern
    • Pros: fast, low-noise
    • Cons: requires encoding some 'implementation details' of our logstash deployment
  2. Issue a second query to logstash for the count of all events, error/exception or not, in the last N minutes, and make sure that count is above a certain absolute threshold
    • Pros: more end-to-end & more general
    • Cons: likely more difficult to set a 'good' threshold
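
A rough sketch of how the two signals could be combined, not a patch against logstash_checker.py. The metric selector and the ~1000 threshold come from the description above; everything else is an assumption made for illustration: the helper names, wrapping the selector in max(), the "@timestamp" field, and min_events, which has no default because no 'good' value is known yet. The only external APIs assumed are the standard Prometheus HTTP API (/api/v1/query) and an Elasticsearch-style _count request body.

```python
import json
import urllib.parse
import urllib.request

# Metric selector and the ~1000-event threshold are taken from the task description.
LAG_QUERY = 'max(kafka_burrow_partition_lag{exported_cluster="logging-eqiad"})'
LAG_THRESHOLD = 1000


def max_lag_from_response(payload):
    """Return the largest sample in a Prometheus instant-query response."""
    samples = payload["data"]["result"]
    if not samples:
        raise ValueError("no lag samples returned; cannot judge indexing health")
    # Each sample looks like {"metric": {...}, "value": [<timestamp>, "<value>"]}.
    return max(float(sample["value"][1]) for sample in samples)


def fetch_max_lag(prometheus_url, query=LAG_QUERY, timeout=10):
    """Signal 1 (hypothetical helper): ask Prometheus for the current lag."""
    params = urllib.parse.urlencode({"query": query})
    url = prometheus_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return max_lag_from_response(json.load(resp))


def recent_count_body(minutes):
    """Signal 2: a _count request body matching every event in the last N minutes.

    Assumes events carry the conventional "@timestamp" field.
    """
    return {"query": {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}}}


def indexing_warnings(max_lag, recent_events, min_events, lag_threshold=LAG_THRESHOLD):
    """Return human-readable warnings; an empty list means both signals look healthy."""
    warnings = []
    if max_lag > lag_threshold:
        warnings.append(
            "logstash consumer lag is %d events (threshold %d)" % (max_lag, lag_threshold)
        )
    if recent_events < min_events:
        warnings.append(
            "only %d events indexed recently (expected at least %d)"
            % (recent_events, min_events)
        )
    return warnings
```

With something like this, scap could print any returned warnings before trusting the canary check, for example `indexing_warnings(fetch_max_lag(url), count, min_events=...)`, where `url` and `count` would come from scap's configuration and from a second logstash query respectively.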

Event Timeline

CDanis created this task. Jun 11 2020, 8:12 PM
Restricted Application added a subscriber: Aklapper. Jun 11 2020, 8:12 PM
jbond triaged this task as Medium priority. Jun 12 2020, 9:47 AM

@thcipriani: A good first task is a self-contained, non-controversial task with a clear approach and links to documentation and the codebase (see the project description). Given the current task description I'm removing the good first task tag, as there are several possible approaches here.