
Logstash SLO excursion on 2023-02-11
Closed, Resolved · Public


Logstash missed its latency SLO in eqiad for the Dec 2022-Feb 2023 reporting quarter, driven by a single event on February 11 where it was overwhelmed by about 30k messages/sec (link courtesy @colewhite ).

Nothing else went wrong in eqiad all quarter, but as a result we ended the quarter with 99.2% of events with acceptable latency, compared to the 99.5% target value. That's not an egregious gap, but it's enough that we need to take a look at it.
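For a sense of scale, the gap can be expressed as error-budget burn. A quick sketch (the 99.2% and 99.5% figures are the ones quoted above; absolute event counts cancel out of the ratio):

```python
# How far over budget was the quarter?
slo_target = 0.995   # target: 99.5% of events with acceptable latency
sli_actual = 0.992   # achieved: 99.2%

error_budget = 1 - slo_target   # fraction of events allowed to be slow
budget_spent = 1 - sli_actual   # fraction that actually was slow

burn_ratio = budget_spent / error_budget
print(f"error budget consumed: {burn_ratio:.1f}x")  # 1.6x the quarterly budget
```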

We'll need to dig into that incident a little and understand what went wrong and how it can be prevented; our usual incident report is probably the right format. Next we'll either

  • take on enough engineering work that we can be reasonably confident we won't miss the SLO if the same situation arises,
  • or if that's impossible, revise the SLO to an achievable one, maybe by including a maximum messages/sec that's consistent with normal high-load operations -- any more throughput than that and the SLO is "best effort,"
  • or some combination of both.

I'm marking this High priority since it represents an SLO violation, but it's reasonable to downgrade it to Medium once we have some rough sense of what we're doing -- we just need to take an initial look promptly enough to be an input for Q4 planning if necessary. Assigning to @lmata as the starting point, but o11y folks will discuss at their team meeting tomorrow.

Event Timeline

RLazarus triaged this task as High priority.Mar 7 2023, 5:45 PM
RLazarus created this task.

Thanks! I'll discuss this with the team and circle back with:

  • Confirmation of our assumptions on capacity and metrics
  • Next steps to remediate whatever affected the SLO or a proposed adjustment to the SLO :-)

The team chatted about this during our weekly meeting today. We're still investigating; this might take a couple of weeks to get into the queue, as we have some short-term sprint-week commitments to work through.

We will have a determination before the end of the quarter: either the SLO needs adjusting, or we have some work to do to address whatever happened. We've therefore pre-emptively penciled in a Q4 objective placeholder to handle the result of this investigation either way.

Perfect, thank you! I started the incident report and filled in what we know so far (as far as I know, anyway).

Hi @RLazarus

Apologies for the radio silence, I'm now circling back to this, and as I review, I have one or two questions :D.

Will we file a retroactive incident report for every SLO breach? If so, does that apply to internal as well as external services? I ask because it's my understanding that we don't usually file an incident report for internal service outages that have no visible user impact. Do we want to group this new category of incidents into the same bucket as the others? And finally, should we amend the existing incident/incident-status docs to include SLOs as part of the process and start aligning some of those efforts?

I apologize if these would have been better asked when we met; I'm just now wrapping my thoughts around this event and process as we go through the work and conversations. Thanks!

No, it's a good question! I think like any other incident, it's worth writing a report when there's something we can learn. In this case, I think we stand to get two things out of it:

  • One is the discussion of "how did this fail, and what can we do about it" -- in this case, the mechanics of a producer spraying out too much log output, Logstash getting outpaced, and how events unfolded from there. Distilling action items from a thoughtful writeup of the problem is the best systematic way to prevent recurrence, and reading incident reports is one of the best ways for SREs to understand complex systems and how they behave in weird situations.
  • The other is a discussion around detection and alerting: we shouldn't be able to burn through the SLO's entire error budget without knowing about it, as happened here, so there are more action items to be had there. (As you know, we've already been talking about some approaches to that.)

All of that analysis is really well suited to the process we already have, and it makes for an incident report with a lot of substance. That'll be true for some SLO excursions and not others, and that's okay, just like we don't write up everything else that goes wrong in production. Being directly visible to users is definitely one important factor, it just isn't the only one.

And, yeah, as we ramp up SLO coverage for the services we maintain, they'll take a bigger role in incident triage -- we started that the last time we reworked the incident report template, by including "affected SLOs" in the header, but that was never going to be the end of it. One big piece is going to be along the same lines as the alerting conversation: if every service is within SLO, then by definition there's no incident. But broader SLO coverage is still the blocker there: we can't say "SLOs are a comprehensive list of what we care about" until it's true.

Does that help? Happy to meet about this too.

Capturing some detail from the meeting today:

This particular incident gave responders less than 5 minutes to research, write a specialized mitigation rule, and deploy the mitigation to the entire MediaWiki appserver fleet in order to prevent missing the SLO. A less targeted mitigation might have met the SLO if implemented in time, but also would have created a significant loss of log and metrics data for the time period.

There are two problems:

  1. The team wasn't aware that this event led to missing the SLO. We should be alerted in some way when the SLO has been missed or is on track to be missed.
  2. The SLO is impossible to meet with current tooling should a similar event happen again, and recurrence is likely. Upstream developments present an opportunity to re-tool the pre-Kafka part of the stack; this event demonstrates the need.

In the short term, this SLO should probably be made achievable. The long term fix is a large project that we ought to prioritize.

lmata updated Other Assignee, added: lmata.
lmata added a subscriber: herron.

We've made this item and subsequent follow-up an OKR for Q4, handing it off to @herron

The current Logstash SLO appears to measure the number of events encountered in a lagged state. It affords us a budget of around 152 million events per quarter that arrive while the rate exceeds 20k events/sec. Log production can exceed that capacity by orders of magnitude.

In the last 90 days, we've ingested around 30 billion logs; the 0.5% error budget on that is around 152 million logs. We estimate our maximum throughput at 10x normal load: 20k events/sec. To exhaust the SLO budget in 30 minutes, the appservers alone (mw*, about 340 hosts) need only emit 250 events/sec each.

I propose we instead measure the amount of time spent in lagged state.
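As an illustration of what a time-based SLI would count (a hypothetical function with made-up sample data, not the production query):

```python
# Time-based SLI sketch: count minutes spent in a lagged state,
# rather than counting individual lagged events.
def lagged_minutes(lag_samples, threshold):
    """lag_samples: one lag reading (seconds) per minute;
    threshold: maximum acceptable lag in seconds."""
    return sum(1 for lag in lag_samples if lag > threshold)

# Hypothetical window: 90 one-minute samples, 3 of them lagged beyond 60s.
samples = [5] * 87 + [300, 450, 120]
bad = lagged_minutes(samples, threshold=60)
availability = 1 - bad / len(samples)
print(f"{bad} lagged minutes -> {availability:.2%} time-based SLI")
```

A burst that produces millions of lagged events in a few minutes would cost only those few minutes of budget, rather than the whole quarter's event allowance.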

herron removed herron as the assignee of this task.Jun 5 2023, 5:00 PM

Lag can indicate problems with Logstash but as we have seen the past two quarters, it usually indicates some log producer is generating lots of logs very quickly.

To meet the SLO as it is currently defined, we could:

  • Implement load-shedding
    • Risks hampering post-incident analysis due to enforced data loss
  • Selectively filter and rate-limit the log stream prior to reaching Kafka
    • Current tooling cannot do this
    • Implementing this without metrics could hide problems
  • Increase ingest capacity by some orders of magnitude
    • Associated cost of scaling up and/or out
    • Constantly moving target
  • Adjust the SLO to meet reality
    • 99.5%->80%
    • Constantly moving target
  • Define a different SLO

I propose we define a different SLO that measures the availability of Logstash or the user-facing components.

  1. How many minutes total did Logstash write less than 1 log per minute? (SLO: 99.95%): sum_over_time((increase(sum by (site) (logstash_node_pipeline_events_filtered_total)[2m:1m]) < bool 1)[90d:1m])
    • Indicates Logstash is not processing events
    • Advantages
      • Actionable
      • Generally indicative of problems with Logstash
    • Disadvantages
      • Cannot handle the case of no logs generated anywhere in the infrastructure (unlikely)
  2. How many minutes total did OpenSearch report there was no data for the last 60 seconds? (SLO: 99.95%): sum_over_time((log_all_hits < bool 1)[90d:1m])
    • Indicates events timestamped in the last 60 seconds aren't available to view in OpenSearch Dashboards
    • Advantages
      • Actionable
      • Generally indicative of problems with Logstash
      • Definitely a degraded experience for a user looking for the most recent logs
    • Disadvantages
      • Cannot handle the case of no logs generated anywhere in the infrastructure (unlikely)
      • Currently only measures legacy indexes (ecs, w3creportingapi, and alerts excluded. Would have to investigate how to include them.)
      • The metric selects events based on timestamp as set by the log event. For example, a log emitted at 00:20Z could have a timestamp set by the producer to any time in the past or future -- in this case querying @timestamp > now-1m would miss events with a pre-defined timestamp outside of the selection range.
  3. Open to hearing other options.
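To make the 99.95% time-based target concrete, here is a rough Python analogue of option 1 (hypothetical names and sample data; the real SLI is the PromQL query above):

```python
# Analogue of the proposed availability SLI: flag each minute in which
# Logstash processed fewer than 1 event, then total the bad minutes.
def unavailable_minutes(events_per_minute):
    """events_per_minute: events processed in each one-minute bucket."""
    return sum(1 for n in events_per_minute if n < 1)

# Budget implied by a 99.95% target over a 90-day window.
minutes_in_90d = 90 * 24 * 60
allowed_bad = minutes_in_90d * (1 - 0.9995)
print(f"90-day budget: {allowed_bad:.0f} minutes of unavailability")  # ~65 min

# Example: a window with two silent minutes.
bad = unavailable_minutes([120, 0, 98, 0, 200])
print(f"{bad} unavailable minutes in this window")
```

Note how differently this budget spends: a log flood that Logstash keeps up with costs nothing, while roughly an hour of total processing stalls per quarter exhausts it.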

> I propose we define a different SLO that measures the availability of Logstash or the user-facing components.

This seemed like a likely outcome from the start; IMO, a metric that is less affected by upstream or external factors and more indicative of internal state and function is ideal. Thank you for putting this proposal together.

Change 934453 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/grafana-grizzly@master] Logstash: implement availability SLO

Change 934453 merged by Cwhite:

[operations/grafana-grizzly@master] Logstash: implement availability SLO

We have updated the SLI to measure availability. The changes are applied to the dashboard and the SLO worksheet.