
Improve Logstash's rate-limiting capabilities
Open, Medium, Public

Description

The generic Logstash pipeline rate-limiting configuration (throttle plugin) is insufficient for a multi-stream, multi-producer pipeline with no required fields.

As of this writing:

  • The most commonly available field is service.type.
  • There is no guarantee that message, log.level, log.syslog, or host.name (et al.) will be present in any given event.
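To illustrate why this matters, here is a sketch of a typical throttle filter keyed on service.type (field names and limits here are hypothetical, not our production configuration). When the key field is missing, the sprintf reference does not resolve, so every keyless event collapses into a single literal "%{service.type}" bucket:

```
filter {
  throttle {
    # Events sharing the same key are counted together. If
    # service.type is absent, the reference below stays literal
    # and all such events share one bucket.
    key => "%{service.type}"
    after_count => 1000    # tag everything past 1000 events...
    period => 60           # ...per 60-second window
    max_age => 120
    add_tag => "throttled"
  }

  if "throttled" in [tags] {
    drop { }
  }
}
```

A multi-stream, multi-producer pipeline with no required fields has no single key that works for every event, which is the core limitation.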

Event Timeline

A solution to this could have mitigated an incident today.

ES, over GELF, was spamming "Setting a negative [weight] in Function Score Query is deprecated and will throw an error in the next major version" at about 3k writes per minute. This was not caught by the current throttling configuration.

A solution to this could have lessened the impact of AQS rapidly logging that it could not talk to Cassandra in codfw while data was being restored. The ecs-test index for 2022.24 hit the maximum number of documents per index.

A solution to this could have lessened the impact of a bug logging PHP Notice: Undefined property: Wikimedia\PSquare::$increments at a high rate leading to high consumer lag.

Change 910077 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: webrequest ecs: move backend to label

https://gerrit.wikimedia.org/r/910077

Change 910077 merged by Cwhite:

[operations/puppet@production] logstash: webrequest ecs: move backend to label

https://gerrit.wikimedia.org/r/910077

Follow-up from a chat/discussion: we have more operational experience with Benthos now, and it performs well as a Kafka consumer. It can be used as a consumer in front of Logstash for throttling/sampling purposes; that would relieve Logstash of the heavyweight sampling/throttling work, leaving only transformations and ingestion into indexing. Something like kafka <-> benthos <-> logstash http input. When Logstash is backlogged, its HTTP input is supposed to return a 4xx, which in turn will lead Benthos to apply backpressure and stop consuming from Kafka.
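A minimal Benthos sketch of that topology might look like the following (broker addresses, topic names, ports, and limits are illustrative assumptions, not a working production config):

```yaml
input:
  kafka:
    addresses: [ "kafka1001:9092" ]      # hypothetical broker
    topics: [ "logstash" ]               # hypothetical topic
    consumer_group: "benthos-ratelimit"

pipeline:
  processors:
    # Block until the shared rate limit allows the message through.
    - rate_limit:
        resource: log_flood

rate_limit_resources:
  - label: log_flood
    local:
      count: 3000     # at most 3k messages...
      interval: 1m    # ...per minute

output:
  http_client:
    url: "http://localhost:12201"        # Logstash http input (hypothetical port)
    verb: POST
    # Failed requests are retried with backoff, which propagates
    # backpressure upstream and pauses Kafka consumption.
```

The key property is that backpressure is end-to-end: a slow or erroring Logstash HTTP input stalls the Benthos output, which stalls the processors, which stops the Kafka consumer from fetching.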

colewhite renamed this task from Improve Logstash's throttling capabilities to Improve Logstash's rate-limiting capabilities. Jul 25 2023, 10:57 PM
colewhite updated the task description.
lmata triaged this task as High priority. Dec 5 2023, 3:19 PM
lmata moved this task from Backlog to Prioritized on the Observability-Logging board.
lmata subscribed.

Raising priority based on recent conversations with the team and the intent to address this in the near future as part of risk mitigations to the logging pipeline.

> Follow-up from a chat/discussion: we have more operational experience with Benthos now, and it performs well as a Kafka consumer. It can be used as a consumer in front of Logstash for throttling/sampling purposes; that would relieve Logstash of the heavyweight sampling/throttling work, leaving only transformations and ingestion into indexing. Something like kafka <-> benthos <-> logstash http input. When Logstash is backlogged, its HTTP input is supposed to return a 4xx, which in turn will lead Benthos to apply backpressure and stop consuming from Kafka.

I think this is probably our most actionable path forward.

With that said, could we also expand the task description to outline our goal(s) for this work more clearly? Offhand, I think it would be helpful to have some examples of the rate-limiting models/approaches that would have been useful in mitigating recent issues.

> Follow-up from a chat/discussion: we have more operational experience with Benthos now, and it performs well as a Kafka consumer. It can be used as a consumer in front of Logstash for throttling/sampling purposes; that would relieve Logstash of the heavyweight sampling/throttling work, leaving only transformations and ingestion into indexing. Something like kafka <-> benthos <-> logstash http input. When Logstash is backlogged, its HTTP input is supposed to return a 4xx, which in turn will lead Benthos to apply backpressure and stop consuming from Kafka.

I don't think this is possible without introducing configuration duplication.

The goal of this task is to expand the coverage of the throttling filters to handle other log structures. If we moved the throttler before the normalization filters, we would have to change both the throttler and the normalization filters whenever a log structure changes. Ideally, we would only change the normalization filters, and the throttler would do the right thing on its own.
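The ordering argument above can be sketched as a Logstash filter chain (the rename mapping is a hypothetical example, not our actual normalization rules):

```
filter {
  # Normalization first: map producer-specific fields onto a
  # common schema (hypothetical mapping for illustration).
  mutate {
    rename => { "[syslog][program]" => "[service][type]" }
  }

  # Throttle after normalization: the key only needs to know the
  # normalized field, not every producer's native structure.
  throttle {
    key => "%{[service][type]}"
    after_count => 1000
    period => 60
    add_tag => "throttled"
  }
}
```

With this ordering, supporting a new log structure only requires a new normalization rule; the throttler is unchanged.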

Pre-normalization throttling is important too, but I do not think these throttlers are the same. Pre-normalization throttling is covered by T331879: Investigate methods to rate-limit/discard excessive log messages closer to the producer.

colewhite lowered the priority of this task from High to Medium. Wed, Apr 3, 10:45 PM