
Improve Logstash's rate-limiting capabilities
Open, Medium, Public

Description

The generic Logstash pipeline rate-limiting configuration (throttle plugin) is insufficient for a multi-stream, multi-producer pipeline with no required fields.

As of this writing:

  • The most commonly available field is service.type.
  • There is no guarantee that message, log.level, log.syslog, or host.name (et al.) will be present in any given event.
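To illustrate why this matters, here is a sketch of a typical throttle filter keyed on service.type (field names and limits here are hypothetical, not our production configuration). When the key field is missing, the sprintf reference does not resolve, so every keyless event collapses into a single literal "%{service.type}" bucket:

```
filter {
  throttle {
    # Events sharing the same key are counted together. If
    # service.type is absent, the reference below stays literal
    # and all such events share one bucket.
    key => "%{service.type}"
    after_count => 1000    # tag everything past 1000 events...
    period => 60           # ...per 60-second window
    max_age => 120
    add_tag => "throttled"
  }

  if "throttled" in [tags] {
    drop { }
  }
}
```

A multi-stream, multi-producer pipeline with no required fields has no single key that works for every event, which is the core limitation.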

Event Timeline

A solution to this could have mitigated an incident today.

ES, over GELF, was spamming "Setting a negative [weight] in Function Score Query is deprecated and will throw an error in the next major version" at about 3k writes per minute. This was not caught by the current throttling configuration.

A solution to this could have lessened the impact of AQS rapidly logging that it could not talk to Cassandra in codfw while data was being restored. The ecs-test index for 2022.24 hit the maximum number of documents per index.

A solution to this could have lessened the impact of a bug logging PHP Notice: Undefined property: Wikimedia\PSquare::$increments at a high rate leading to high consumer lag.

Change 910077 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: webrequest ecs: move backend to label

https://gerrit.wikimedia.org/r/910077

Change 910077 merged by Cwhite:

[operations/puppet@production] logstash: webrequest ecs: move backend to label

https://gerrit.wikimedia.org/r/910077

Follow-up from a chat/discussion: we have more operational experience with Benthos now, and it performs well as a Kafka consumer. It can be used as a consumer in front of Logstash for throttling/sampling purposes; that would relieve Logstash of the heavyweight sampling/throttling work, leaving only transformations and ingestion into indexing. Something like kafka <-> benthos <-> logstash http input. When Logstash is backlogged, its HTTP input is supposed to return a 4xx, which in turn will lead Benthos to apply backpressure and stop consuming from Kafka.
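A minimal Benthos sketch of that topology might look like the following (broker addresses, topic names, ports, and limits are illustrative assumptions, not a working production config):

```yaml
input:
  kafka:
    addresses: [ "kafka1001:9092" ]      # hypothetical broker
    topics: [ "logstash" ]               # hypothetical topic
    consumer_group: "benthos-ratelimit"

pipeline:
  processors:
    # Block until the shared rate limit allows the message through.
    - rate_limit:
        resource: log_flood

rate_limit_resources:
  - label: log_flood
    local:
      count: 3000     # at most 3k messages...
      interval: 1m    # ...per minute

output:
  http_client:
    url: "http://localhost:12201"        # Logstash http input (hypothetical port)
    verb: POST
    # Failed requests are retried with backoff, which propagates
    # backpressure upstream and pauses Kafka consumption.
```

The key property is that backpressure is end-to-end: a slow or erroring Logstash HTTP input stalls the Benthos output, which stalls the processors, which stops the Kafka consumer from fetching.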

colewhite renamed this task from Improve Logstash's throttling capabilities to Improve Logstash's rate-limiting capabilities. Jul 25 2023, 10:57 PM
colewhite updated the task description.
lmata triaged this task as High priority. Dec 5 2023, 3:19 PM
lmata moved this task from Backlog to Prioritized on the Observability-Logging board.
lmata subscribed.

Raising priority based on recent conversations with the team and the intent to address this in the near future as part of risk mitigations to the logging pipeline.

> Follow-up from a chat/discussion: we have more operational experience with Benthos now, and it performs well as a Kafka consumer. It can be used as a consumer in front of Logstash for throttling/sampling purposes; that would relieve Logstash of the heavyweight sampling/throttling work, leaving only transformations and ingestion into indexing. Something like kafka <-> benthos <-> logstash http input. When Logstash is backlogged, its HTTP input is supposed to return a 4xx, which in turn will lead Benthos to apply backpressure and stop consuming from Kafka.

I think this is probably our most actionable path forward.

With that said, could we also expand the task description to outline our goal(s) for this work more clearly? Offhand, I think it would be helpful to have some examples of the rate-limiting models/approaches that would have been useful in mitigating recent issues.

> Follow-up from a chat/discussion: we have more operational experience with Benthos now, and it performs well as a Kafka consumer. It can be used as a consumer in front of Logstash for throttling/sampling purposes; that would relieve Logstash of the heavyweight sampling/throttling work, leaving only transformations and ingestion into indexing. Something like kafka <-> benthos <-> logstash http input. When Logstash is backlogged, its HTTP input is supposed to return a 4xx, which in turn will lead Benthos to apply backpressure and stop consuming from Kafka.

I don't think this is possible without introducing configuration duplication.

The goal of this task is to expand the coverage of the throttling filters to handle other log structures. If we moved the throttler before the normalization filters, we would have to change both the throttler and the normalization filters whenever a log structure changes. Ideally, we would only change the normalization filters, and the throttler would do the right thing on its own.
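The ordering argument above can be sketched as a Logstash filter chain (the rename mapping is a hypothetical example, not our actual normalization rules):

```
filter {
  # Normalization first: map producer-specific fields onto a
  # common schema (hypothetical mapping for illustration).
  mutate {
    rename => { "[syslog][program]" => "[service][type]" }
  }

  # Throttle after normalization: the key only needs to know the
  # normalized field, not every producer's native structure.
  throttle {
    key => "%{[service][type]}"
    after_count => 1000
    period => 60
    add_tag => "throttled"
  }
}
```

With this ordering, supporting a new log structure only requires a new normalization rule; the throttler is unchanged.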

Pre-normalization throttling is important too, but I do not think these throttlers are the same. Pre-normalization throttling is covered by T331879: Investigate methods to rate-limit/discard excessive log messages closer to the producer.

colewhite lowered the priority of this task from High to Medium. Wed, Apr 3, 10:45 PM