
Logstash is overwhelmed
Open, MediumPublic

Description

logstash-k8s indexes have nearly quadrupled in size over the last 60 days, leading to curator forcemerge failures and running out of space on the SSD hosts.

Screenshot from 2025-03-27 16-33-07.png (370×1 px, 67 KB)

X-Axis is day, Y-Axis is in GB

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +36 -0
operations/puppet | production | +2 -2
operations/puppet | production | +5 -0
operations/puppet | production | +47 -0
operations/puppet | production | +8 -0
operations/puppet | production | +5 -0
operations/puppet | production | +61 -0
operations/puppet | production | +36 -2
operations/puppet | production | +8 -7
operations/puppet | production | +50 -0
operations/puppet | production | +78 -1
operations/puppet | production | +3 -6
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +7 -0
operations/puppet | production | +9 -2
operations/puppet | production | +9 -0
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +8 -0
operations/puppet | production | +55 -125
operations/puppet | production | +20 -17
operations/deployment-charts | master | +3 -0

Event Timeline


First suspect is kubernetes.container_name:"production-ratelimit" generating debug logs at around 1500/sec.

colewhite updated the task description.

Change #1131777 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] api-gateway: Lower ratelimit log level

https://gerrit.wikimedia.org/r/1131777

Change #1131777 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: Lower ratelimit log level

https://gerrit.wikimedia.org/r/1131777

colewhite changed the task status from In Progress to Open. Mar 27 2025, 5:10 PM
colewhite lowered the priority of this task from High to Medium.

Log volume is down quite a bit since raising the production-ratelimit log level filter. Will check back in on this in a few days to re-evaluate in case there are other applications doing the same thing.

Checking back today: istio-ingressgateway is logging about 1000 logs/sec these days.

istio-ingressgateway was last discussed in July 2023, during its early stages. I think it's time for another look, as istio-ingressgateway is solidly half of all logs we get via kubernetes.

@elukey would it be reasonable to reduce the number of istio-ingressgateway logs stored, either via sampling or by reducing retention to less than 90d? Either way is reasonably easy to accomplish.

Spoke too soon about istio - they're not in the k8s indexes!

The biggest contributor to the logstash-k8s indexes now is revertrisk-language-agnostic-predictor-default-00025, at more than half of all logs from k8s: ~378 logs/sec (~32 million per day).

We'll check back in next week to see whether the overall index size is back within a reasonable range.
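As a sanity check on those numbers, a sustained per-second rate converts to daily volume by multiplying by 86,400 seconds:

```python
# Convert a sustained per-second log rate to daily volume.
RATE_PER_SEC = 378          # observed rate for the revertrisk predictor
SECONDS_PER_DAY = 86_400

daily = RATE_PER_SEC * SECONDS_PER_DAY
print(f"{daily:,} logs/day")  # 32,659,200, i.e. ~32 million per day
```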

Change #1135740 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move err field mitigation to effective place

https://gerrit.wikimedia.org/r/1135740

Change #1135748 had a related patch set uploaded (by Cwhite; author: Filippo Giunchedi):

[operations/puppet@production] logstash: temporarily disable curator forcemerge

https://gerrit.wikimedia.org/r/1135748

Change #1135748 merged by Cwhite:

[operations/puppet@production] logstash: temporarily disable curator forcemerge

https://gerrit.wikimedia.org/r/1135748

Change #1135740 merged by Cwhite:

[operations/puppet@production] logstash: move err field mitigation to effective place

https://gerrit.wikimedia.org/r/1135740

Change #1135948 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] grafana: set max_source_resolution=auto for thanos ds

https://gerrit.wikimedia.org/r/1135948

FTR, it turns out this wasn't the cause; in fact, disabling forcemerge had little impact one way or the other AFAICT. I've followed up on T391661 re: enabling it again.

We're at it again today, best correlation (i.e. not necessarily causation!) is indexing time vs time spent querying vs kafka lag, e.g. here for today's events https://w.wiki/DpuD

My current theory is that heavy queries over large indices all with one shard cause opensearch to slow down significantly, enough to cause indexing to slow down and thus kafka backpressure.

What might help to further research this theory is turning on shard slow logs: https://docs.opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/#shard-slow-logs

Ideally also search request slow logs, though that's an OpenSearch 2.12 feature: https://docs.opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/#search-request-slow-logs
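For reference, shard slow logs are controlled by dynamic index settings, so they can be switched on per index via the `_settings` API. Below is a minimal sketch of building that request; the base URL, index name, and the specific thresholds are illustrative assumptions, not what was (or would be) deployed:

```python
import json
import urllib.request

# Example slow-log thresholds (illustrative values, not production config).
SETTINGS = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.indexing.slowlog.threshold.index.warn": "10s",
}

def put_settings_request(base_url: str, index: str, settings: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_settings request."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(settings).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Hypothetical cluster endpoint and index name:
req = put_settings_request("https://localhost:9200", "logstash-k8s-2025.03.27", SETTINGS)
print(req.get_method(), req.full_url)
```

Because these are dynamic settings, they can also be reverted the same way (set the thresholds back to `-1` to disable), which fits the "turn on only while investigating" approach discussed here.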

Change #1137057 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop out_request field

https://gerrit.wikimedia.org/r/1137057

Change #1137057 merged by Cwhite:

[operations/puppet@production] logstash: drop out_request field

https://gerrit.wikimedia.org/r/1137057

Noting for future reference that Logstash has received almost no events from MediaWiki in the past two hours or so (since 17:05 UTC), and this is apparently the reason.

Change #1137064 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: also remove outRequest field

https://gerrit.wikimedia.org/r/1137064

Change #1137064 merged by Cwhite:

[operations/puppet@production] logstash: also remove outRequest field

https://gerrit.wikimedia.org/r/1137064

Change #1137065 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: expand conditional

https://gerrit.wikimedia.org/r/1137065

Change #1137065 merged by Cwhite:

[operations/puppet@production] logstash: expand conditional

https://gerrit.wikimedia.org/r/1137065

Update: we saw an immediate ingest improvement after removing the out_request and outRequest fields, which were generated by mobileapps. We should watch for more inexplicable dips in throughput in the coming days in case this was an incomplete mitigation.

We uncovered these fields as problematic by chance: they happened to show up in an analysis of the DLQ because they exceeded the 2048 index key limit.
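The kind of DLQ analysis described here can be sketched as counting the flattened leaf fields of each rejected document and flagging the ones that approach the key limit. The document shape below is a made-up example:

```python
# Sketch: count the flattened leaf fields of a JSON log document, the way a
# flattened index mapping would, to spot events exceeding a field limit.
FIELD_LIMIT = 2048  # the index key limit mentioned above

def count_leaf_fields(obj) -> int:
    """Recursively count leaf keys; scalars and arrays each end a field path."""
    if isinstance(obj, dict):
        return sum(count_leaf_fields(v) for v in obj.values())
    return 1

# Hypothetical offender: a large map-typed field like out_request.
doc = {"host": "mw1234", "out_request": {f"hdr{i}": "x" for i in range(3000)}}
n = count_leaf_fields(doc)
print(n, "fields;", "over limit" if n > FIELD_LIMIT else "ok")
```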

To improve detection, more graphs should be provisioned to watch OpenSearch for spikes in time spent parsing and writing to indexes.

Adoption of ECS and improvements around schema validation will prevent issues like this from occurring in the future.

FWIW I have held off on those for now; it's easy enough to turn slow logs on whenever we need them.

Checking back on this, it seems the situation hasn't improved much.

Screenshot From 2025-05-13 21-51-25.png (363×1 px, 79 KB)

I looked around at the logs and found that the ML hosts produce more than half of the logs in the k8s indexes. I think this may be a good place to partition.

Screenshot From 2025-05-13 21-52-41.png (482×1 px, 35 KB)

Screenshot From 2025-05-13 21-53-17.png (482×1 px, 35 KB)

Change #1145339 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: create partition for ml logs

https://gerrit.wikimedia.org/r/1145339

Change #1145339 merged by Cwhite:

[operations/puppet@production] logstash: create partition for ml logs

https://gerrit.wikimedia.org/r/1145339

Change #1152079 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop request.query field

https://gerrit.wikimedia.org/r/1152079

Change #1152079 merged by Cwhite:

[operations/puppet@production] logstash: drop request.query field

https://gerrit.wikimedia.org/r/1152079

Change #1152088 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: remove mw-script large fields

https://gerrit.wikimedia.org/r/1152088

Change #1152088 merged by Cwhite:

[operations/puppet@production] logstash: remove mw-script large fields

https://gerrit.wikimedia.org/r/1152088

Change #1152093 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: mw-script remove text field

https://gerrit.wikimedia.org/r/1152093

Change #1152093 merged by Cwhite:

[operations/puppet@production] logstash: mw-script remove text field

https://gerrit.wikimedia.org/r/1152093

Change #1152095 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: remove heading field

https://gerrit.wikimedia.org/r/1152095

Change #1152095 merged by Cwhite:

[operations/puppet@production] logstash: mw-script remove heading and outgoing_link field

https://gerrit.wikimedia.org/r/1152095

Change #1152108 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop logs from DumpIndex.php

https://gerrit.wikimedia.org/r/1152108

Change #1153279 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: Reroute apache.access logs to webrequest partition

https://gerrit.wikimedia.org/r/1153279

@colewhite Can you give us a summary of the state of this problem?

This task's scope has expanded a bit since its inception. Some applications were logging large JSON objects that would slow down parsing within OpenSearch, generating backpressure on Logstash and reducing ingest capacity to well below the enqueue rate. This has largely been mitigated, and a permanent fix is in the works.

Current state: we're still ingesting a volume of logs that keeps us closer to overrunning the available disk than is comfortable.

Change #1153279 merged by Cwhite:

[operations/puppet@production] logstash: Reroute apache.access logs to webrequest partition

https://gerrit.wikimedia.org/r/1153279

Change #1160230 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: relocate ecs-default indexes to hdd nodes after 8 weeks

https://gerrit.wikimedia.org/r/1160230

Change #1160875 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop mobileapps detail field

https://gerrit.wikimedia.org/r/1160875

Change #1160875 merged by Cwhite:

[operations/puppet@production] logstash: drop mobileapps detail field

https://gerrit.wikimedia.org/r/1160875

7 weeks of ecs-default indexes have been moved to the HDD nodes, and replication was reduced on one week of late-May indexes to mitigate imminent high-watermark problems.
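Both of those mitigations map onto dynamic index settings: an allocation-requirement attribute to pin indexes to a node class, and the replica count. A minimal sketch of the settings payload, assuming a node attribute named `disktype` (the actual attribute name in this cluster may differ):

```python
# Sketch: dynamic index settings for relocating an index to hdd-class nodes
# and reducing its replica count. "disktype" is an assumed node attribute.
def relocation_settings(disk_class: str, replicas: int) -> dict:
    return {
        "index.routing.allocation.require.disktype": disk_class,
        "index.number_of_replicas": replicas,
    }

settings = relocation_settings("hdd", 0)
print(settings)
```

Applied via PUT to `/<index>/_settings`, the allocation requirement makes OpenSearch move the shards off the SSD nodes on its own, and dropping replicas frees the space immediately.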

Change #1163053 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: reduce replicas for high-volume logstash-ml logs

https://gerrit.wikimedia.org/r/1163053

Change #1163053 merged by Cwhite:

[operations/puppet@production] logstash: reduce replicas for high-volume logstash-ml logs

https://gerrit.wikimedia.org/r/1163053

Change #1165532 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: reroute mobileapps webrequests

https://gerrit.wikimedia.org/r/1165532

Change #1165532 merged by Cwhite:

[operations/puppet@production] logstash: reroute mobileapps webrequests

https://gerrit.wikimedia.org/r/1165532

Change #1175597 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop logspam

https://gerrit.wikimedia.org/r/1175597

Change #1175597 merged by Cwhite:

[operations/puppet@production] logstash: drop logspam

https://gerrit.wikimedia.org/r/1175597

Change #1189534 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop 99% eventgate-analytics-external logs

https://gerrit.wikimedia.org/r/1189534

Change #1189534 merged by Cwhite:

[operations/puppet@production] logstash: drop 99% eventgate-analytics-external logs

https://gerrit.wikimedia.org/r/1189534
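One way to implement a "drop 99%" rule deterministically is hash-based sampling on a stable event identifier, so the kept 1% is reproducible rather than random. This is only an illustration of the idea; the actual change may simply use a percentage-based drop in the Logstash pipeline:

```python
import hashlib

# Sketch: keep a deterministic ~1% of events by hashing a stable event id.
def keep_event(event_id: str, keep_percent: int = 1) -> bool:
    digest = hashlib.sha1(event_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < keep_percent

# Hypothetical event ids; expect roughly 1% of 10,000 to survive.
kept = sum(keep_event(f"evt-{i}") for i in range(10_000))
print(f"kept {kept} of 10000")
```

The hash makes the decision a pure function of the event id, so reprocessing the same stream drops exactly the same events.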

Change #1201827 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: reduce logstash-ml index utilization on ssd-class nodes

https://gerrit.wikimedia.org/r/1201827

Change #1201827 merged by Cwhite:

[operations/puppet@production] logstash: reduce logstash-ml index utilization on ssd-class nodes

https://gerrit.wikimedia.org/r/1201827

Change #1213589 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop netdev logspam

https://gerrit.wikimedia.org/r/1213589

Change #1213589 merged by Cwhite:

[operations/puppet@production] logstash: drop netdev logspam

https://gerrit.wikimedia.org/r/1213589

Change #1217539 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move logstash-ml logs to hdd-class nodes after 7d

https://gerrit.wikimedia.org/r/1217539

Change #1217539 merged by Cwhite:

[operations/puppet@production] logstash: move logstash-ml logs to hdd-class nodes after 7d

https://gerrit.wikimedia.org/r/1217539

Change #1229198 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: remove logstash-ml indices after 70d

https://gerrit.wikimedia.org/r/1229198

Change #1229198 merged by Cwhite:

[operations/puppet@production] logstash: remove logstash-ml indices after 70d

https://gerrit.wikimedia.org/r/1229198