logstash-k8s indexes have nearly quadrupled in size over the last 60 days, leading to curator forcemerge failures and running the SSD hosts out of space.
(Attached graph: X-axis is day, Y-axis is GB.)
(Attached files: F62305955 analyzed_data_2024.12.27-2025.06.12.png; F59947935, F59947936, F59947937 screenshots from 2025-05-13; F58931465 screenshot from 2025-03-27.)
| Status | Assigned | Task |
|---|---|---|
| Open | colewhite | T390215 Logstash is overwhelmed |
| Open | None | T390218 Add canary release to api-gateway |
| Duplicate | None | T391661 Weekly indices are force-merged by curator every day for a week |
| Stalled | None | T391687 Consider sharding big logging indices |
| Stalled | None | T391714 Review logging cluster merge pressure |
| Declined | None | T392092 Review logging index refresh_intervals |
| Open | None | T392230 Explore improved isolation of non-ECS k8s log topics/indices |
| Open | ayounsi | T412143 ~5k/logs/sec from netdev |
| Open | None | T416443 codfw: Upgrade cloudsw1-b1-codfw (2026) |
| Open | None | T416450 esams: upgrade routers & switches (2026) |
| Open | None | T415289 logging-hd nodes evict-rejoin troubles |
The first suspect is kubernetes.container_name:"production-ratelimit", which is generating debug logs at around 1500/sec.
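For reference, a minimal sketch of how a per-container volume breakdown like the above can be pulled straight from OpenSearch; the endpoint, index pattern, and one-hour window are assumptions, and the field may need a `.keyword` suffix depending on the mapping.

```python
# Sketch: rank k8s containers by log volume over the last hour.
# Endpoint and index pattern are placeholders; kubernetes.container_name is
# the field referenced in this task (may need a .keyword suffix).
import requests

OPENSEARCH = "https://localhost:9200"  # placeholder endpoint

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "by_container": {
            "terms": {"field": "kubernetes.container_name", "size": 10}
        }
    },
}

resp = requests.post(f"{OPENSEARCH}/logstash-k8s-*/_search", json=query)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_container"]["buckets"]:
    rate = bucket["doc_count"] / 3600  # rough logs/sec over the window
    print(f"{bucket['key']}: ~{rate:.0f} logs/sec")
```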
Change #1131777 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):
[operations/deployment-charts@master] api-gateway: Lower ratelimit log level
Change #1131777 merged by jenkins-bot:
[operations/deployment-charts@master] api-gateway: Lower ratelimit log level
Log volume is down quite a bit since raising the production-ratelimit log level filter. Will check back in on this in a few days to re-evaluate in case there are other applications doing the same thing.
Checking back today, istio-ingressgateway is logging about 1000 logs/sec.
istio-ingressgateway was last discussed in July 2023, during its early stages. I think it's time for another look, as istio-ingressgateway is solidly half of all logs we get via kubernetes.
@elukey would it be reasonable to reduce the number of istio-ingressgateway logs stored, either via sampling or by reducing retention to less than 90d? Either way is reasonably easy to accomplish.
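For the reduced-retention option, a rough sketch of what a curator-style age-based delete boils down to; the endpoint, index pattern, and 45-day cutoff are placeholders, and the real thing would stay in the curator action config rather than an ad-hoc script.

```python
# Sketch: delete indices matching a pattern once they pass a retention cutoff,
# keyed on index creation date. Endpoint, pattern, and retention are
# placeholders; in production this belongs in curator, not an ad-hoc script.
from datetime import datetime, timedelta, timezone

import requests

OPENSEARCH = "https://localhost:9200"  # placeholder endpoint
PATTERN = "logstash-k8s-*"             # hypothetical target pattern
RETENTION = timedelta(days=45)         # hypothetical reduced retention

cutoff = datetime.now(timezone.utc) - RETENTION
rows = requests.get(
    f"{OPENSEARCH}/_cat/indices/{PATTERN}",
    params={"h": "index,creation.date", "format": "json"},
).json()

for row in rows:
    created = datetime.fromtimestamp(int(row["creation.date"]) / 1000, timezone.utc)
    if created < cutoff:
        print("deleting", row["index"])
        requests.delete(f"{OPENSEARCH}/{row['index']}").raise_for_status()
```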
Spoke too soon about istio - they're not in the k8s indexes!
The biggest contributor to the logstash-k8s indexes now is revertrisk-language-agnostic-predictor-default-00025, accounting for more than half of all logs from k8s at ~378 logs/sec (~32 million per day).
We'll check back in next week to see whether the overall index size is back within a reasonable range.
Change #1135740 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: move err field mitigation to effective place
Change #1135748 had a related patch set uploaded (by Cwhite; author: Filippo Giunchedi):
[operations/puppet@production] logstash: temporarily disable curator forcemerge
Change #1135748 merged by Cwhite:
[operations/puppet@production] logstash: temporarily disable curator forcemerge
Change #1135740 merged by Cwhite:
[operations/puppet@production] logstash: move err field mitigation to effective place
Change #1135948 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] grafana: set max_source_resolution=auto for thanos ds
FTR, it turns out this wasn't the cause; in fact, disabling forcemerge had little impact one way or another AFAICT. I've followed up on T391661 re: enabling it again.
We're at it again today. The best correlation (i.e. not necessarily causation!) is indexing time vs. time spent querying vs. Kafka lag, e.g. here for today's events: https://w.wiki/DpuD
My current theory is that heavy queries over large indices all with one shard cause opensearch to slow down significantly, enough to cause indexing to slow down and thus kafka backpressure.
What might help to further research this theory is turning on shard slow logs: https://docs.opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/#shard-slow-logs
Ideally also search request slow logs, though that's an OpenSearch 2.12 feature: https://docs.opensearch.org/docs/latest/install-and-configure/configuring-opensearch/logs/#search-request-slow-logs
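For reference, a minimal sketch of what enabling those shard slow logs looks like via dynamic index settings; the endpoint, index pattern, and thresholds are assumptions.

```python
# Sketch: enable query/fetch/indexing shard slow logs on the hot indices.
# Endpoint, index pattern, and thresholds are placeholders.
import requests

OPENSEARCH = "https://localhost:9200"  # placeholder endpoint

settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.indexing.slowlog.threshold.index.warn": "10s",
}

resp = requests.put(f"{OPENSEARCH}/logstash-k8s-*/_settings", json=settings)
resp.raise_for_status()
```

Since these are dynamic settings, they can be switched back off the same way by setting the thresholds to -1.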
Change #1137057 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop out_request field
Change #1137057 merged by Cwhite:
[operations/puppet@production] logstash: drop out_request field
Noting for future reference that Logstash has received almost no events from MediaWiki in the past two hours or so (since 17:05 UTC), and this is apparently the reason.
Change #1137064 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: also remove outRequest field
Change #1137064 merged by Cwhite:
[operations/puppet@production] logstash: also remove outRequest field
Change #1137065 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: expand conditional
Change #1137065 merged by Cwhite:
[operations/puppet@production] logstash: expand conditional
Update: we saw immediate ingest improvement after removing the out_request and outRequest fields which were generated by mobileapps. We should watch for more inexplicable dips in throughput in the coming days if this was an incomplete mitigation.
We uncovered these fields as problematic by chance. These fields happened to show up in an analysis of the DLQ as they exceeded the 2048 index key limit.
To improve detection, more graphs should be provisioned to watch OpenSearch for spikes in time spent parsing and writing to indexes.
Adoption of ECS and improvements around schema validation will prevent issues like this from occurring in the future.
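As a starting point for those graphs, a small sketch of the node-level stats that expose time spent indexing; the endpoint is a placeholder and the fields come from the standard _nodes/stats indexing section.

```python
# Sketch: sample per-node indexing time from the node stats API, which is the
# kind of signal a "time spent writing to index" graph would be built from.
# Endpoint is a placeholder.
import requests

OPENSEARCH = "https://localhost:9200"  # placeholder endpoint

stats = requests.get(f"{OPENSEARCH}/_nodes/stats/indices/indexing").json()
for node_id, node in stats["nodes"].items():
    indexing = node["indices"]["indexing"]
    print(
        node.get("name", node_id),
        "index_total:", indexing["index_total"],
        "index_time_in_millis:", indexing["index_time_in_millis"],
    )
```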
FWIW I have held off on those for now; it's easy enough to turn on slow logs whenever we need them, though.
Checking back on this, it seems the situation hasn't improved much.
I looked around at the logs and found that logs from the ml hosts make up more than half of the k8s indexes. I think this may be a good place to partition.
Change #1145339 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: create partition for ml logs
Change #1145339 merged by Cwhite:
[operations/puppet@production] logstash: create partition for ml logs
Change #1152079 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop request.query field
Change #1152079 merged by Cwhite:
[operations/puppet@production] logstash: drop request.query field
Change #1152088 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: remove mw-script large fields
Change #1152088 merged by Cwhite:
[operations/puppet@production] logstash: remove mw-script large fields
Having merged T395579 ("https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors not showing new data in here") into this task, I note a) maybe that was premature, and b) that one's UBN from the RelEng perspective, since it's a blocker for deployments.
Change #1152093 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: mw-script remove text field
Change #1152093 merged by Cwhite:
[operations/puppet@production] logstash: mw-script remove text field
Change #1152095 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: remove heading field
Change #1152095 merged by Cwhite:
[operations/puppet@production] logstash: mw-script remove heading and outgoing_link field
Change #1152108 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop logs from DumpIndex.php
Change #1153279 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: Reroute apache.access logs to webrequest partition
This task's scope has expanded a bit since its inception. Some applications were logging large JSON objects that would slow down parsing within OpenSearch, generating backpressure on Logstash and reducing ingest capacity to well below the enqueue rate. This has largely been mitigated and a permanent fix is in the works.
The current state is that we're still ingesting a volume of logs that keeps us closer to overrunning the available disk than is comfortable.
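For the disk side, a small sketch of the per-node check that shows how close each data node is to the watermarks; the endpoint is a placeholder.

```python
# Sketch: report per-node disk usage from _cat/allocation to see how close
# each data node is to the high/flood watermarks. Endpoint is a placeholder.
import requests

OPENSEARCH = "https://localhost:9200"  # placeholder endpoint

rows = requests.get(
    f"{OPENSEARCH}/_cat/allocation",
    params={"h": "node,disk.used,disk.avail,disk.percent", "format": "json"},
).json()

for row in rows:
    print(f"{row['node']}: {row['disk.percent']}% used "
          f"({row['disk.used']} used, {row['disk.avail']} free)")
```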
Change #1153279 merged by Cwhite:
[operations/puppet@production] logstash: Reroute apache.access logs to webrequest partition
Change #1160230 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: relocate ecs-default indexes to hdd nodes after 8 weeks
Change #1160875 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop mobileapps detail field
Change #1160875 merged by Cwhite:
[operations/puppet@production] logstash: drop mobileapps detail field
Seven weeks of ecs-default indexes have been moved to the HDD nodes, and replication has been reduced on one week of late-May indexes to mitigate imminent high-watermark problems.
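For context, a minimal sketch of the kind of settings change involved, assuming a node attribute distinguishes hdd-class from ssd-class nodes; the attribute name, endpoint, and index pattern are placeholders, and the real change is rolled out via curator/puppet.

```python
# Sketch: drop replicas and steer older indices onto hdd-class nodes using
# allocation filtering. The "disktype" node attribute, endpoint, and index
# pattern are placeholders; the real change is applied via curator/puppet.
import requests

OPENSEARCH = "https://localhost:9200"  # placeholder endpoint
INDEX_PATTERN = "ecs-default-*"        # hypothetical pattern for the older weeks

settings = {
    "index.number_of_replicas": 0,                       # reduce replication
    "index.routing.allocation.require.disktype": "hdd",  # move to hdd nodes
}

resp = requests.put(f"{OPENSEARCH}/{INDEX_PATTERN}/_settings", json=settings)
resp.raise_for_status()
```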
Change #1163053 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: reduce replicas for high-volume logstash-ml logs
Change #1163053 merged by Cwhite:
[operations/puppet@production] logstash: reduce replicas for high-volume logstash-ml logs
Change #1165532 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: reroute mobileapps webrequests
Change #1165532 merged by Cwhite:
[operations/puppet@production] logstash: reroute mobileapps webrequests
Change #1175597 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop logspam
Change #1175597 merged by Cwhite:
[operations/puppet@production] logstash: drop logspam
Change #1189534 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop 99% eventgate-analytics-external logs
Change #1189534 merged by Cwhite:
[operations/puppet@production] logstash: drop 99% eventgate-analytics-external logs
Change #1201827 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: reduce logstash-ml index utilization on ssd-class nodes
Change #1201827 merged by Cwhite:
[operations/puppet@production] logstash: reduce logstash-ml index utilization on ssd-class nodes
Change #1213589 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: drop netdev logspam
Change #1213589 merged by Cwhite:
[operations/puppet@production] logstash: drop netdev logspam
Change #1217539 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: move logstash-ml logs to hdd-class nodes after 7d
Change #1217539 merged by Cwhite:
[operations/puppet@production] logstash: move logstash-ml logs to hdd-class nodes after 7d
Change #1229198 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] logstash: remove logstash-ml indices after 70d
Change #1229198 merged by Cwhite:
[operations/puppet@production] logstash: remove logstash-ml indices after 70d