
Reduce logstash logs from machine learning infra
Open, Needs Triage, Public

Assigned To
None
Authored By
Ladsgroup
Feb 3 2026, 6:29 PM

Description

See T390215: Logstash is overwhelmed. We are having a lot of trouble with the sheer number of logs being ingested by our logstash infra. I did a quick check and around half of all the logs are from ML infra: https://logstash.wikimedia.org/goto/1a2483a4a3958f77ea6df119d7b16a22. This is currently around 1,000,000 logs per minute.

For example, for access requests, maybe implement sampling for anything that returns a 200?
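A minimal sketch of what status-based sampling could look like, with a hypothetical `make_sampler` helper (none of these names come from the task): keep every non-200 event, and one in every N events with status 200. Deterministic counter-based sampling is used here so the behaviour is easy to verify; a production filter might use random or hash-based sampling instead.

```python
import itertools

def make_sampler(rate: int = 100):
    """Return a predicate deciding whether to keep an access-log event.

    Hypothetical sketch: every non-200 event is kept; only one in
    every `rate` events with status 200 survives.
    """
    counter = itertools.count()  # counts only the 200s seen so far

    def keep(status: int) -> bool:
        if status != 200:
            return True  # errors and redirects are always kept
        return next(counter) % rate == 0  # sample the successes

    return keep
```

With `rate=100` this would cut the 200-status volume by roughly two orders of magnitude while leaving error logs untouched.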

Also, a much simpler mitigation: I'm seeing that the majority of the logs have an empty message and an empty log field: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-ml-1-7.0.0-1-2026.02.03?id=Cii-JJwBVE0pYbVvI1fq
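If the empty events really carry no information, a logstash `drop` filter could discard them at ingestion time. This is only a sketch; the field names `[message]` and `[log]` are assumptions based on the linked document, and the real pipeline may structure these fields differently.

```
filter {
  # Hypothetical mitigation: drop events where both fields are empty
  # (field names are assumptions, not confirmed against the wmf schema).
  if [message] == "" and [log] == "" {
    drop { }
  }
}
```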

Thank you!

Event Timeline

Also, if you check the UA, most logs are simply from "MediaWiki/1.46.0-wmf.13" or "ChangePropagation/WMF". Can we sample these?
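Logstash's `drop` filter also supports probabilistic dropping, so sampling these user agents could be as simple as the fragment below. The `[user_agent]` field name and the 99% rate are assumptions for illustration.

```
filter {
  # Hypothetical: keep roughly 1% of events from these user agents
  # (field name [user_agent] is an assumption about the wmf schema).
  if [user_agent] in ["MediaWiki/1.46.0-wmf.13", "ChangePropagation/WMF"] {
    drop { percentage => 99 }
  }
}
```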

I can see that most logs have "kubernetes.pod_name" values like:
"controller-..."
"webhook-..."
"kserve-controller-manager-..."
"istio-ingressgateway-..."
These are system logs generated by kserve, which seem to be the majority.
@DPogorzelski-WMF Is there a way to reduce the system logs from kserve?
@elukey sorry to ping you, but maybe you have some insights here

pod-name-1.png (1×1 px, 307 KB)

pod-name-2.png (1×1 px, 337 KB)

pod-name-3.png (1×1 px, 334 KB)

I think the best course of action is to split the logs by namespace.
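One way to sketch the namespace split, assuming the events carry a `[kubernetes][namespace_name]` field (the exact field name in our pipeline is an assumption): route each namespace to its own index, so noisy namespaces can be retained, sampled, or dropped independently.

```
output {
  # Hypothetical: one index per Kubernetes namespace
  # (hosts and index pattern are illustrative only).
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logstash-%{[kubernetes][namespace_name]}-%{+YYYY.MM.dd}"
  }
}
```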

I think we should focus on finding config options for knative and kserve at the moment; that should reduce the firehose by a lot. Most of those logs are things we don't need, and IIRC they are emitted by those pods by default.
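For knative specifically, the usual knob is the `config-logging` ConfigMap, which holds a zap logger configuration as JSON; raising the level to `warn` silences the default info-level chatter. A sketch of what that could look like (the namespace and exact values used on ml-serve are assumptions, not taken from the deployed chart):

```yaml
# Hypothetical sketch of the ConfigMap knative reads its log level from.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: knative-serving
data:
  zap-logger-config: |
    {
      "level": "warn",
      "development": false,
      "encoding": "json",
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"]
    }
```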

Change #1247995 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: reduce logs emitted by knative components on ml-serve

https://gerrit.wikimedia.org/r/1247995

Tested in staging and the knative traffic volume dropped:

Screenshot From 2026-03-04 14-39-28.png (572×2 px, 94 KB)

Change #1247995 merged by Elukey:

[operations/deployment-charts@master] admin_ng: reduce logs emitted by knative components on ml-serve

https://gerrit.wikimedia.org/r/1247995

Knative should be good now:

Screenshot From 2026-03-04 15-15-32.png (762×5 px, 279 KB)

For the kserve controller we sadly cannot do much from what I can see, since we'd need https://github.com/kserve/kserve/commit/5e7207ca0870e37e72e119845f7f8933ad57ca6a, which is available from 0.12 onward (and we run 0.11.2). We could backport it to our image; after that we'd have all the zap-logging options to use.
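Assuming that commit wires up controller-runtime's standard zap flag set, the backport would let us pass flags like these to the kserve controller manager container (the exact values below are illustrative, not what was deployed):

```yaml
# Hypothetical manager container args once the zap-options
# backport is in place (flag values are illustrative only).
args:
  - --zap-encoder=json
  - --zap-log-level=warn          # drop the info-level chatter
  - --zap-stacktrace-level=error  # stacktraces only for errors
```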

@DPogorzelski-WMF @klausman we already have kserve 0.13 in production-images, so in theory we could simply upgrade the control plane + helm chart to include the above commit and reduce the spam to logstash considerably. Lemme know how you want to proceed :)

Change #1250573 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::logstash: drop kserve-controller's logs

https://gerrit.wikimedia.org/r/1250573