Revisit default Istio histogram buckets
Open, Stalled, Needs Triage, Public

Description

This is a copy of T391333 but for Istio, since we have the same problem.

Each metric listed below registers ~751740 time series per hour:

le        istio_response_bytes_bucket
+Inf      3636621795
3600000   3636619282
1800000   3636589359
600000    3635567211
300000    3618565297
60000     3579020081
30000     3530647170
10000     3441200362
5000      3111453349
2500      2922558300
1000      602388210
500       21659406
250       3356102
100       141136
50        138111
25        138111
10        138111
5         138111
1         138111
0.5       138111

le        istio_request_bytes_bucket
+Inf      3636832045
3600000   3636832045
1800000   3636831862
600000    3636831301
300000    3636831130
60000     3636827794
30000     3636825298
10000     3636801871
5000      3636738035
2500      3552367122
1000      211
500       124
250       86
100       0
50        0
25        0
10        0
5         0
1         0
0.5       0

le        istio_response_bytes_bucket
+Inf      3636862856
3600000   3636860343
1800000   3636830389
600000    3635808151
300000    3618805835
60000     3579257735
30000     3530881161
10000     3441429633
5000      3111665226
2500      2922736915
1000      602396815
500       21660279
250       3356366
100       141137
50        138112
25        138112
10        138112
5         138112
1         138112
0.5       138112
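
Since Prometheus histogram buckets are cumulative, two adjacent le rows with the same count mean the upper bucket recorded no additional observations. A minimal Python sketch of that check, using the istio_request_bytes_bucket figures listed above (the helper name per_bucket is illustrative, not part of any tool):

```python
# Cumulative counts copied from the istio_request_bytes_bucket listing (le, count).
request_bytes = [
    (0.5, 0), (1, 0), (5, 0), (10, 0), (25, 0), (50, 0), (100, 0),
    (250, 86), (500, 124), (1000, 211), (2500, 3552367122),
    (5000, 3636738035), (10000, 3636801871), (30000, 3636825298),
    (60000, 3636827794), (300000, 3636831130), (600000, 3636831301),
    (1800000, 3636831862), (3600000, 3636832045),
    (float("inf"), 3636832045),
]


def per_bucket(cumulative):
    """Turn cumulative (le, count) pairs into per-bucket increments."""
    previous = 0
    result = []
    for le, count in cumulative:
        result.append((le, count - previous))
        previous = count
    return result


increments = per_bucket(request_bytes)
total = request_bytes[-1][1]  # the +Inf bucket holds every observation

for le, added in increments:
    print(f"le={le:>9}  +{added:>10}  ({added / total:.4%} of observations)")

# Buckets that add nothing on top of the next lower boundary.
empty = [le for le, added in increments if added == 0]
print("buckets with no new observations:", empty)
```

For this metric almost all observations land between 1000 and 2500 bytes, so most boundaries below 250 carry no information of their own.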

Event Timeline

We can surely come up with a new config to drop the buckets that are not needed, but there is a caveat: the annotation used to reduce/customize the buckets (which needs to be applied to all pods running Istio) is only available from 1.19 onward, and we are running 1.15.x.

- name: sidecar.istio.io/statsHistogramBuckets
  featureStatus: Alpha
  description: Specifies the custom histogram buckets with a prefix matcher to separate the Istio mesh metrics from the Envoy stats, e.g. {"istio":[1,5,10,50,100,500,1000,5000,10000],"envoy":[1,5,10,25,50,100,250,500,1000,2500,5000,10000]}. Default buckets are [0.5,1,5,10,25,50,100,250,500,1000,2500,5000,10000,30000,60000,300000,600000,1800000,3600000].
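
To get a rough feel for the saving, here is a small Python sketch that parses the example value from the description above and counts _bucket series per label set. It assumes one series per boundary plus the implicit le="+Inf" bucket; the function name bucket_series is made up for illustration.

```python
import json

# Example annotation value and default buckets, both taken from the description.
annotation = (
    '{"istio":[1,5,10,50,100,500,1000,5000,10000],'
    '"envoy":[1,5,10,25,50,100,250,500,1000,2500,5000,10000]}'
)
default_buckets = [0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000,
                   10000, 30000, 60000, 300000, 600000, 1800000, 3600000]

custom = json.loads(annotation)


def bucket_series(buckets):
    """One _bucket series per boundary, plus the implicit le="+Inf" bucket."""
    return len(buckets) + 1


before = bucket_series(default_buckets)
after = bucket_series(custom["istio"])
print(f"_bucket series per label set: {before} -> {after} "
      f"({1 - after / before:.0%} fewer)")
```

The _sum and _count series are not included in this count, so the overall reduction per histogram would be slightly smaller.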

We are going to upgrade soon-ish, once all the k8s clusters have been upgraded as well, but it may take some months. In the meantime, we could simply drop the unnecessary buckets via labeldrop, but we should verify that the procedure is sound.
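
One thing worth checking here: in Prometheus relabeling, labeldrop removes labels whose name matches the regex from every series, while drop removes whole series whose joined source label values match. The Python sketch below only simulates those two documented behaviours on a hypothetical three-series sample; it is not Prometheus code, and the set of buckets to remove is an assumption picked for illustration.

```python
import re

# Hypothetical sample: three bucket series of one histogram.
series = [
    {"__name__": "istio_request_bytes_bucket", "le": "0.5"},
    {"__name__": "istio_request_bytes_bucket", "le": "2500"},
    {"__name__": "istio_request_bytes_bucket", "le": "+Inf"},
]


def labeldrop(samples, regex):
    """Remove every label whose *name* fully matches regex; series are kept."""
    pattern = re.compile(regex)
    return [{k: v for k, v in s.items() if not pattern.fullmatch(k)}
            for s in samples]


def drop(samples, source_labels, regex, separator=";"):
    """Remove every series whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    return [
        s for s in samples
        if not pattern.fullmatch(
            separator.join(s.get(name, "") for name in source_labels))
    ]


after_labeldrop = labeldrop(series, "le")
after_drop = drop(
    series,
    ["__name__", "le"],
    r"istio_request_bytes_bucket;(0\.5|1|5|10|25|50|100)",
)

print("labeldrop on le:", after_labeldrop)  # three series, now indistinguishable
print("drop on le values:", after_drop)     # only the unwanted bucket is gone
```

In this simulation labeldrop leaves three series with identical label sets, which is not what we want, whereas a drop rule keyed on the metric name and le value removes only the selected buckets. The regex assumes le values are rendered exactly as in the listings above.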

The only concern I have with dropping metrics based on a given label is that the _sum and _count values will no longer reflect the actual buckets stored in the TSDB.
It might not matter for us, but it's something to keep in mind.

@tappof definitely, we should try to figure out whether we care about it or not. In my head _sum and _count would still make sense (namely lead to consistent results) even if we dropped some buckets, but I am not sure if something could become inconsistent. Do you have anything off the top of your head that could affect us?
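
A quick way to reason about this: bucket counts are cumulative, so removing some le series does not change the counts of the ones that remain, and the le="+Inf" bucket (which matches _count) is untouched as long as it is kept. What changes is the resolution of quantile estimates. The Python sketch below illustrates this with the first istio_response_bytes_bucket listing; the kept boundaries are a hypothetical choice, and quantile is a simplified linear interpolation in the spirit of PromQL's histogram_quantile, not its actual implementation.

```python
INF = float("inf")

# Cumulative counts copied from the first istio_response_bytes_bucket listing.
full = [
    (0.5, 138111), (1, 138111), (5, 138111), (10, 138111), (25, 138111),
    (50, 138111), (100, 141136), (250, 3356102), (500, 21659406),
    (1000, 602388210), (2500, 2922558300), (5000, 3111453349),
    (10000, 3441200362), (30000, 3530647170), (60000, 3579020081),
    (300000, 3618565297), (600000, 3635567211), (1800000, 3636589359),
    (3600000, 3636619282), (INF, 3636621795),
]

# Hypothetical set of boundaries to keep; le="+Inf" must always survive.
keep = {100, 1000, 10000, 60000, 600000, 3600000, INF}
reduced = [(le, count) for le, count in full if le in keep]


def quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from sorted cumulative buckets."""
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == INF:
                return lower_bound  # fall back to the highest finite boundary
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count


print("+Inf bucket, full vs reduced:", full[-1][1], "vs", reduced[-1][1])
print("p50 with all buckets:    ", quantile(0.5, full))
print("p50 with reduced buckets:", quantile(0.5, reduced))
```

Under these assumptions the totals stay consistent, but the median estimate moves because it is now interpolated over the wider 1000 to 10000 range instead of 1000 to 2500.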

I found https://istio.io/latest/docs/ops/common-problems/upgrade-issues/#use-a-proxy-annotation-to-customize-the-histogram-bucket-sizes, which could be useful: it describes applying an Envoy filter, which should work on our Istio version. It is known that metrics and Envoy filters sometimes don't play well together (see https://github.com/istio/istio/issues/39772 for example), but for our use case it should be fine. I'd be inclined to test it on ml-staging to see how it goes.

Change #1143584 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: allow Istio gateways to customize histogram buckets

https://gerrit.wikimedia.org/r/1143584

Change #1144612 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] istio: introduce legacy images to backport features

https://gerrit.wikimedia.org/r/1144612

Change #1144612 abandoned by Elukey:

[operations/docker-images/production-images@master] istio: introduce legacy images to backport features

Reason:

Doesn't work sigh, more detailed explanation in the task.

https://gerrit.wikimedia.org/r/1144612

elukey changed the task status from Open to Stalled. May 13 2025, 1:05 PM

I tried patching 1.15.7 in https://gerrit.wikimedia.org/r/1144612, but I kept running into:

pkg/bootstrap/config.go:250:52: undefined: annotation.SidecarStatsHistogramBuckets

The annotation package comes from istio.io/api/annotation and it holds the allowed annotations, including SidecarStatsHistogramBuckets. My understanding is that the error above is telling us that the version of istio.io/api imported by 1.15.7 doesn't define that annotation, so the imported version would probably need to be bumped. I am not sure if that is possible, but it feels like too much for a simple backport (together with the fact that https://github.com/istio/istio/pull/45368 didn't apply cleanly).

I think that we should wait for Istio 1.24 to be rolled out before proceeding.

Change #1143584 abandoned by Elukey:

[operations/deployment-charts@master] admin_ng: allow Istio gateways to customize histogram buckets

https://gerrit.wikimedia.org/r/1143584