While investigating the parent task I noticed that Envoy's default histogram buckets are numerous, which leads to a cardinality explosion. Many of the upper buckets are effectively equivalent to the +Inf bucket: almost no measurements fall between their boundary and +Inf.
I checked the Envoy histograms and their cardinality, for example in Prometheus k8s eqiad:
count by (__name__) ({__name__=~'^envoy.*bucket'})
envoy_cluster_upstream_cx_connect_ms_bucket{} 622080
envoy_cluster_upstream_cx_length_ms_bucket{} 565180
envoy_cluster_upstream_rq_time_bucket{} 586580
envoy_http_downstream_cx_length_ms_bucket{} 591800
envoy_http_downstream_rq_time_bucket{} 619460
envoy_listener_manager_lds_update_duration_bucket{} 5020
envoy_server_initialization_time_ms_bucket{} 5020
envoy_cluster_manager_cds_update_duration_bucket{} 5020
Findings
From the data below I found the following:
- the 0.5 and 1 buckets report the same counts, so one of them (0.5, for example) can be dropped
- for upstream and downstream rq_time histograms there is very little information above the 10000 bucket
- upstream_cx_connect_ms has very little information above the 1000 bucket
- conversely, upstream_cx_length_ms has little information below the 1000 bucket
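The findings above can be checked mechanically from the cumulative bucket counts. The following is a minimal sketch, using the envoy_cluster_upstream_cx_connect_ms_bucket counts from the breakdown in this task; the helper names `bucket_shares` and `tail_share` are made up for illustration and are not part of any tool.

```python
# Cumulative counts per `le` bucket, copied from the
# envoy_cluster_upstream_cx_connect_ms_bucket breakdown in this task.
INF = float("inf")
CX_CONNECT_MS = {
    0.5: 1737413885, 1: 1737413885, 5: 1752232402, 10: 2793533099,
    25: 2856777645, 50: 2886038625, 100: 2916965382, 250: 2969494809,
    500: 3001119537, 1000: 3011108430, 2500: 3011861662, 5000: 3011861823,
    10000: 3011861835, 30000: 3011861835, 60000: 3011861835,
    300000: 3011861835, 600000: 3011861835, 1800000: 3011861835,
    3600000: 3011861835, INF: 3011861835,
}


def bucket_shares(cumulative):
    """Fraction of all observations that fall into each individual bucket.

    Prometheus buckets are cumulative, so the share of one bucket is the
    difference to the next lower boundary, divided by the +Inf count.
    """
    total = cumulative[INF]
    shares = {}
    previous = 0
    for le in sorted(cumulative):
        shares[le] = (cumulative[le] - previous) / total
        previous = cumulative[le]
    return shares


def tail_share(cumulative, le):
    """Fraction of observations above the given bucket boundary."""
    return 1 - cumulative[le] / cumulative[INF]


shares = bucket_shares(CX_CONNECT_MS)
# Buckets that add no observations over the next lower boundary.
redundant = [le for le, share in shares.items() if share == 0]
print("buckets adding no observations:", redundant)
print(f"share of observations above 1000 ms: {tail_share(CX_CONNECT_MS, 1000):.6f}")
```

A bucket whose share is zero carries the same information as its lower neighbour, which is the case for the 0.5/1 pair and for everything above 10000 in this histogram.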
With the above in mind, I think the following configuration will work to reduce cardinality and focus histograms:
stats_config:
  histogram_bucket_settings:
    - match:
        safe_regex:
          regex: ".+rq_time$"
      buckets: [ 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500 ]
    - match:
        safe_regex:
          regex: ".+upstream_cx_connect_ms$"
      buckets: [ 1, 5, 10, 25, 50, 100, 250, 500, 1000 ]
    - match:
        safe_regex:
          regex: ".+(upstream|downstream)_cx_length_ms$"
      buckets: [ 2500, 5000, 10000, 30000, 60000, 300000 ]
    # remove 0.5, 1 and > 60000 default buckets
    - match:
        safe_regex:
          regex: ".+"
      buckets: [ 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000 ]
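To get a rough idea of the effect, the sketch below estimates the number of `_bucket` series after the change. It rests on several assumptions that should be verified: each histogram currently exposes the 19 default boundaries plus +Inf (as in the breakdown tables), the series count scales linearly with the number of buckets, the first matching rule wins, and the regex must match the whole stat name. The Envoy-native stat names (`cluster.example...`, `http.example...`) are hypothetical placeholders chosen only to exercise the regexes.

```python
import re

# Default boundaries as seen in the breakdown tables (19 finite buckets);
# the +Inf bucket is added on top when exported to Prometheus.
DEFAULT_BUCKETS = [0.5, 1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000,
                   10000, 30000, 60000, 300000, 600000, 1800000, 3600000]

# Proposed rules, in the same order as in the configuration above.
RULES = [
    (r".+rq_time$", [1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500]),
    (r".+upstream_cx_connect_ms$", [1, 5, 10, 25, 50, 100, 250, 500, 1000]),
    (r".+(upstream|downstream)_cx_length_ms$",
     [2500, 5000, 10000, 30000, 60000, 300000]),
    (r".+", [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000,
             30000, 60000]),
]

# Current series counts from the query output in this task, keyed by a
# representative stat name (hypothetical names, for illustration only).
CURRENT_SERIES = {
    "cluster.example.upstream_cx_connect_ms": 622080,
    "cluster.example.upstream_cx_length_ms": 565180,
    "cluster.example.upstream_rq_time": 586580,
    "http.example.downstream_cx_length_ms": 591800,
    "http.example.downstream_rq_time": 619460,
    "listener_manager.lds.update_duration": 5020,
    "server.initialization_time_ms": 5020,
    "cluster_manager.cds.update_duration": 5020,
}


def buckets_for(stat_name):
    """Return the bucket list of the first rule that matches the stat name."""
    for pattern, buckets in RULES:
        if re.fullmatch(pattern, stat_name):
            return buckets
    return DEFAULT_BUCKETS


def estimate(series_now, stat_name):
    """Estimated series count after the change, assuming linear scaling."""
    label_sets = series_now // (len(DEFAULT_BUCKETS) + 1)  # +1 for +Inf
    return label_sets * (len(buckets_for(stat_name)) + 1)


old_total = sum(CURRENT_SERIES.values())
new_total = sum(estimate(n, name) for name, n in CURRENT_SERIES.items())
for name, n in CURRENT_SERIES.items():
    print(f"{name}: {n} -> {estimate(n, name)}")
print(f"total: {old_total} -> {new_total}")
```

This is only an estimate: the real numbers depend on how Envoy applies the matchers to its stat names and on whether every label set currently exposes all default buckets.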
Histogram breakdown
Below is the per-bucket breakdown for each of the highest-cardinality histograms. Counts are cumulative, and the ratio is relative to the +Inf bucket.
envoy_cluster_upstream_cx_connect_ms_bucket
| Connect time (ms) | Connection Count | Ratio |
|---|---|---|
| +Inf | 3011861835 | 1.0000000000 |
| 3600000 | 3011861835 | 1.0000000000 |
| 1800000 | 3011861835 | 1.0000000000 |
| 600000 | 3011861835 | 1.0000000000 |
| 300000 | 3011861835 | 1.0000000000 |
| 60000 | 3011861835 | 1.0000000000 |
| 30000 | 3011861835 | 1.0000000000 |
| 10000 | 3011861835 | 1.0000000000 |
| 5000 | 3011861823 | 0.9999999960 |
| 2500 | 3011861662 | 0.9999999426 |
| 1000 | 3011108430 | 0.9997501057 |
| 500 | 3001119537 | 0.9964332092 |
| 250 | 2969494809 | 0.9859326052 |
| 100 | 2916965382 | 0.9684902711 |
| 50 | 2886038625 | 0.9582262723 |
| 25 | 2856777645 | 0.9485094738 |
| 10 | 2793533099 | 0.9275073595 |
| 5 | 1752232402 | 0.5817812161 |
| 1 | 1737413885 | 0.5768578642 |
| 0.5 | 1737413885 | 0.5768578642 |
envoy_cluster_upstream_cx_length_ms_bucket
| Connection length (ms) | Connection Count | Ratio |
|---|---|---|
| +Inf | 3032951969 | 1.0000000000 |
| 3600000 | 3032523095 | 0.9998586896 |
| 1800000 | 3031804137 | 0.9996215352 |
| 600000 | 3030982136 | 0.9993507080 |
| 300000 | 3026391966 | 0.9978413617 |
| 60000 | 2933882828 | 0.9673003397 |
| 30000 | 2785005415 | 0.9183137903 |
| 10000 | 2214836590 | 0.7302582329 |
| 5000 | 1520058549 | 0.5011489346 |
| 2500 | 188065333 | 0.0619743750 |
| 1000 | 27725493 | 0.0091414399 |
| 500 | 19517999 | 0.0064353164 |
| 250 | 7895422 | 0.0026032071 |
| 100 | 2278798 | 0.0007513786 |
| 50 | 1844283 | 0.0006080478 |
| 25 | 1797510 | 0.0005926940 |
| 10 | 1662779 | 0.0005482100 |
| 5 | 1446792 | 0.0004769258 |
| 1 | 223694 | 0.0000737545 |
| 0.5 | 223694 | 0.0000737545 |
envoy_cluster_upstream_rq_time_bucket
| Latency (ms) | Request Count | Ratio |
|---|---|---|
| +Inf | 80392131818 | 1.0000000000 |
| 3600000 | 80392131066 | 0.9999999991 |
| 1800000 | 80392130579 | 0.9999999985 |
| 600000 | 80392129243 | 0.9999999968 |
| 300000 | 80392127507 | 0.9999999946 |
| 60000 | 80392088986 | 0.9999994648 |
| 30000 | 80391895161 | 0.9999970642 |
| 10000 | 80388616732 | 0.9999563080 |
| 5000 | 80378958355 | 0.9998361054 |
| 2500 | 80341439503 | 0.9993691761 |
| 1000 | 80119149080 | 0.9966097780 |
| 500 | 79595383747 | 0.9901766401 |
| 250 | 78766328583 | 0.9797579732 |
| 100 | 76486896678 | 0.9514031363 |
| 50 | 71535580313 | 0.8899342089 |
| 25 | 67605427983 | 0.8409297654 |
| 10 | 66385070736 | 0.8257381617 |
| 5 | 65546296277 | 0.8152977932 |
| 1 | 43557423366 | 0.5418004934 |
| 0.5 | 43557423366 | 0.5418004934 |
envoy_http_downstream_cx_length_ms_bucket
| Connection length (ms) | Connection Count | Ratio |
|---|---|---|
| +Inf | 11779070899 | 1.0000000000 |
| 3600000 | 11735809866 | 0.9963252033 |
| 1800000 | 11731264483 | 0.9959407196 |
| 600000 | 11715336486 | 0.9945872097 |
| 300000 | 11685298726 | 0.9920405507 |
| 60000 | 11508039606 | 0.9769795744 |
| 30000 | 11428418229 | 0.9702184275 |
| 10000 | 11144132203 | 0.9460779732 |
| 5000 | 10714662661 | 0.9096277764 |
| 2500 | 9514899555 | 0.8078559026 |
| 1000 | 9431764624 | 0.8007290341 |
| 500 | 9273182328 | 0.7872546641 |
| 250 | 8979839772 | 0.7624395634 |
| 100 | 8192578011 | 0.6955141264 |
| 50 | 5837174971 | 0.4955486309 |
| 25 | 3968125599 | 0.3368799848 |
| 10 | 3689856264 | 0.3132552456 |
| 5 | 2802703988 | 0.2379402818 |
| 1 | 181311654 | 0.0153926906 |
| 0.5 | 181311654 | 0.0153926906 |
envoy_http_downstream_rq_time_bucket
| Latency (ms) | Request Count | Ratio |
|---|---|---|
| +Inf | 62171609495 | 1.0000000000 |
| 3600000 | 62171608977 | 0.9999999917 |
| 1800000 | 62171608618 | 0.9999999859 |
| 600000 | 62171607612 | 0.9999999698 |
| 300000 | 62171606308 | 0.9999999489 |
| 60000 | 62171497419 | 0.9999981973 |
| 30000 | 62171278301 | 0.9999946729 |
| 10000 | 62169089736 | 0.9999594069 |
| 5000 | 62163495180 | 0.9998704369 |
| 2500 | 62140414940 | 0.9994966057 |
| 1000 | 62035284852 | 0.9978033271 |
| 500 | 61786742772 | 0.9938264152 |
| 250 | 61411014287 | 0.9877962865 |
| 100 | 60459796988 | 0.9724780371 |
| 50 | 57876099816 | 0.9310401889 |
| 25 | 55692930819 | 0.8957894398 |
| 10 | 54005280155 | 0.8687450717 |
| 5 | 53498624624 | 0.8605022275 |
| 1 | 32297147862 | 0.5195487733 |
| 0.5 | 32297147862 | 0.5195487733 |