
Strategy for Envoy metrics and Prometheus
Open, Needs Triage, Public

Description

As became painfully obvious in https://phabricator.wikimedia.org/T354399#9615108, the rate of samples/s ingested by Prometheus k8s has grown faster than I anticipated (~2.5x in ~2 months). Prometheus itself seems fine with the current load, though it does enter a crashloop on restart because it OOMs while replaying its WAL. Truncating the WAL works as a mitigation, though of course it isn't ideal due to the data loss (upstream issue).

A quick look at https://prometheus-codfw.wikimedia.org/k8s/tsdb-status as of today shows that the biggest metrics by series count are from Envoy:

Top 10 series count by metric names
Name	Count
envoy_http_downstream_cx_length_ms_bucket	681580
envoy_http_downstream_rq_time_bucket	681580
envoy_cluster_upstream_cx_length_ms_bucket	663000
envoy_cluster_upstream_cx_connect_ms_bucket	663000
envoy_cluster_upstream_rq_time_bucket	293980
istio_request_bytes_bucket	215420
istio_request_duration_milliseconds_bucket	215420
istio_response_bytes_bucket	215420
envoy_http_downstream_rq_xx	170395
container_blkio_device_usage_total	72556
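
For reference, the same top-10 list can be pulled programmatically from Prometheus' TSDB stats API (/api/v1/status/tsdb). A minimal Python sketch, assuming the API is reachable under the same /k8s prefix as the web UI linked above:

```
#!/usr/bin/env python3
"""Fetch top series counts by metric name from Prometheus' TSDB status API.

The base URL is an assumption inferred from the web UI link in the task
description; adjust it for the instance being inspected.
"""
import requests

PROMETHEUS_BASE = "https://prometheus-codfw.wikimedia.org/k8s"  # assumed API prefix


def top_series_by_metric(base_url: str, limit: int = 10) -> list[tuple[str, int]]:
    """Return (metric_name, series_count) pairs, largest first."""
    resp = requests.get(f"{base_url}/api/v1/status/tsdb", timeout=10)
    resp.raise_for_status()
    stats = resp.json()["data"]["seriesCountByMetricName"]
    pairs = [(entry["name"], int(entry["value"])) for entry in stats]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:limit]


if __name__ == "__main__":
    for name, count in top_series_by_metric(PROMETHEUS_BASE):
        print(f"{name}\t{count}")
```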

This task is to brainstorm short/medium-term ideas on how to address the issue. Some avenues for (possibly concurrent) investigation:

  • Work on the Prometheus WAL replay memory explosion (e.g. capture memory profiles, work with upstream, etc.)
  • Ingest fewer metrics, for example test scraping from /stats/prometheus?usedonly as per upstream docs (see the sketch after this list)
  • Scale up (memory-wise) the Prometheus hosts
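
To gauge how much usedonly would save before touching the scrape config, one can compare the full and used-only output of a single Envoy admin endpoint. A rough Python sketch, assuming the admin interface is reachable on localhost:9901 (the upstream default; the actual address in our k8s sidecars may differ):

```
#!/usr/bin/env python3
"""Compare Envoy's full vs. used-only Prometheus stats output sizes."""
import requests

ADMIN = "http://localhost:9901"  # assumed Envoy admin address


def count_series(path: str) -> int:
    """Count exposition lines that are actual samples (skip HELP/TYPE comments)."""
    text = requests.get(f"{ADMIN}{path}", timeout=10).text
    return sum(1 for line in text.splitlines() if line and not line.startswith("#"))


full = count_series("/stats/prometheus")
used = count_series("/stats/prometheus?usedonly")
print(f"all stats:      {full} series")
print(f"usedonly stats: {used} series ({used / full:.1%} of full output)")
```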

The longer-term, "nail in the coffin" type of solution is to scale Prometheus horizontally, e.g. via sharding.
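
For context on what sharding would look like mechanically: Prometheus can split scrape targets across instances with the hashmod relabel action, where each shard keeps only the targets whose hashed labels, modulo the shard count, match its index. A toy Python illustration of the idea (not Prometheus' exact implementation, which MD5-hashes the concatenated source labels and takes the modulus of part of the digest); the target addresses are made up:

```
#!/usr/bin/env python3
"""Toy illustration of hashmod-style sharding of scrape targets."""
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Assign a target to a shard: hash of the target address modulo shard count."""
    digest = hashlib.md5(target.encode()).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 2
targets = ["10.64.0.1:9361", "10.64.0.2:9361", "10.64.0.3:9361", "10.64.0.4:9361"]
for t in targets:
    print(f"{t} -> shard {shard_for(t, NUM_SHARDS)}")
```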

Event Timeline

Just to create the reference, I would assume this is a consequence of T290536: Serve production traffic via Kubernetes and friends.

Change 1012995 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: scrape envoy on k8s metrics with 'usedonly'

https://gerrit.wikimedia.org/r/1012995

Change #1012995 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: scrape envoy on k8s metrics with 'usedonly'

https://gerrit.wikimedia.org/r/1012995

Change #1013515 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: scrape envoy on k8s metrics with 'usedonly' (take #2)

https://gerrit.wikimedia.org/r/1013515

Change #1013515 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: scrape envoy on k8s metrics with 'usedonly' (take #2)

https://gerrit.wikimedia.org/r/1013515

Promising results: samples/s ingested in eqiad went from ~200k/s to ~110k/s after the change (and slowly increasing again).

Screenshot: 2024-03-25-105141_662x562_scrot.png (562×662 px, 27 KB)

I'll keep monitoring the situation over the next few days and see what the pattern looks like.
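
One way to double-check the ingestion rate without the dashboard is to query the instance's own prometheus_tsdb_head_samples_appended_total counter. A Python sketch; the eqiad base URL is a guess modeled on the codfw URL in the description:

```
#!/usr/bin/env python3
"""Print the current samples/s ingestion rate reported by a Prometheus instance."""
import requests

PROMETHEUS_BASE = "https://prometheus-eqiad.wikimedia.org/k8s"  # assumed API prefix

query = "rate(prometheus_tsdb_head_samples_appended_total[5m])"
resp = requests.get(f"{PROMETHEUS_BASE}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "unknown")
    print(f"{instance}: {float(result['value'][1]):.0f} samples/s")
```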

Change #1016786 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: disable pint promql/series for EnvoyRuntimeAdminOverrides

https://gerrit.wikimedia.org/r/1016786

Change #1016786 merged by Filippo Giunchedi:

[operations/alerts@master] sre: disable pint promql/series for EnvoyRuntimeAdminOverrides

https://gerrit.wikimedia.org/r/1016786