As became painfully obvious in https://phabricator.wikimedia.org/T354399#9615108, the rate of samples/s ingested by Prometheus k8s has accelerated faster than I anticipated (~2.5x in ~2 months). Prometheus itself seems fine with the current load, though it does enter a crashloop on restart due to OOM while replaying its WAL. Truncating the WAL works as a mitigation, though of course it isn't ideal since it loses data (upstream issue)
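For a sense of urgency, a back-of-the-envelope calculation of the growth rate implied by the ~2.5x increase over ~2 months (figures from above; this assumes roughly exponential growth, which may not hold):

```python
import math

growth = 2.5   # overall multiplier observed in the task
months = 2     # over roughly this many months

# Per-month multiplier, assuming compounding growth
monthly = growth ** (1 / months)
print(f"~{monthly:.2f}x per month")  # ~1.58x per month

# Implied doubling time in months
doubling = math.log(2) / math.log(monthly)
print(f"doubling time ~{doubling:.1f} months")  # ~1.5 months
```

In other words, if the trend continues, ingestion roughly doubles every six weeks or so, which is why short-term mitigations are worth pursuing in parallel with the long-term fix.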
A quick look at https://prometheus-codfw.wikimedia.org/k8s/tsdb-status as of today shows that the biggest metrics are from Envoy:
Top 10 series count by metric name:

| Name | Count |
| --- | --- |
| envoy_http_downstream_cx_length_ms_bucket | 681580 |
| envoy_http_downstream_rq_time_bucket | 681580 |
| envoy_cluster_upstream_cx_length_ms_bucket | 663000 |
| envoy_cluster_upstream_cx_connect_ms_bucket | 663000 |
| envoy_cluster_upstream_rq_time_bucket | 293980 |
| istio_request_bytes_bucket | 215420 |
| istio_request_duration_milliseconds_bucket | 215420 |
| istio_response_bytes_bucket | 215420 |
| envoy_http_downstream_rq_xx | 170395 |
| container_blkio_device_usage_total | 72556 |
This task is to brainstorm short/medium-term ideas on how to address the issue; some avenues for (possibly concurrent) investigation:
- Work on the Prometheus WAL replay memory explosion (e.g. capture memory profiles, work with upstream, etc)
- Ingest fewer metrics, for example by testing scraping from /stats/prometheus?usedonly as per upstream docs
- Scale up (memory wise) Prometheus hosts
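For the `usedonly` idea above, a minimal sketch of what the scrape job change could look like (the job name and metrics path shown are illustrative, not our actual config; Envoy only checks for the parameter's presence, so an empty value is enough):

```yaml
scrape_configs:
  - job_name: 'envoy-stats'          # hypothetical job name
    metrics_path: /stats/prometheus
    params:
      usedonly: ['']                 # emits ?usedonly=, filtering to stats that have been used
```

Since most of the top offenders are Envoy/Istio histogram buckets, dropping never-used stats at the source could cut series counts substantially, though the actual savings would need to be measured on a test target first.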
Whereas the longer-term, "nail in the coffin" type of solution is to scale Prometheus horizontally, e.g. via sharding
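As a sketch of what target-based sharding could look like, the standard hashmod relabeling approach (the modulus of 2 and shard number 0 below are placeholders; each shard instance keeps only the targets that hash to its shard number):

```yaml
scrape_configs:
  - job_name: 'k8s-pods'             # hypothetical job name
    relabel_configs:
      # Hash each target's address into one of N buckets
      - source_labels: [__address__]
        modulus: 2                   # N = number of shards (placeholder)
        target_label: __tmp_shard
        action: hashmod
      # This instance is shard 0: keep only its bucket
      - source_labels: [__tmp_shard]
        regex: '0'
        action: keep
```

Each shard then holds a disjoint subset of targets, so both steady-state memory and WAL replay cost scale down per instance; the trade-off is that queries spanning shards need a federation or global-query layer (e.g. Thanos) on top.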
