As outlined in the parent task, metric/label churn during k8s deploys is sometimes so high that prometheus@k8s gets OOM killed. As a short-term mitigation, @JMeybohm and I have been looking for low-hanging fruit: high-cardinality metrics or labels that we can safely drop.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | fgiunchedi | T354399 Prometheus @ k8s OOM loop
Resolved | JMeybohm | T354604 Investigate prometheus@k8s metric/label cardinality reduction
Event Timeline
The highest-cardinality label is id, which is used heavily by cadvisor and contains the slice id of the container (like: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod143a77d4_c340_47e0_966c_ae0c06b977b4.slice/docker-b1d6065dd088bb283a27edc9dd9478d0d61bac1ad7846e740f9b56ee80c4b7d5.scope). I've skimmed Grafana and I don't think we use that label anywhere in a k8s context. Metrics are usually matched by name (container) and pod_name.
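For anyone wanting to reproduce the "highest cardinality label" ranking, here is a minimal sketch based on Prometheus's TSDB status endpoint (`/api/v1/status/tsdb`), which reports distinct value counts per label name. The helper names, the base URL you would pass to `fetch_tsdb_status`, and the sample numbers are made up for illustration; this is not necessarily how the ranking above was obtained.

```python
import json
import urllib.request


def fetch_tsdb_status(base_url):
    """Fetch TSDB head statistics from a Prometheus server (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)


def top_labels(status, n=5):
    """Return the n (label name, distinct value count) pairs with the most values."""
    entries = status["data"]["labelValueCountByLabelName"]
    pairs = [(entry["name"], entry["value"]) for entry in entries]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up sample shaped like the API response; the counts are not real measurements.
sample = {
    "status": "success",
    "data": {
        "labelValueCountByLabelName": [
            {"name": "pod_name", "value": 1200},
            {"name": "id", "value": 9000},
            {"name": "job", "value": 40},
        ]
    },
}

print(top_labels(sample, n=2))
```

Against a live server, `top_labels(fetch_tsdb_status(base_url))` would give the same kind of ranking for the current head block.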
Change 989096 had a related patch set uploaded (by JMeybohm; author: JMeybohm):
[operations/puppet@production] prometheus::k8s: Drop id label from cadvisor metrics
Change 989096 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus::k8s: Drop id label from cadvisor metrics
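For context, a rough model of what a `labeldrop` relabel action does: Prometheus matches the regex against every label name (fully anchored) and removes the labels that match. The sketch below only imitates that behaviour in Python; the function name, the sample series, and its label values are illustrative assumptions, and Python's `re` is not the RE2 engine Prometheus uses (for simple patterns like this one they behave the same).

```python
import re


def labeldrop(labels, regex):
    """Drop every label whose *name* fully matches regex, as labeldrop does."""
    pattern = re.compile(regex)
    return {name: value for name, value in labels.items() if not pattern.fullmatch(name)}


# Illustrative series; the label values are made up.
series = {
    "__name__": "container_memory_usage_bytes",
    "container": "app",
    "pod_name": "app-1",
    "id": "/kubepods.slice/example.scope",
    "uid": "1234",
}

dropped = labeldrop(series, "id")
print(dropped)
```

Because the match is anchored, the pattern `id` removes only the `id` label and leaves `uid` untouched. Series that differed only in the dropped label end up with identical label sets afterwards, which is worth checking for before dropping a label.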
Mentioned in SAL (#wikimedia-operations) [2024-01-09T10:54:20Z] <godog> restart prometheus@k8s on prometheus1005 to see if labeldrop id will yield expected results - T354604
While this did cut the cardinality for id in half, it unfortunately did not make any real difference in terms of memory usage or appended samples per second (which I had expected). OTOH, I would also have expected the cardinality to drop sharply, as there are only two other metrics (apart from the cadvisor stuff) that use the "id" label:
https://prometheus-eqiad.wikimedia.org/k8s/classic/graph?g0.range_input=1h&g0.expr=group%20(%7Bid!%3D%22%22%2C%20job!%3D%22k8s-node-cadvisor%22%7D)%20by%20(__name__%2C%20job)&g0.tab=1
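The query in that link is URL-encoded and hard to read as-is. A small sketch using only the Python standard library recovers the PromQL expression from the `g0.expr` parameter:

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://prometheus-eqiad.wikimedia.org/k8s/classic/graph"
    "?g0.range_input=1h"
    "&g0.expr=group%20(%7Bid!%3D%22%22%2C%20job!%3D%22k8s-node-cadvisor%22%7D)"
    "%20by%20(__name__%2C%20job)"
    "&g0.tab=1"
)

# parse_qs percent-decodes the values and returns a list per parameter name.
params = parse_qs(urlsplit(url).query)
expr = params["g0.expr"][0]
print(expr)
# group ({id!="", job!="k8s-node-cadvisor"}) by (__name__, job)
```

In other words, the query lists every metric name and job that still carries a non-empty id label outside the k8s-node-cadvisor job.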
Mentioned in SAL (#wikimedia-operations) [2024-01-09T15:54:33Z] <jayme> restart prometheus@k8s on prometheus1005 with GOGC=60 - T354604
Mentioned in SAL (#wikimedia-operations) [2024-01-09T16:26:57Z] <jayme> restart prometheus@k8s on prometheus1005 revert GOGC to 100 (default) - T354604
Change 989500 had a related patch set uploaded (by JMeybohm; author: JMeybohm):
[operations/puppet@production] prometheus::k8s: Fix labeldrop actions
Change 989500 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus::k8s: Fix labeldrop actions
With the fixed patch, head series were reduced and id is no longer the top cardinality label. I think we can resolve this.