Investigate prometheus@k8s metric/label cardinality reduction
Closed, Resolved (Public)

Description

As outlined in the parent task, metric/label churn during k8s deploys is sometimes so high that prometheus@k8s gets OOM-killed. As a short-term mitigation, @JMeybohm and I have been looking for low-hanging fruit: high-cardinality metrics or labels that we can safely drop.

Event Timeline

The highest-cardinality label is id, which is used heavily by cadvisor and contains the slice id of the container (like: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod143a77d4_c340_47e0_966c_ae0c06b977b4.slice/docker-b1d6065dd088bb283a27edc9dd9478d0d61bac1ad7846e740f9b56ee80c4b7d5.scope). I've skimmed Grafana and I don't think we use that label anywhere in a k8s context. Metrics are usually matched by name (container) and pod_name.
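A minimal sketch of how a label such as id can be identified as the top-cardinality one, assuming the data comes from Prometheus' TSDB status endpoint (/api/v1/status/tsdb), which reports the number of distinct values per label name. The task does not say which method was actually used here; the sample payload and the function name top_cardinality_labels are hypothetical, and all numbers are made up.

```python
import json

# Hypothetical, trimmed-down payload in the shape returned by
# GET /api/v1/status/tsdb. All counts are invented for illustration.
SAMPLE_TSDB_STATUS = json.loads("""
{
  "status": "success",
  "data": {
    "labelValueCountByLabelName": [
      {"name": "pod_name", "value": 9120},
      {"name": "id", "value": 48213},
      {"name": "__name__", "value": 2301},
      {"name": "name", "value": 8754}
    ]
  }
}
""")


def top_cardinality_labels(tsdb_status, limit=3):
    """Return (label, distinct_value_count) pairs, highest count first."""
    entries = tsdb_status["data"]["labelValueCountByLabelName"]
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:limit]]


if __name__ == "__main__":
    for label, count in top_cardinality_labels(SAMPLE_TSDB_STATUS):
        print(f"{label}: {count} distinct values")
```

Against a live server the same function could be fed the decoded JSON body of that endpoint instead of the sample payload.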

Change 989096 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus::k8s: Drop id label from cadvisor metrics

https://gerrit.wikimedia.org/r/989096

Change 989096 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus::k8s: Drop id label from cadvisor metrics

https://gerrit.wikimedia.org/r/989096

Mentioned in SAL (#wikimedia-operations) [2024-01-09T10:54:20Z] <godog> restart prometheus@k8s on prometheus1005 to see if labeldrop id will yield expected results - T354604
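For readers unfamiliar with the labeldrop relabel action, here is a rough Python simulation of what it does to a single label set: the regex is matched, fully anchored, against label names, and every matching label is removed. This is only a sketch, not the actual puppet change; the series below are invented, and Python's re module stands in for the RE2 syntax that Prometheus uses, which behaves the same for a plain name like id.

```python
import re


def labeldrop(labels, pattern):
    """Mimic the Prometheus `labeldrop` action on one label set.

    The pattern is matched against label *names* (not values) and must
    match the whole name, mirroring Prometheus' anchored regexes.
    """
    regex = re.compile(pattern)
    return {k: v for k, v in labels.items() if not regex.fullmatch(k)}


def series_identity(labels):
    """Hashable identity of a series: its full, sorted label set."""
    return tuple(sorted(labels.items()))


# Hypothetical cadvisor-style series; all values are illustrative only.
SERIES = [
    {"__name__": "container_cpu_usage_seconds_total", "pod_name": "app-1",
     "name": "k8s_app", "id": "/example.slice/docker-aaa.scope"},
    {"__name__": "container_cpu_usage_seconds_total", "pod_name": "app-2",
     "name": "k8s_app", "id": "/example.slice/docker-bbb.scope"},
]

DROPPED = [labeldrop(s, "id") for s in SERIES]

if __name__ == "__main__":
    for labels in DROPPED:
        print(labels)
    # If two series collapse to the same identity after the drop, the
    # dropped label was the only thing that distinguished them.
    unique = {series_identity(s) for s in DROPPED}
    print(f"{len(DROPPED)} series, {len(unique)} unique after dropping id")
```

Because the match is anchored, a pattern of id leaves labels such as pod_id or id_foo untouched. The uniqueness check matters in practice: the label may only be dropped safely when the remaining labels (here pod_name and name) still identify each series.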

While this did cut the cardinality for id in half, it unfortunately made no real difference to memory usage or appended samples per second, which I had expected it would. OTOH I would also have expected the cardinality to drop much more sharply, as there are only two other metrics (apart from the cadvisor ones) that use the "id" label:
https://prometheus-eqiad.wikimedia.org/k8s/classic/graph?g0.range_input=1h&g0.expr=group%20(%7Bid!%3D%22%22%2C%20job!%3D%22k8s-node-cadvisor%22%7D)%20by%20(__name__%2C%20job)&g0.tab=1

Mentioned in SAL (#wikimedia-operations) [2024-01-09T15:54:33Z] <jayme> restart prometheus@k8s on prometheus1005 with GOGC=60 - T354604

Mentioned in SAL (#wikimedia-operations) [2024-01-09T16:26:57Z] <jayme> restart prometheus@k8s on prometheus1005 revert GOGC to 100 (default) - T354604
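Background on the GOGC experiment above, as a simplified sketch: Go's garbage collector starts a new cycle once the heap has grown by GOGC percent over the live heap left after the previous cycle, so a lower value trades more GC CPU time for a lower peak heap. The formula below ignores GC roots, and the 10 GiB live heap is a made-up number, not a measurement from prometheus1005.

```python
def gc_target_heap(live_heap, gogc):
    """Approximate heap size at which Go triggers the next GC cycle.

    Simplified model: target = live_heap * (1 + GOGC / 100).
    GC roots (stacks, globals) are deliberately ignored here.
    """
    return live_heap * (100 + gogc) / 100


if __name__ == "__main__":
    live_heap_gib = 10  # hypothetical live heap, in GiB
    for gogc in (100, 60):
        target = gc_target_heap(live_heap_gib, gogc)
        print(f"GOGC={gogc}: next GC at about {target} GiB")
```

Under this model, going from the default of 100 to 60 lowers the trigger point from twice the live heap to 1.6 times the live heap.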

Change 989500 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus::k8s: Fix labeldrop actions

https://gerrit.wikimedia.org/r/989500

Change 989500 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus::k8s: Fix labeldrop actions

https://gerrit.wikimedia.org/r/989500

JMeybohm claimed this task.

With the fixed patch, head series were reduced and id is no longer the top-cardinality label. I think we can resolve this.