
'helm_release_status' metric not found in k8s-mlserve
Closed, Resolved (Public)

Description

I'm going through all the pint errors, and the following one is reported (link to AlertLintProblem):

Pint reporter promql/series found problem(s) in /srv/alerts/k8s-mlserve/team-sre_kubernetes-generic.yaml: prometheus "k8s-mlserve" at http://127.0.0.1:9909/k8s-mlserve didn't have any series for "helm_release_status" metric in the last 1w
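To reproduce the check by hand, one option is to ask the same Prometheus instance whether any series for the metric existed in the last week. The following is only a minimal sketch: it assumes the standard Prometheus HTTP API (`/api/v1/query`) is reachable under the `k8s-mlserve` path prefix shown in the error, and the helper names `build_series_check_url` and `series_count` are hypothetical, not part of pint or any existing tooling.

```python
import json
from urllib.parse import urlencode


def build_series_check_url(base_url: str, metric: str, window: str = "1w") -> str:
    """Build an instant-query URL that counts the series seen for `metric` within `window`."""
    # count_over_time yields one sample per series that had data in the window;
    # the outer count() collapses that into a single number of series.
    query = f"count(count_over_time({metric}[{window}]))"
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def series_count(response_body: str) -> int:
    """Return the number of series reported by the query above, or 0 for an empty result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    # An empty vector means no series existed in the window, which is what pint complained about.
    return int(float(result[0]["value"][1])) if result else 0


# Assumed base URL, taken from the pint message above.
url = build_series_check_url("http://127.0.0.1:9909/k8s-mlserve", "helm_release_status")
```

Fetching `url` (for example with `urllib.request.urlopen(url).read().decode()`) and passing the body to `series_count` should give 0 on a cluster that does not export the metric, and a positive number once it does.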

After a chat with @elukey, he pointed me to T323706: wikikube LIST secrets latency, which introduced the metric. My question now is whether k8s-mlserve is expected not to carry the metric, whether we could expand the helm status metrics to run on all k8s clusters, or whether there is some other solution I can't see right now. What do you think, @JMeybohm? Thank you!

Event Timeline

There is no reason not to deploy helm-state-metrics to all clusters; I just left that as a decision for the cluster operators to make rather than deploying it everywhere.

Thank you, that makes sense! Do you see any problem with deploying helm-state-metrics to all clusters? Alternatively, we could select specific instances via deploy-tag, though that might be more brittle. Finally, we could also just ignore the "missing metrics" lint error.

I don't see any problem with deploying it to all clusters. @BTullis, @jhathaway wdyt?
For reference, this is what we're talking about: https://gerrit.wikimedia.org/g/operations/software/helm-state-metrics

Change 922474 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Install helm-state-metrics by default on all clusters

https://gerrit.wikimedia.org/r/922474

Change 922474 merged by jenkins-bot:

[operations/deployment-charts@master] Install helm-state-metrics by default on all clusters

https://gerrit.wikimedia.org/r/922474

I've deployed helm-state-metrics to all clusters and the LintProblem alerts are gone.
Here is a link to the helm releases dashboard for the record: https://grafana.wikimedia.org/d/UT4GtK3nz/helm-releases

JMeybohm claimed this task.