
Kubernetes 1.16 dropped deprecated cadvisor metric labels pod_name and container_name
Closed, Resolved · Public

Description

With kubernetes 1.16 the cadvisor metric labels pod_name and container_name have been dropped (deprecated since 1.14) in favor of pod and container.

We have several grafana dashboards relying on those labels, like:

Going forward we will probably have to fix all dashboards to use the new labels, but we will have different versions running in production for some time, which would mean dashboards only work for either the new or the old clusters (and we probably want to compare things between them).

So I suggest we add a rewrite rule to prometheus to duplicate the pod_name and container_name labels into the new ones and patch all dashboards after we have ingested a reasonable amount of history.
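
A minimal sketch of what such a rewrite rule could look like, assuming it lives in the metric_relabel_configs of the cadvisor scrape job (the job placement and exact config layout here are assumptions, not our actual prometheus config):

  metric_relabel_configs:
    # Copy the deprecated labels into their new names, but only when the
    # old label is non-empty, so an existing pod/container label on a
    # 1.16+ cluster is never overwritten with an empty value.
    - source_labels: [pod_name]
      regex: '(.+)'
      target_label: pod
      replacement: '$1'
      action: replace
    - source_labels: [container_name]
      regex: '(.+)'
      target_label: container
      replacement: '$1'
      action: replace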

Upstream issue: https://github.com/kubernetes/kubernetes/pull/80376

Event Timeline

JMeybohm triaged this task as Medium priority. Feb 24 2021, 1:50 PM
JMeybohm created this task.

I'll try and draft a possible way out of this. I am adding members of the observability team for input.

So the crux of the issue is that queries like this

sum(rate(container_cpu_user_seconds_total{job="k8s-node-cadvisor", namespace="$service", site="$site", prometheus="$prometheus", pod_name=~"$service.*", container_name=~"$service.*"}[5m]))

will not return results for clusters that are 1.16+ because pod_name and container_name don't exist anymore. The proposal I got is the following:

  • Fetch the JSON for all the dashboards under the Service/ and Kubernetes/ grafana folders.
  • Programmatically parse the JSON, and whenever a query like the above is met, add an extra query that will just have pod instead of pod_name and container instead of container_name (see the sketch after this list).
  • Save the new JSON
  • Upload it to grafana (possibly manually but it's not that big a deal, it's like 40 dashboards or so).
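
As a sketch, for the CPU query above the extra query would be the same expression with just the labels renamed:

sum(rate(container_cpu_user_seconds_total{job="k8s-node-cadvisor", namespace="$service", site="$site", prometheus="$prometheus", pod=~"$service.*", container=~"$service.*"}[5m]))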

This will double the requests to prometheus/thanos for the migration period (which should last roughly as much time as thanos will keep the old data around).

How does the above sound?


This solution is what we ended up doing for nearly all node exporter metrics in T213708. To this day, nearly every panel on the Host Overview dashboard has an or clause.
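
For illustration, the or variant of the CPU query from the proposal would look roughly like this (a sketch; the exact grouping and matchers will differ per panel):

sum(rate(container_cpu_user_seconds_total{job="k8s-node-cadvisor", namespace="$service", site="$site", prometheus="$prometheus", pod_name=~"$service.*", container_name=~"$service.*"}[5m]))
or
sum(rate(container_cpu_user_seconds_total{job="k8s-node-cadvisor", namespace="$service", site="$site", prometheus="$prometheus", pod=~"$service.*", container=~"$service.*"}[5m]))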

The proposed solution SGTM.


Yup, I must admit publicly that this trick was my inspiration for this proposal. But I had not thought of the or clause; that's actually neat.


Many thanks!

Just chiming in to say that FWIW I concur with the or usage trick!

Using the python grafcli package I fetched all the dashboards under the Service/ folder, then sent them through the attached python script that did a number of regex substitutions.
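
The script itself is attached to the task; below is only a minimal, hypothetical sketch of a regex pass of this kind (file handling and substitutions are illustrative, the attached script's actual rewrites may differ):

  #!/usr/bin/env python3
  # Illustrative only: rewrites the dropped cadvisor label names in
  # exported Grafana dashboard JSON. Operating on the raw JSON text
  # catches both PromQL expressions (pod_name=~"...") and legend
  # templates ({{ pod_name }}).
  import re
  import sys

  SUBSTITUTIONS = [
      (re.compile(r'\bpod_name\b'), 'pod'),
      (re.compile(r'\bcontainer_name\b'), 'container'),
  ]

  def rewrite(text):
      for pattern, replacement in SUBSTITUTIONS:
          text = pattern.sub(replacement, text)
      return text

  if __name__ == '__main__':
      for path in sys.argv[1:]:
          with open(path) as f:
              original = f.read()
          updated = rewrite(original)
          if updated != original:
              with open(path, 'w') as f:
                  f.write(updated)
              print('updated', path)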

I've manually imported the resulting JSONs, checking the dashboards as I went. Most things are working just fine. There are a couple of issues here and there:

  • Legends that used to be in the form blah {{ pod_name }} and are now blah {{ pod }} no longer show the actual pod name for the old clusters. That's fine since we will soon be switching the production clusters to the new kubernetes version and that level of granularity will not be required for long.
  • The template variable container_name is used in some dashboards. Rows using it (e.g. the Saturation for $container_name row) only work with the new scheme, as I found no way to convince grafana (and its label_values() function) to populate data from 2 different labels (even for the same metric), nor have I found a way to concatenate them. Again, this should be fine, same reasoning as above.

Ran the same thing on the Kubernetes/ folder. I only had to update 3 dashboards:

  • Kubernetes DNS
  • Kubernetes Staging pods
  • Kubernetes pods

All other dashboards had no diff.

akosiaris claimed this task.

I am gonna resolve this. I think we've updated everything we cared about (barring the cases mentioned in the comments above) and thus are ready to proceed with migrating the production clusters too.