
Clean up/Consolidate kubernetes related dashboards
Closed, ResolvedPublic

Description

The new kubernetes 1.16 cluster is missing some metrics that we use in dashboards:

Other metrics from k8s-node and k8s-node-proxy are available (like the Go default ones), so I assume the names changed or something similar.

It seems that we have quite a few dashboards/graphs that are duplicates and/or of no real use. We should consolidate what we actually want/use, generalize those dashboards (to be usable with any k8s cluster) and drop the rest.

Event Timeline

JMeybohm triaged this task as Medium priority.Feb 24 2021, 3:53 PM
JMeybohm created this task.
  • https://grafana-rw.wikimedia.org/d/G8zPL7-Wz/kubernetes-node
    • http_request_*_seconds_* metrics from job="k8s-node" seem to have been refactored to kubelet_http_requests_*_seconds_* (and converted to histogram buckets)
    • http_*_size_* metrics seem to have been dropped
    • rest_client_requests_* metrics seem to have been dropped
    • http_request_* metrics from job="k8s-node-proxy" seem to have been dropped altogether (but they are probably not that useful anyways as they'll just contain metric scrapes I guess)
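
Since the request-duration metrics now appear to be exposed as histogram buckets, percentiles have to be estimated from cumulative bucket counters rather than read directly. The following is a minimal Python sketch of that estimation (linear interpolation within a bucket, similar in spirit to what PromQL's histogram_quantile does). The function name and the sample bucket values are made up for illustration and are not taken from the cluster.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) tuples, sorted by
    upper_bound and ending with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls into +Inf: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket counters (seconds, cumulative count).
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
```

The estimate is only as precise as the bucket boundaries, so panels built on the old summary-style metrics will not match exactly after the switch.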

Corresponding dashboard commit is: Update "Kubelet HTTP Server" row for k8s 1.16

Corresponding dashboard commit is: Update "Kubelet response time" for k8s 1.16

Unfortunately the values from kubelet_http_requests_*_seconds_* do not really add up (they are way smaller in staging-codfw than in staging-eqiad), so maybe my assumption is wrong and they mean something completely different.
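
One way to compare the two staging clusters independently of request volume would be to look at mean latency, i.e. the rate of the _sum series divided by the rate of the _count series. This is only a sketch and assumes the new series are standard Prometheus histograms exposing _sum and _count, which is not confirmed in this task. The helper name is hypothetical, and the metric base name in the usage line is an assumed expansion of the kubelet_http_requests_*_seconds_* pattern.

```python
def mean_latency_query(metric_base: str, job: str, window: str = "5m") -> str:
    """Build a PromQL expression for mean request latency.

    metric_base is the histogram name without the _sum/_count suffix.
    """
    selector = f'{{job="{job}"}}'
    total_seconds = f"sum(rate({metric_base}_sum{selector}[{window}]))"
    total_requests = f"sum(rate({metric_base}_count{selector}[{window}]))"
    return f"{total_seconds} / {total_requests}"


# Assumed metric name; adjust to whatever the kubelet actually exposes.
query = mean_latency_query("kubelet_http_requests_duration_seconds", "k8s-node")
```

If the mean latencies are comparable between staging-codfw and staging-eqiad, the difference in raw values may simply reflect different request volumes or uptimes.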

JMeybohm renamed this task from Investigate/Fix missing metrics from k8s-node and k8s-node-proxy jobs to Clean up/Consolidate kubernetes related dashboards.Mar 24 2021, 9:13 AM
JMeybohm updated the task description. (Show Details)
JMeybohm added a subscriber: elukey.

I've deleted the 3 staging-specific dashboards, consolidated the rest, and made them more consistent. They now all use a $cluster variable to denote the Prometheus cluster they query, and they all have a list of other dashboards tagged "Kubernetes" and "platform" (both tags need to apply for the Grafana links filter). The overview dashboard lists some basic data like pods/nodes/API rps for all clusters, one per row.
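
To illustrate the "both tags need to apply" behaviour of the links filter, here is a small Python sketch of an all-tags-must-match selection. The dashboard titles and the extra tag are invented for the example; only the "Kubernetes" and "platform" tags come from the comment above.

```python
def linked_dashboards(dashboards, required_tags):
    """Return titles of dashboards that carry every one of required_tags."""
    required = set(required_tags)
    return [d["title"] for d in dashboards if required <= set(d["tags"])]


# Hypothetical dashboard inventory.
dashboards = [
    {"title": "Kubernetes overview", "tags": ["Kubernetes", "platform"]},
    {"title": "Kubernetes node", "tags": ["Kubernetes", "platform", "kubelet"]},
    {"title": "Some service", "tags": ["Kubernetes"]},
]

links = linked_dashboards(dashboards, ["Kubernetes", "platform"])
```

A dashboard tagged only "Kubernetes" is left out of the list, so any new dashboard that should show up in the links needs both tags.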

I think we've made this more consistent and presentable, allowing for multiple clusters in our environment, so it looks resolved on my side. A few metrics have disappeared without a replacement, as @JMeybohm points out above; there is not much we can do about that aside from deleting the relevant panels.

I'll be bold and resolve, but feel free to reopen.

I see that all dashboards now include k8s-mlserve, thanks a lot!