Clean up/Consolidate kubernetes related dashboards
Closed, Resolved (Public)

Assigned To

Authored By

	JMeybohm
	Feb 24 2021, 3:53 PM

Description

The new kubernetes 1.16 cluster is missing some metrics that we use in dashboards:

https://grafana-rw.wikimedia.org/d/G8zPL7-Wz/kubernetes-node
- Rest Kubelet Client no data -> we lack rest_client_requests_* metrics from job="k8s-node"
- Kubelet HTTP Server no data -> we lack http_request_* metrics from job="k8s-node"
- kubeproxy HTTP Server no data -> we lack http_request_* metrics from job="k8s-node-proxy"
https://grafana-rw.wikimedia.org/d/000000472/kubernetes-staging-kubelets
- Kubelet response times no data -> we lack http_request_* metrics from job="k8s-node"

Other metrics from k8s-node and k8s-node-proxy are available (like the Go default ones), so I assume the names changed or something similar.
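A minimal sketch of how the gaps listed above could be checked, given the metric names each job currently exposes. The patterns and job names come from the description; the helper `missing_patterns` and the sample data are hypothetical, and fetching the real names from Prometheus is left out.

```python
from fnmatch import fnmatchcase

# Patterns and job names as listed in the task description.
EXPECTED = {
    "k8s-node": ["rest_client_requests_*", "http_request_*"],
    "k8s-node-proxy": ["http_request_*"],
}


def missing_patterns(available, expected=EXPECTED):
    """Return {job: [patterns that match none of the job's metric names]}.

    `available` maps a job name to the metric names it currently exposes.
    """
    report = {}
    for job, patterns in expected.items():
        names = available.get(job, [])
        absent = [
            pattern
            for pattern in patterns
            if not any(fnmatchcase(name, pattern) for name in names)
        ]
        if absent:
            report[job] = absent
    return report


# Hypothetical sample data, for illustration only.
sample = {
    "k8s-node": ["go_goroutines", "kubelet_http_requests_total"],
    "k8s-node-proxy": ["go_goroutines"],
}
print(missing_patterns(sample))
```

A pattern is reported as missing only if no metric name for that job matches it, so a renamed metric shows up here in the same way as a dropped one.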

It seems that we have quite a few dashboards/graphs that are duplicates and/or of no real use. We should consolidate what we actually want/use, generalize those dashboards (to be usable with any k8s cluster) and drop the rest.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		akosiaris	T244335 Upgrade kubernetes clusters to v1.16
		Resolved		JMeybohm	T275641 Clean up/Consolidate kubernetes related dashboards

Event Timeline

JMeybohm triaged this task as Medium priority. Feb 24 2021, 3:53 PM

JMeybohm created this task.

https://grafana-rw.wikimedia.org/d/G8zPL7-Wz/kubernetes-node
- http_request_*_seconds_* metrics from job="k8s-node" seem to have been refactored to kubelet_http_requests_*_seconds_* (and turned into histogram buckets)
- http_*_size_* metrics seem to have been dropped
- rest_client_requests_* metrics seem to have been dropped
- http_request_* metrics from job="k8s-node-proxy" seem to have been dropped altogether (but they are probably not that useful anyway, as I guess they only cover metric scrapes)

Corresponding dashboard commit is: Update "Kubelet HTTP Server" row for k8s 1.16
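For reference, a sketch of what a panel query against the histogram-style metric could look like, built here as a string in Python. The concrete metric name `kubelet_http_requests_duration_seconds` is an assumption (the comment above only gives the pattern kubelet_http_requests_*_seconds_*), and the helper name and defaults are hypothetical; this is not taken from the dashboard commit.

```python
def kubelet_latency_quantile(
    quantile,
    job="k8s-node",
    window="5m",
    # Assumed metric name; the task only states kubelet_http_requests_*_seconds_*.
    metric="kubelet_http_requests_duration_seconds",
):
    """Build a PromQL expression for a latency quantile from histogram buckets."""
    if not 0 <= quantile <= 1:
        raise ValueError("quantile must be between 0 and 1")
    # Bucket series are cumulative counters, so take rate() before the quantile.
    selector = f'{metric}_bucket{{job="{job}"}}'
    return (
        f"histogram_quantile({quantile}, "
        f"sum by (le) (rate({selector}[{window}])))"
    )


print(kubelet_latency_quantile(0.99))
```

Aggregating with `sum by (le)` keeps the bucket boundaries that `histogram_quantile` needs while dropping every other label; a per-node panel would have to add the node label to the `by` clause.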

https://grafana-rw.wikimedia.org/d/000000472/kubernetes-staging-kubelets
http_request_*_seconds_* metrics from job="k8s-node" seem to have been refactored to kubelet_http_requests_*_seconds_* (and turned into histogram buckets)

Corresponding dashboard commit is: Update "Kubelet response time" for k8s 1.16

Unfortunately, the values from kubelet_http_requests_*_seconds_* do not really add up (they are far smaller in staging-codfw than in staging-eqiad), so maybe my assumption is wrong and they mean something completely different.

JMeybohm renamed this task from Investigate/Fix missing metrics from k8s-node and k8s-node-proxy jobs to Clean up/Consolidate kubernetes related dashboards. Mar 24 2021, 9:13 AM

JMeybohm updated the task description.

JMeybohm added a subscriber: elukey.

elukey awarded a token. Mar 24 2021, 1:01 PM

I've deleted the 3 staging-specific dashboards, consolidated the rest, and made them more consistent. They now all use a $cluster variable to denote the Prometheus cluster they query, and they all carry a list of links to the other dashboards tagged "Kubernetes" and "platform" (both tags need to apply for the Grafana links filter). The overview dashboard lists some basic data such as pods, nodes, and API requests per second for all clusters, one cluster per row.
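To illustrate the one-row-per-cluster idea, here is a rough sketch that expands a single templated query once per cluster. How $cluster is actually wired in the dashboards is not stated in this task; the sketch assumes it ends up as a `cluster` label in the selector, and both the query and the helper `render_rows` are hypothetical.

```python
from string import Template

# Hypothetical panel query; the `cluster` label is an assumption, not the
# dashboards' real wiring.
PANEL_QUERY = Template('count(up{job="k8s-node", cluster="$cluster"})')


def render_rows(clusters, query=PANEL_QUERY):
    """Return one (cluster, rendered query) pair per overview row."""
    return [(name, query.substitute(cluster=name)) for name in clusters]


for name, expr in render_rows(["staging-eqiad", "staging-codfw"]):
    print(f"{name}: {expr}")
```

Keeping a single template and varying only the cluster value is what lets the same panel definition serve any cluster, including ones added later.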

I think we've taken action to make this more consistent and presentable, allowing for multiple clusters in our environment, so it looks resolved on my side. As @JMeybohm points out above, a few metrics have disappeared without a replacement; there is not much we can do about that aside from deleting the relevant panels.

I'll be bold and resolve, but feel free to reopen.

I see that all dashboards now include k8s-mlserve, thanks a lot!
