
toolforge: Scrape Kubernetes controller-manager and apiserver metrics into Prometheus
Open, Needs Triage, Public

Description

All of the K8s control plane components emit Prometheus metrics, but unfortunately only the API server is currently being scraped by the Toolforge Prometheus server.

One challenge is how to connect to them. The API server is obviously exposed publicly, but the remaining components don't seem to be:

root@tools-k8s-control-2:~# ss -tulpn |grep kube-co
tcp     LISTEN   0        1024           127.0.0.1:10257          0.0.0.0:*      users:(("kube-controller",pid=2332,fd=7))
root@tools-k8s-control-2:~# curl -k https://localhost:10257/metrics
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}
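
The 403 above is just because the request is anonymous: in a kubeadm setup the controller-manager delegates authentication and authorization to the API server, so a client presenting a credential mapped to a user allowed to "get" the /metrics non-resource URL should get past it. A minimal sketch of the RBAC side, assuming the Prometheus client certificate maps to a user named "toolforge-prometheus" (that name is a guess, check the cert's CN):

cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-metrics-reader
rules:
  # /metrics is a non-resource URL served by the control plane components
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-metrics-reader
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: toolforge-prometheus  # assumed: the CN of the Prometheus client cert
EOF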

root@tools-prometheus-03:~# curl -k --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt --key /etc/ssl/private/toolforge-k8s-prometheus.key https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/kube-system/pods/kube-controller-manager-tools-k8s-control-2:10257/proxy/metrics
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "error trying to reach service: dial tcp 172.16.0.93:10257: connect: connection refused",
  "reason": "ServiceUnavailable",
  "code": 503
}
root@tools-prometheus-03:~# curl -k --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt --key /etc/ssl/private/toolforge-k8s-prometheus.key https://tools-k8s-control-2:10257/metrics
curl: (7) Failed to connect to tools-k8s-control-2 port 10257: Connection refused
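
The connection-refused errors line up with the ss output above: the controller-manager only binds 127.0.0.1, so neither the API server proxy (which dials the node address) nor a direct scrape can reach it. The bind address should be visible in the kubeadm static pod manifests on the control nodes, e.g.:

# Expectation (kubeadm default): --bind-address=127.0.0.1 for both components
grep -- 'address' /etc/kubernetes/manifests/kube-controller-manager.yaml \
                  /etc/kubernetes/manifests/kube-scheduler.yaml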

Event Timeline

This might be useful: https://sysdig.com/blog/how-to-monitor-kube-controller-manager/
But it seems to assume Prometheus runs from within the k8s cluster, so we might have to expose those ports using custom services or use some sort of agent/proxy.
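
For the agent/proxy idea, the smallest possible illustration would be something on each control node forwarding a node-addressable port to the loopback-only one, for example with socat. This is only a sketch of the shape, not a deployment proposal; the node IP is taken from the 503 above, and a real setup would be a puppetised systemd unit:

# Forward the node address to the loopback-only controller-manager port. The TLS
# stream passes through untouched, so Prometheus (or the API server proxy, which
# dials <node-ip>:10257) can scrape it with the same client cert as before.
socat TCP4-LISTEN:10257,bind=172.16.0.93,fork,reuseaddr TCP4:127.0.0.1:10257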

Poking around to learn how this is handled in the production k8s clusters might be helpful? There are some teaser docs at https://wikitech.wikimedia.org/wiki/Kubernetes/Metrics. Those docs also point to https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config

Not really :( The control plane setup is one of the key differences between our kubeadm setup and the production apt-package-based setup. Also, controller-manager and scheduler are listed as "We don't scrape this component yet" on that page.
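
Even so, the linked kubernetes_sd_config docs suggest a shape for an out-of-cluster job: discover the kube-system pods through the API server and relabel the target onto the API server proxy path (the same path as the curl attempt in the description), so Prometheus only ever talks to port 6443. A rough, untested sketch, written to a scratch file since the real config is puppet-managed; job name and relabeling details are guesses, and it still needs the connection-refused problem fixed first, since the proxy dials the node address:

cat > /tmp/kube-controller-manager-scrape.yaml <<'EOF'
# Candidate entry for scrape_configs in the Toolforge Prometheus config
- job_name: k8s-controller-manager
  scheme: https
  tls_config:
    cert_file: /etc/ssl/localcerts/toolforge-k8s-prometheus.crt
    key_file: /etc/ssl/private/toolforge-k8s-prometheus.key
    insecure_skip_verify: true
  kubernetes_sd_configs:
    - role: pod
      api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
      namespaces:
        names: [kube-system]
      tls_config:
        cert_file: /etc/ssl/localcerts/toolforge-k8s-prometheus.crt
        key_file: /etc/ssl/private/toolforge-k8s-prometheus.key
        insecure_skip_verify: true
  relabel_configs:
    # Keep only the controller-manager static pods
    - source_labels: [__meta_kubernetes_pod_name]
      regex: kube-controller-manager-.*
      action: keep
    # Scrape through the API server proxy instead of the pod address
    - target_label: __address__
      replacement: k8s.tools.eqiad1.wikimedia.cloud:6443
    - source_labels: [__meta_kubernetes_pod_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/namespaces/kube-system/pods/${1}:10257/proxy/metrics
EOF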

Useful reading:

If there's an easy way to get them to listen on all interfaces, let's do that; otherwise, using a proxy seems like the best option.
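
If the kubeadm-generated manifests use the default --bind-address=127.0.0.1, the "easy way" could be as small as flipping that flag in the static pod manifests on each control node (the kubelet restarts the pods when the files change). Untested sketch; note that kubeadm upgrades regenerate these manifests, so the flag would also have to be carried in the kubeadm ClusterConfiguration (controllerManager/scheduler extraArgs) to survive upgrades:

# Make the controller-manager (10257) and scheduler (10259) listen on all interfaces
sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' \
    /etc/kubernetes/manifests/kube-controller-manager.yaml \
    /etc/kubernetes/manifests/kube-scheduler.yaml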