
toolforge: Scrape Kubernetes controller-manager and apiserver metrics into Prometheus
Open, Needs Triage, Public

Description

All of the K8s control plane components emit Prometheus metrics, but unfortunately only the API server is currently being scraped by the Toolforge Prometheus server.

One challenge is how to connect to them. The API server is obviously exposed publicly, but the remaining components don't seem to be:

root@tools-k8s-control-2:~# ss -tulpn |grep kube-co
tcp     LISTEN   0        1024           127.0.0.1:10257          0.0.0.0:*      users:(("kube-controller",pid=2332,fd=7))
root@tools-k8s-control-2:~# curl -k https://localhost:10257/metrics
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}
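
The 403 above is just because the request is anonymous: in a kubeadm setup the controller-manager delegates authentication and authorization to the API server, so a client presenting a credential mapped to a user allowed to "get" the /metrics non-resource URL should get past it. A minimal sketch of the RBAC side, assuming the Prometheus client certificate maps to a user named "toolforge-prometheus" (that name is a guess, check the cert's CN):

cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-metrics-reader
rules:
  # /metrics is a non-resource URL served by the control plane components
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-metrics-reader
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: toolforge-prometheus  # assumed: the CN of the Prometheus client cert
EOF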

root@tools-prometheus-03:~# curl -k --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt --key /etc/ssl/private/toolforge-k8s-prometheus.key https://k8s.tools.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/kube-system/pods/kube-controller-manager-tools-k8s-control-2:10257/proxy/metrics
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "error trying to reach service: dial tcp 172.16.0.93:10257: connect: connection refused",
  "reason": "ServiceUnavailable",
  "code": 503
}
root@tools-prometheus-03:~# curl -k --cert /etc/ssl/localcerts/toolforge-k8s-prometheus.crt --key /etc/ssl/private/toolforge-k8s-prometheus.key https://tools-k8s-control-2:10257/metrics
curl: (7) Failed to connect to tools-k8s-control-2 port 10257: Connection refused
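
The connection-refused errors line up with the ss output above: the controller-manager only binds 127.0.0.1, so neither the API server proxy (which dials the node address) nor a direct scrape can reach it. The bind address should be visible in the kubeadm static pod manifests on the control nodes, e.g.:

# Expectation (kubeadm default): --bind-address=127.0.0.1 for both components
grep -- 'address' /etc/kubernetes/manifests/kube-controller-manager.yaml \
                  /etc/kubernetes/manifests/kube-scheduler.yaml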

Event Timeline

This might be useful: https://sysdig.com/blog/how-to-monitor-kube-controller-manager/
But it seems to assume Prometheus runs from within the k8s cluster, so we might have to expose those ports using custom services or use some sort of agent/proxy.
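
For the agent/proxy idea, the smallest possible illustration would be something on each control node forwarding a node-addressable port to the loopback-only one, for example with socat. This is only a sketch of the shape, not a deployment proposal; the node IP is taken from the 503 above, and a real setup would be a puppetised systemd unit:

# Forward the node address to the loopback-only controller-manager port. The TLS
# stream passes through untouched, so Prometheus (or the API server proxy, which
# dials <node-ip>:10257) can scrape it with the same client cert as before.
socat TCP4-LISTEN:10257,bind=172.16.0.93,fork,reuseaddr TCP4:127.0.0.1:10257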

Poking around to learn how this is handled in the production k8s clusters might be helpful? There are some teaser docs at https://wikitech.wikimedia.org/wiki/Kubernetes/Metrics. Those docs also point to https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config

Not really :( The control plane setup is one of the key differences between our kubeadm setup and the production apt-package-based setup. Also, controller-manager and scheduler are listed as "We don't scrape this component yet" on that page.
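
Even so, the linked kubernetes_sd_config docs suggest a shape for an out-of-cluster job: discover the kube-system pods through the API server and relabel the target onto the API server proxy path (the same path as the curl attempt in the description), so Prometheus only ever talks to port 6443. A rough, untested sketch, written to a scratch file since the real config is puppet-managed; job name and relabeling details are guesses, and it still needs the connection-refused problem fixed first, since the proxy dials the node address:

cat > /tmp/kube-controller-manager-scrape.yaml <<'EOF'
# Candidate entry for scrape_configs in the Toolforge Prometheus config
- job_name: k8s-controller-manager
  scheme: https
  tls_config:
    cert_file: /etc/ssl/localcerts/toolforge-k8s-prometheus.crt
    key_file: /etc/ssl/private/toolforge-k8s-prometheus.key
    insecure_skip_verify: true
  kubernetes_sd_configs:
    - role: pod
      api_server: https://k8s.tools.eqiad1.wikimedia.cloud:6443
      namespaces:
        names: [kube-system]
      tls_config:
        cert_file: /etc/ssl/localcerts/toolforge-k8s-prometheus.crt
        key_file: /etc/ssl/private/toolforge-k8s-prometheus.key
        insecure_skip_verify: true
  relabel_configs:
    # Keep only the controller-manager static pods
    - source_labels: [__meta_kubernetes_pod_name]
      regex: kube-controller-manager-.*
      action: keep
    # Scrape through the API server proxy instead of the pod address
    - target_label: __address__
      replacement: k8s.tools.eqiad1.wikimedia.cloud:6443
    - source_labels: [__meta_kubernetes_pod_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/namespaces/kube-system/pods/${1}:10257/proxy/metrics
EOF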

Useful reading:

If there's an easy way to get them to listen on all interfaces, let's do that; otherwise, using a proxy seems like the best option.
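
If the kubeadm-generated manifests use the default --bind-address=127.0.0.1, the "easy way" could be as small as flipping that flag in the static pod manifests on each control node (the kubelet restarts the pods when the files change). Untested sketch; note that kubeadm upgrades regenerate these manifests, so the flag would also have to be carried in the kubeadm ClusterConfiguration (controllerManager/scheduler extraArgs) to survive upgrades:

# Make the controller-manager (10257) and scheduler (10259) listen on all interfaces
sed -i 's/--bind-address=127.0.0.1/--bind-address=0.0.0.0/' \
    /etc/kubernetes/manifests/kube-controller-manager.yaml \
    /etc/kubernetes/manifests/kube-scheduler.yaml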