
Improve monitoring of the Kubernetes clusters
Closed, Resolved, Public

Description

Objective

We currently have minimal monitoring of our production kubernetes clusters. Starting with 1.8 the /metrics API exists, which exposes CPU and memory usage. There is also a nice aggregator, available from 1.7 onwards, called metrics server [1] that can be used as well (or in tandem). And there is also heapster [2], an aggregator for monitoring and event data. We should investigate these solutions (and perhaps others), pick one (or more), implement it and obtain graphs for our kubernetes clusters.

[1] https://github.com/kubernetes-incubator/metrics-server
[2] https://github.com/kubernetes/heapster

Preamble

Metrics collecting/exposing is in a state of flux in kubernetes and already has some history to deal with. Here's a description of the various components/notions, in no specific order.

cAdvisor

Project page is https://github.com/google/cadvisor

So this is a nice little Go binary that runs alongside your containers on your host (as a container or a standalone daemon), usually as root (this is not strictly required, but metrics will just not appear without root privileges), and starts looking at cgroups and getting data out of them. Of course it can run as a docker container, and in that case it can actually query the docker daemon (it is meant to bind-mount /var/lib/docker) about stuff and expose that. It supposedly supports all container engines, and bug reports should be opened for any unsupported one.

It exposes an HTML-based interface [1] and a REST API [2]. It supports a variety of sinks to send data to [3], or allows consumers to pull from it (e.g. prometheus).

cAdvisor, being a simple Go project, was imported into the kubernetes project and built into the kubelet (1 of the 2 kubernetes daemons running on every node). So every kubelet listens on port 4194 (unless disabled) and exposes the /containers endpoint (an HTML web page) and /metrics (a prometheus-compatible endpoint).

We have absolutely no reason to care about cAdvisor itself, as we get pretty much its full functionality via the kubelet. We could read the docs and implement something to talk to its API, but IMHO this does not make much sense.

The API server (the one running on the master) exposes that information via a proxying model.

Starting with kubernetes 1.7.3 (we are running 1.7.4) the information from the /metrics endpoint is transparently split into 2 endpoints. Those are:

* api/v1/nodes/<node_name>/proxy/metrics
* api/v1/nodes/<node_name>/proxy/metrics/cadvisor

For 1.7.0-1.7.2 (we don't really care, but adding it for completeness' sake) only api/v1/nodes/<node_name>/proxy/metrics was exposed.

For 1.6 and earlier (production doesn't care but labs might), both endpoints existed, but the former duplicated part of the information from the latter.

Note: The lack of trailing slashes is unfortunately important. Specifying one means a 404 will be returned instead.
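
To make the two per-node paths and the trailing slash caveat concrete, here is a minimal Python sketch. The function name and the example node name are made up for illustration, and the leading slash is only there so the result can be appended to an apiserver base URL.

```python
from urllib.parse import quote


def node_proxy_metrics_paths(node_name):
    """Return the two apiserver proxy paths for one node's kubelet metrics."""
    # Quote the node name so it is safe to use as a single path segment.
    base = "/api/v1/nodes/{}/proxy/metrics".format(quote(node_name, safe=""))
    # Deliberately no trailing slash: per the note above, adding one gets a 404.
    return [base, base + "/cadvisor"]


if __name__ == "__main__":
    # "node1.example.org" is a hypothetical node name.
    for path in node_proxy_metrics_paths("node1.example.org"):
        print(path)
```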

We should have a prometheus configuration that scrapes those 2 endpoints per node, which means we will need some discovery mechanism. Prometheus does look like it has one [4]
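
As a rough idea of what such a configuration could look like, the sketch below builds two scrape jobs (one per endpoint) as Python dicts and prints them for inspection. It follows the node discovery plus relabeling approach of the upstream example in [4]. The job names, the apiserver address, and the CA/token file paths are all hypothetical placeholders, and a real Prometheus config would of course be written out as YAML.

```python
import json

# Hypothetical placeholders, not values from our setup.
APISERVER = "kubemaster.example.org:6443"
CA_FILE = "/etc/prometheus/k8s-ca.crt"
TOKEN_FILE = "/etc/prometheus/k8s-token"


def node_proxy_scrape_job(job_name, path_suffix=""):
    """Build one scrape job that reaches every kubelet through the apiserver proxy."""
    auth = {"tls_config": {"ca_file": CA_FILE}, "bearer_token_file": TOKEN_FILE}
    # Node discovery: Prometheus asks the apiserver for the list of nodes.
    discovery = dict(auth, api_server="https://" + APISERVER, role="node")
    job = dict(auth, job_name=job_name, scheme="https")
    job["kubernetes_sd_configs"] = [discovery]
    job["relabel_configs"] = [
        # Send every scrape to the apiserver instead of the node itself.
        {"target_label": "__address__", "replacement": APISERVER},
        # Rewrite the metrics path to the per-node proxy path (no trailing slash).
        {
            "source_labels": ["__meta_kubernetes_node_name"],
            "regex": "(.+)",
            "target_label": "__metrics_path__",
            "replacement": "/api/v1/nodes/${1}/proxy/metrics" + path_suffix,
        },
    ]
    return job


if __name__ == "__main__":
    jobs = [
        node_proxy_scrape_job("k8s-node"),
        node_proxy_scrape_job("k8s-node-cadvisor", "/cadvisor"),
    ]
    print(json.dumps({"scrape_configs": jobs}, indent=2))
```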

API server and controller metrics

Since kubernetes 1.0 (at least) the /metrics endpoint on the API server has exposed metrics about itself and the controller manager daemons. It's in prometheus format and has been stable for quite a while.

We should have a prometheus configuration that scrapes this endpoint. It's just one endpoint and should be easy to do
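
For anyone unfamiliar with what "prometheus format" looks like on the wire, here is a minimal Python sketch that parses the text exposition format into (name, labels, value) tuples. It is intentionally simplistic (it does not unescape label values and ignores timestamps), and the sample lines in it are made up, not real output from our API server.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse prometheus text format into a list of (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line we understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up sample payload, for illustration only.
EXAMPLE = """
# HELP apiserver_request_count Counter of apiserver requests.
# TYPE apiserver_request_count counter
apiserver_request_count{code="200",verb="GET"} 42
process_cpu_seconds_total 12.5
"""

if __name__ == "__main__":
    for sample in parse_metrics(EXAMPLE):
        print(sample)
```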

This is not to be confused with kube-state-metrics (https://github.com/kubernetes/kube-state-metrics) or the /metrics endpoint of kubelet

Heapster

Project page at https://github.com/kubernetes/heapster

Heapster is effectively a collector. It runs in the cluster (or as a standalone daemon outside it), polls multiple sources (practically always just 1, the kubernetes API server, along with the kubelets' cAdvisor API) and then sends the data it gathered to a sink, as well as exposing it via a REST API [5], albeit for a limited period of time as the data is held in memory. Many different sink types are supported [6]. The example setup seems to be influxdb with a grafana frontend.

The data exposed by heapster is well structured [7]

Overall, heapster serves 2 functions. First, it is a translator/collector from cAdvisor to one of the sink types. Second, it is a grouping (is that a good term?) REST API that avoids having to talk to the REST cAdvisor API of every kubelet. The API it exposes is/can be used by the Horizontal Pod Autoscaler and the scheduler.
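
To illustrate the second function (one grouped, in-memory view instead of asking every kubelet), here is a toy Python sketch. This is NOT heapster's code or data model. The class, its method names, the CPU-only samples and the retention value are all invented, purely to show the idea of a short-lived in-memory store that answers "latest usage per pod" across nodes.

```python
import time
from collections import deque


class InMemoryAggregator:
    """Toy stand-in for a grouping API backed by short-lived in-memory data."""

    def __init__(self, retention_seconds=900):
        self.retention_seconds = retention_seconds
        # Each entry: (timestamp, node, pod, cpu_millicores), oldest first.
        self._samples = deque()

    def add(self, node, pod, cpu_millicores, now=None):
        """Record one sample; samples are assumed to arrive in time order."""
        now = time.time() if now is None else now
        self._samples.append((now, node, pod, cpu_millicores))
        self._expire(now)

    def _expire(self, now):
        # Drop everything older than the retention window.
        cutoff = now - self.retention_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def latest_cpu_by_pod(self, now=None):
        """Return the most recent CPU sample per pod, across all nodes."""
        now = time.time() if now is None else now
        self._expire(now)
        latest = {}
        for _, _, pod, cpu in self._samples:
            latest[pod] = cpu  # later samples overwrite earlier ones
        return latest
```

The limited retention is the point: once a sample falls out of the window it is simply gone, which is why a sink is needed for anything long-term.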

metrics-server

Project page at https://github.com/kubernetes-incubator/metrics-server

The metrics-server is effectively an effort to standardize the second part of heapster's functionality, namely the grouped/aggregated REST API. It is an API exposing an in-memory datastore of the grouped REST cAdvisor API of every kubelet. It is still in beta (still in the incubator, in fact) and under development. The API it exposes is meant to be used by the Horizontal Pod Autoscaler and the scheduler in the future, and it is going to be built in and running by default (it already is in clusters brought up by kube-up.sh).

kube-state-metrics

Project page at https://github.com/kubernetes/kube-state-metrics

This is a relatively new project (started in May 2016). It differs from everything else described so far in that it is supposed to be a simple service that basically aggregates the API server's metrics and exposes state by object type. Object types are grouped [8]. It exposes a prometheus-compatible /metrics API endpoint (NOT TO BE CONFUSED with the same endpoint of the apiserver or the kubelet; they are disjoint).

Overall this looks like something we could use at some point in time, but it's not immediately required. It provides metrics for a high-level overview of the state of the clusters. It is designed to become a source for heapster/metrics-server at some point in time.

Kubernetes Monitoring architecture

The overall architecture is described in a (convoluted and difficult to understand, IMHO) document [9].

It describes, with not so great clarity, 2 monitoring metrics pipelines.

  • core metrics pipeline: It's Kubelets+metrics server+apiserver. This is the core, it's all about being used by core system stuff like scheduler and horizontal pod autoscaler. It should also be used by simple monitoring tools.

It also divides metrics into system and service metrics. Service metrics are explicitly defined in application code and exported by it. System metrics are the generic ones that are available from every monitored entity (CPU, memory, IO etc). Practically anything that is not a service metric is considered a system metric.

System metrics are subdivided into core and non-core metrics, and this is where the document is ultra confusing, as non-core metrics include core ones. This needs a bit more reading.

One worrying part is the comment "Kubelet, providing per-node/pod/container usage information (the current cAdvisor that is part of Kubelet will be slimmed down to provide only core system metrics)". Not sure what it means yet; it seems like this was implemented in 1.7.0 and later partly reverted (in 1.7.4).

Overall the point of the document is to define the 2 pipelines and say kubernetes will try to provide a good implementation of the 1st one while leaving the second to interested parties. Looks like that's the part we will have to implement, using prometheus.

[1] https://github.com/google/cadvisor/blob/master/docs/web.md
[2] https://github.com/google/cadvisor/blob/master/docs/api.md
[3] https://github.com/google/cadvisor/blob/master/docs/storage/README.md
[4] https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml
[5] https://github.com/kubernetes/heapster/blob/master/docs/model.md
[6] https://github.com/kubernetes/heapster/blob/master/docs/sink-owners.md
[7] https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
[8] https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation
[9] https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/monitoring_architecture.md

Event Timeline

akosiaris triaged this task as Medium priority. Oct 27 2017, 10:59 AM

Looks like a valid path forward is to expose the 3 /metrics endpoints, that is the API server's own and the kubelet's 2, the latter via the apiserver's proxying. https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml might very well come in handy. I'll work on figuring out the authn/authz part of this so we can create a working prometheus account and start polling. That work is tracked in T177393.
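
For reference, polling one of these endpoints by hand could look roughly like the Python sketch below. It assumes the prometheus account ends up authenticating with a bearer token, which is only an assumption at this point (the authn/authz part is not decided here). The host name, token and CA file path are hypothetical.

```python
import ssl
import urllib.request


def build_metrics_request(apiserver, path, token):
    """Prepare an authenticated GET for a /metrics path on the apiserver."""
    # Strip a trailing slash from the base URL only; the path stays untouched.
    url = apiserver.rstrip("/") + path
    request = urllib.request.Request(url)
    # Assumption: bearer token auth; swap for whatever authn is chosen.
    request.add_header("Authorization", "Bearer " + token)
    return request


def fetch_metrics(request, cafile):
    """Send the request, verifying the apiserver certificate against cafile."""
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical values; nothing is sent until fetch_metrics() is called.
    req = build_metrics_request(
        "https://kubemaster.example.org:6443", "/metrics", "REPLACE_ME"
    )
    print(req.full_url)
```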

akosiaris added a project: observability.
akosiaris moved this task from Inbox to In progress on the observability board.

Change 387551 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add k8s instance

https://gerrit.wikimedia.org/r/387551

Change 387551 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add k8s instance

https://gerrit.wikimedia.org/r/387551

Change 388505 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add jobs to scrape metrics from k8s

https://gerrit.wikimedia.org/r/388505

Change 388505 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add jobs to scrape metrics from k8s

https://gerrit.wikimedia.org/r/388505

Change 389929 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: Prometheus https access to k8s apiserver / node

https://gerrit.wikimedia.org/r/389929

Change 389930 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] profile: allow Prometheus to access k8s kubelet

https://gerrit.wikimedia.org/r/389930

I gave k8s discovery for Prometheus a try. The first blocker is that the Debian version of Prometheus doesn't include k8s discovery; the reason is the huge number of dependencies pulled in by the k8s client that would need to be packaged and maintained in Debian. I've temporarily switched the prometheus@k8s instance on prometheus1003 to stock Prometheus 1.8.2 to get unblocked. Once k8s discovery was available I was able to poll both the apiserver and the kubelet (modulo https://gerrit.wikimedia.org/r/389930 and https://gerrit.wikimedia.org/r/389929).

Change 389929 merged by Filippo Giunchedi:
[operations/puppet@production] role: Prometheus https access to k8s apiserver / node

https://gerrit.wikimedia.org/r/389929

Change 390264 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] prometheus: Force using read-only kubelet API

https://gerrit.wikimedia.org/r/390264

I've built a k8s-enabled deb from the Debian package and imported the repo into operations/debs/prometheus. I'll test and upload the new package (1.8.1+ds+k8s-1) and then upgrade the production Prometheus.

Change 390264 merged by Alexandros Kosiaris:
[operations/puppet@production] prometheus: Force using read-only kubelet API

https://gerrit.wikimedia.org/r/390264

Change 389930 merged by Alexandros Kosiaris:
[operations/puppet@production] profile: allow Prometheus to access k8s kubelet

https://gerrit.wikimedia.org/r/389930

Mentioned in SAL (#wikimedia-operations) [2017-11-13T09:44:31Z] <godog> test upgrade of prometheus 1.8.1 with k8s on prometheus2003 - T177395

Mentioned in SAL (#wikimedia-operations) [2017-11-14T09:43:35Z] <godog> upgrade prometheus to 1.8.1 with k8s on prometheus2004 - T177395

Mentioned in SAL (#wikimedia-operations) [2017-11-16T11:21:30Z] <godog> upgrade prometheus to 1.8.1+ds+k8s-1 in ulsfo/esams/eqiad - T177395

Change 397546 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add 3 prometheus checks for kubernetes

https://gerrit.wikimedia.org/r/397546

Change 397552 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add kubelet operational latencies check

https://gerrit.wikimedia.org/r/397552

Change 397546 merged by Alexandros Kosiaris:
[operations/puppet@production] Add 3 prometheus checks for kubernetes

https://gerrit.wikimedia.org/r/397546

Change 397552 merged by Alexandros Kosiaris:
[operations/puppet@production] Add kubelet operational latencies check

https://gerrit.wikimedia.org/r/397552

Change 397794 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Escape the exclamation mark in icinga k8s master checks

https://gerrit.wikimedia.org/r/397794

Change 397794 merged by Alexandros Kosiaris:
[operations/puppet@production] Escape the exclamation mark in icinga k8s master checks

https://gerrit.wikimedia.org/r/397794

And with the above merged, this is resolved.