
Design pod-level monitoring and service-level alerting
Closed, Resolved · Public

Description

While pods are routinely health-checked by kubernetes, we still don't have pod- or service-level monitoring. Exposed services can clearly be monitored using our standard icinga-based tools, but pods are not so easy, especially given their ephemeral nature. The same goes for non-exposed services, which are not reachable by anything outside the cluster. This task is about designing, and perhaps implementing, how we will do that.

So, we have already ended up implementing some of this. That was more or less expected, given the related effort in T177395. Note that the usual RFC 2119 conventions of MUST, MUST NOT, SHOULD, SHOULD NOT are used below.

Now to recap what we have implemented and designed.

Pods

Metrics are at https://grafana-admin.wikimedia.org/dashboard/db/kubernetes-pods?orgId=1. They cover:

Metrics

  • CPU
  • Memory
  • Number of containers/pod
  • IOPS
  • Pod execution latency
  • Container lifetime

Alerts

Specific alerts haven't been created on these yet, as we have no experience running services under kubernetes so far. We will wait some time to identify the abnormalities and then create alerts based on them.
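
As a purely illustrative sketch of the kind of alert that could be added once we know what "abnormal" looks like, assuming a Prometheus 2.x rule file and the kube-state-metrics restart counter (the metric, label names and thresholds depend on the exporters we actually deploy and are not decided here):

```
# Hypothetical alerting rule: fire if a container restarts more than 3 times
# in an hour. Threshold and metric/label names are placeholders.
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
        for: 30m
        labels:
          severity: warning
        annotations:
          description: 'Container {{ $labels.container }} in pod {{ $labels.pod }} is restarting repeatedly.'
```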

Probes

Probes are arbitrary, short-duration checks that run in the context of a container and are tied to an action taken by kubernetes. Two kinds of probes currently exist: liveness and readiness.

Liveness

Containers will be automatically restarted by kubernetes if they fail their liveness probe. In https://gerrit.wikimedia.org/r/#/c/392619/4/_scaffold/values.yaml the basic liveness probe is an HTTP GET request to /.

The URL can be overridden on a per-service basis, and services using service-runner are expected to define /info or /?spec. An endpoint that can be used as a liveness probe MUST exist.
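
For illustration, a minimal sketch of the corresponding livenessProbe stanza in a Kubernetes container spec; the port and timing values below are placeholders, not necessarily the scaffold's defaults:

```
# Sketch of a liveness probe in a container spec. The path is the scaffold
# default (/), which a service can override, e.g. with /info for
# service-runner based services. Port and timings are placeholders.
livenessProbe:
  httpGet:
    path: /
    port: 8080            # assumed container port
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3     # kubernetes restarts the container after 3 consecutive failures
```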

Readiness

If a pod fails a readiness probe, no traffic will be directed to it until it stops failing that probe. This allows a pod to inform kubernetes that it is overwhelmed and that traffic should be directed elsewhere.

In https://gerrit.wikimedia.org/r/#/c/392619/4/_scaffold/values.yaml the basic readiness probe is an HTTP GET request to /.

The URL can be overridden on a per-service basis, and services using service-runner are expected to define /info or /?spec. An endpoint that can be used as a readiness probe MUST exist. It's fine if that endpoint is the same as the liveness endpoint. Services that are able to detect when they are overloaded, however, SHOULD create and specify a dedicated readiness endpoint.
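
Again as a sketch only (placeholder port and timings), a readinessProbe stanza pointing at the same default path:

```
# Sketch of a readiness probe. The endpoint may be the same as the liveness
# one, but a service that can detect overload SHOULD point this at a
# dedicated readiness endpoint. Port and timings are placeholders.
readinessProbe:
  httpGet:
    path: /
    port: 8080            # assumed container port
  periodSeconds: 5
  failureThreshold: 2     # kubernetes stops routing traffic after 2 consecutive failures
```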

Services

Kubernetes services can be exposed in a variety of ways. In our environment, after some discussion, we decided that for now we will standardize on NodePort type services.

Below is a quick recap of the kubernetes service types, how they function, their usage at WMF, and how we intend to monitor them.

ClusterIP

Those are services that are meant to exist only intra-cluster. They are useful if practically all callers of a service are themselves in the kubernetes cluster. At least for now we won't have this type of service, as it is a) not easily monitored from outside the cluster and b) requires a critical mass of services in the kubernetes cluster. Service owners SHOULD NOT ask for this type of service.

NodePort

This is the type of service we will be going with. Effectively, for every service of this type a port is chosen on every node, and every node uses that port to publish the service. We will be using an LVS IP for every service (every node will have all LVS IPs, the same as for the rest of our infrastructure) in order to both decouple services from specific node IPs and avoid port conflicts.
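
An illustrative NodePort Service definition is below; the name, selector and ports are made-up examples rather than production values:

```
# Illustrative NodePort Service. Kubernetes opens nodePort on every node,
# and LVS balances the per-service IP across the nodes.
apiVersion: v1
kind: Service
metadata:
  name: example-service
spec:
  type: NodePort
  selector:
    app: example-service
  ports:
    - port: 8080          # service port inside the cluster
      targetPort: 8080    # container port
      nodePort: 30080     # port published on every node, fronted by LVS
```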

We expect to leverage the standard icinga infrastructure we currently have, using service-checker to monitor every such service in the standard way we monitor all non-kubernetes services. That maintains the status quo and allows us to move forward without extra disruption. All services SHOULD partially or fully conform to the service-checker contract.

The aforementioned contract is already implemented, but it would be nice to fully document it.
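
Until it is documented, here is a rough, non-authoritative illustration of the kind of swagger spec service-checker consumes, based on the service-template-node convention of fetching /?spec and replaying its x-amples; field names and structure may differ from the eventual documentation:

```
# Illustration only: a swagger spec fragment of the kind service-checker is
# expected to fetch from /?spec. The x-monitor and x-amples fields follow
# service-template-node conventions and may not exactly match the final
# documented contract.
swagger: '2.0'
info:
  title: example-service
  version: 0.0.1
paths:
  /_info:
    get:
      x-monitor: true       # include this endpoint in monitoring
      x-amples:             # example request/response pairs the checker replays
        - title: retrieve service info
          request: {}
          response:
            status: 200
```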

Headless services

Those are services that have no serviceIP assigned [1]. We won't alert on failures of such services, and their usage will be actively discouraged. Service owners SHOULD NOT ask for this type of service for now.

LoadBalancer

A very specific type of service, tightly bound to a cloud provider's load balancer (e.g. ELB). We will NOT be having this kind of service at all, ever, due to technical limitations (it was not designed for bare-metal use).

ExternalName

Practically a service that is a CNAME. Useful for internal service discovery. We will NOT be having this kind of service at all, at least for now.

[1] https://kubernetes.io/docs/concepts/services-networking/service/#headless-services