
[k8s,infra,alerting] improve HAproxy and k8s apiserver interaction
Closed, Resolved | Public

Description

In incident T367348 (Incident: 2024-06-12 toolforge k8s control plane) we discovered a failure case where the Kubernetes API server's port is open but the server is not responding to any API requests. In that case HAProxy still sees the failed servers as up and keeps sending traffic to them. The current HAProxy health check is a simple TCP check; I think it should be replaced by an HTTPS check against the /healthz endpoint.
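To illustrate the difference between the two kinds of check, here is a minimal Python sketch. It is not the HAProxy configuration from the Puppet change. The function names `tcp_check` and `healthz_check`, the timeouts, and the `use_tls` switch are assumptions made for the example. A server that accepts connections but never answers passes the first check and fails the second.

```python
import http.client
import socket


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection can be opened (what a plain TCP check sees)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthz_check(host, port, timeout=2.0, use_tls=True, context=None):
    """Return True only if GET /healthz answers with HTTP 200 within the timeout.

    use_tls and context are illustrative knobs: a real API server needs TLS
    and, depending on its settings, suitable CA or client certificates.
    """
    if use_tls:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=context)
    else:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/healthz")
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, TLS errors and malformed replies.
        return False
    finally:
        conn.close()
```

With a hung API server, `tcp_check` would typically still return True while `healthz_check` returns False, which is the mismatch described above.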

Also review that we have alerts for the case where the API servers are not healthy according to HAProxy.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: kubernetes: add some basic HAproxy alerts
Reference: repos/cloud/toolforge/alerts!15
Author: aborrero
Source Branch: arturo-220-kubernetes-add-some
Dest Branch: main

Event Timeline

aborrero triaged this task as High priority.
aborrero moved this task from Backlog to Next on the User-aborrero board.
taavi removed taavi as the assignee of this task. (Jun 13 2024, 9:39 AM)
taavi updated the task description.
taavi updated the task description.
dcaro renamed this task from "toolforge: improve HAproxy and k8s apiserver interaction" to "[k8s,infra,alerting] improve HAproxy and k8s apiserver interaction". (Jun 13 2024, 9:46 AM)
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.
dcaro subscribed.

Change #1047113 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: haproxy: use HTTP healthcheck for the k8s api-server

https://gerrit.wikimedia.org/r/1047113

Change #1047113 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: haproxy: check the k8s api-server /healthz endpoint

https://gerrit.wikimedia.org/r/1047113

aborrero lowered the priority of this task from High to Medium. (Jun 19 2024, 11:04 AM)
aborrero claimed this task.

Alert confirmed to work as expected:

[Screenshot: image.png (313×522 px, 50 KB)]