As part of https://phabricator.wikimedia.org/T130972#2178746, we realized there is no monitoring around this, so we need to set that up before we make it live for real users.
Description
Details
Related Changes in Gerrit:
| Status | Assigned | Task |
|---|---|---|
| Resolved | yuvipanda | T129309 Goal: Allow using k8s instead of GridEngine as a backend for webservices |
| Declined | None | T131929 Setup monitoring for kubernetes core components. |
| Resolved | yuvipanda | T140246 Monitor k8s flannel etcd health |
| Resolved | yuvipanda | T140247 Health check for k8s etcd |
| Resolved | bd808 | T140248 Check that all k8s nodes are in 'ready' condition |
| Open | None | T140249 [toolforge.infra] Run https://github.com/kubernetes/node-problem-detector on all our nodes |
| Declined | None | T140561 Monitor that not too many replicasets have a big difference between desired and current+pending |
| Declined | None | T142164 Build replacement for the webservice toolschecker test |
Event Timeline
The minimum required is just to check that:
- All the processes that should be running are running
- All the things that should be marked as ready are marked as ready
I'm not fully sure how to do this right now.
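One possible shape for the "marked as ready" half of this, as a rough sketch only: it assumes the input is the parsed JSON of a Kubernetes node list (for example what `kubectl get nodes -o json` returns) and relies on the standard `Ready` condition in each node's `status.conditions`. The function name `not_ready_nodes` is made up for illustration and is not taken from any of the patches on this task.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose 'Ready' condition is not 'True'.

    `node_list` is assumed to be a dict parsed from a Kubernetes NodeList
    JSON document. A node with no 'Ready' condition at all is treated as
    not ready.
    """
    failing = []
    for node in node_list.get("items", []):
        name = node.get("metadata", {}).get("name", "<unknown>")
        conditions = node.get("status", {}).get("conditions", [])
        # A node counts as ready only if it reports Ready with status "True".
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in conditions
        )
        if not ready:
            failing.append(name)
    return failing
```

An alerting check could then go critical whenever the returned list is non-empty and include the node names in its message.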
Change 297575 had a related patch set uploaded (by Yuvipanda):
tools: Add a check for k8s backed webservices
Change 297771 had a related patch set uploaded (by Yuvipanda):
tools: Fix k8s webservice backend check
Change 297774 had a related patch set uploaded (by Yuvipanda):
tools: Add icinga check for kubernetes webservice
This checks that the webservice can start and stop, which exercises the following:
- Master is reachable and responsive
- Docker registry is reachable and responsive
- There's enough capacity to schedule at least one web pod
- kube2proxy and the whole proxying system are reachable
That's a pretty OK, if convoluted, check!
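For illustration, a start/stop check of this kind could be structured roughly as below. This is a hedged sketch, not the code from the patches above: `check_webservice_cycle`, `wait_until`, and the three callables (`start`, `stop`, `is_serving`) are hypothetical names, and how a real check would start the webservice or probe it through the proxy is left to the caller.

```python
import time


def wait_until(predicate, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def check_webservice_cycle(start, stop, is_serving,
                           timeout=30.0, interval=1.0, sleep=time.sleep):
    """Start a test webservice, confirm it serves, stop it, confirm it is gone.

    All three callables are supplied by the caller (hypothetical hooks):
    `start` and `stop` would invoke the webservice tooling, and
    `is_serving` would make an HTTP request through the proxy.
    """
    start()
    try:
        came_up = wait_until(is_serving, timeout, interval, sleep)
    finally:
        # Always try to clean up, even if the service never came up.
        stop()
    if not came_up:
        return False
    return wait_until(lambda: not is_serving(), timeout, interval, sleep)
```

Passing the hooks in as callables keeps the polling logic testable without a cluster; a failure at any step (scheduling, image pull, proxy registration) shows up as the same `False` result, which is what makes the check broad but hard to diagnose from.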
At this point we have monitoring for this, but we still need to set up broader monitoring for Toolforge in general, which is not really captured by this ticket.