As part of https://phabricator.wikimedia.org/T130972#2178746 we realized there is no monitoring for this, so we need to set that up before making it live for real users.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | yuvipanda | T129309 Goal: Allow using k8s instead of GridEngine as a backend for webservices
Declined | None | T131929 Setup monitoring for kubernetes core components.
Resolved | yuvipanda | T140246 Monitor k8s flannel etcd health
Resolved | yuvipanda | T140247 Health check for k8s etcd
Resolved | bd808 | T140248 Check that all k8s nodes are in 'ready' condition
Open | None | T140249 [toolforge.infra] Run https://github.com/kubernetes/node-problem-detector on all our nodes
Declined | None | T140561 Monitor that not too many replicasets have a big difference between desired and current+pending
Declined | None | T142164 Build replacement for the webservice toolschecker test
Event Timeline
The minimum required is to check:
- All the processes that should be running are running
- All the things that should be marked as ready are marked as ready
Not fully sure how to do this yet.
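One possible shape for the "marked as ready" half of this check, as a minimal sketch only: it assumes the input is the parsed JSON from `kubectl get nodes -o json`, and the function name and the sample node names are hypothetical, not taken from any patch on this task.

```python
import json


def not_ready_nodes(node_list):
    """Return the names of nodes whose 'Ready' condition is not 'True'.

    `node_list` is assumed to be the parsed output of
    `kubectl get nodes -o json` (a dict with an "items" list).
    """
    failing = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        # A node counts as ready only if it reports Ready == "True";
        # "False", "Unknown", or a missing condition all count as failing.
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(name)
    return failing


# Hypothetical sample data, trimmed to the fields the check reads.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "worker-1"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "worker-2"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
  ]
}
""")

print(not_ready_nodes(SAMPLE))
```

A monitoring wrapper could alert whenever the returned list is non-empty. Checking that the expected processes are running would need a separate probe on each host.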
Change 297575 had a related patch set uploaded (by Yuvipanda):
tools: Add a check for k8s backed webservices
Change 297771 had a related patch set uploaded (by Yuvipanda):
tools: Fix k8s webservice backend check
Change 297774 had a related patch set uploaded (by Yuvipanda):
tools: Add icinga check for kubernetes webservice
This checks that the webservice can start and stop, which exercises the following:
- Master is reachable and responsive
- Docker registry is reachable and responsive
- There's enough capacity to schedule at least one web pod
- kube2proxy and the whole proxying system are reachable
That's a pretty OK, if convoluted, check!
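The start/stop cycle described here could be sketched roughly as follows. This is not the code from the patches above: the function name, its parameters, and the three callables (`start`, `stop`, `is_up`) are hypothetical stand-ins for whatever actually launches the webservice and probes it through the proxy.

```python
import time


def check_webservice_cycle(start, stop, is_up, timeout=60.0, interval=1.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Start a webservice, wait for it to answer, stop it, wait for it to go away.

    `start`, `stop` and `is_up` are caller-supplied callables (assumptions):
    `is_up()` should return True when the service answers through the proxy.
    Returns True only if the service came up and then went down in time.
    """
    def wait_for(expected):
        # Poll until is_up() matches `expected` or the timeout elapses.
        deadline = clock() + timeout
        while clock() < deadline:
            if is_up() == expected:
                return True
            sleep(interval)
        return is_up() == expected

    start()
    try:
        came_up = wait_for(True)
    finally:
        # Always try to stop, so a failed check does not leave a pod behind.
        stop()
    went_down = wait_for(False)
    return came_up and went_down


# Usage with in-memory fakes instead of a real cluster.
state = {"up": False}
ok = check_webservice_cycle(
    start=lambda: state.update(up=True),
    stop=lambda: state.update(up=False),
    is_up=lambda: state["up"],
    timeout=1.0,
    sleep=lambda seconds: None,
)
print(ok)
```

Injecting the callables keeps the polling logic testable without a cluster. A failure does not say which of the listed components broke, which is why the check is convoluted as a diagnostic.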
At this point we have monitoring in place. We still need a broader monitor for Toolforge in general, but that is not really captured by this ticket.