Explicitly define each 'service' that toollabs provides and the documented interfaces to it. Once this is done, we'll codify them into scripts that emit boolean values (up / down) and use those to monitor uptime.
- Bastion Hosts
  - SSH works
  - CPU / memory usage within sane levels, so interactive use is possible
- Grid engine
  - Job starts executing within X seconds
  - Continuous jobs execute continuously
  - Queues / jobs in an errored state don't exceed a threshold
  - Number of pending jobs stays under a threshold
- Webservices
  - Webservice starts in X seconds
  - Webservice stays up and keeps executing continuously as long as there is no developer error
  - Webservice is continuously available to the external world
- Cron services
  - Cron jobs run when they should
- Redis
  - Write an entry (appropriate timeout)
  - Read an entry just written (appropriate timeout)
- Tools DB
  - Write some data to it (appropriate timeout)
  - Read some data from it (appropriate timeout)
- LabsDB
  - Replag (replication lag) is 0
  - Read queries execute reasonably quickly (set up a stable set of 'reference' queries, and measure their output)
  - Write access on user databases (appropriate timeout)
- NFS (Homes)
  - Read a random file (appropriate timeout)
  - Write a random file (appropriate timeout)
- NFS (Dumps)
  - Read a random file (appropriate timeout)
  - New dumps available within X minutes / hours of them appearing on dumps.wikimedia.org
- Misc
  - Tools home page is available and showing sensible things
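
As a rough illustration of what "scripts that emit boolean values" could look like, here is a minimal Python sketch using only the standard library. All names in it (`run_check`, `tcp_port_open`, `file_round_trip`) are hypothetical and not part of any existing toollabs tooling. `run_check` turns any probe into up / down with a timeout. `tcp_port_open` is only a weak stand-in for "SSH works", since it shows that the port accepts connections, not that a login succeeds. `file_round_trip` approximates the NFS write-then-read checks under the assumption that the check may create and delete a temporary file in the target directory.

```python
import os
import socket
import tempfile
import threading


def run_check(probe, timeout_s):
    """Run probe() and reduce the outcome to a boolean: True = up, False = down.

    An exception, a falsy return value, or not finishing within timeout_s
    all count as down. The probe runs in a daemon thread, so a hung probe
    is abandoned rather than killed (assumption: acceptable for a monitor).
    """
    result = {"ok": False}

    def target():
        try:
            result["ok"] = bool(probe())
        except Exception:
            result["ok"] = False

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]


def tcp_port_open(host, port, timeout_s=5.0):
    """Probe: True if a TCP connection to host:port can be opened.

    Raises OSError on failure, which run_check reports as down.
    """
    with socket.create_connection((host, port), timeout=timeout_s):
        return True


def file_round_trip(directory):
    """Probe: write a random payload to a new file in directory and read it back."""
    payload = os.urandom(32)
    fd, path = tempfile.mkstemp(dir=directory, prefix="uptime-check-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        with open(path, "rb") as handle:
            return handle.read() == payload
    finally:
        os.remove(path)
```

A check script could then print `up` or `down` from something like `run_check(lambda: file_round_trip("/some/nfs/dir"), timeout_s=5.0)`, where the directory and the timeout are placeholders to be chosen per service. The other items in the list (Redis, Tools DB, grid engine) would need their own probes, but could reuse the same `run_check` wrapper.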