The quarterly goal for the Labs team is 99.5% provable uptime for toollabs. I'd like Catchpoint to calculate that figure, and to use it as the single source of numbers.
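For a sense of scale, here is a rough sketch of the downtime budget that a 99.5% target implies. The 90-day quarter and the function name are assumptions for illustration only; the numbers that count would still come from Catchpoint.

```python
def downtime_budget_hours(target_pct: float, days: int = 90) -> float:
    """Hours of downtime allowed in a window of `days` while still meeting target_pct.

    `days=90` is an assumed quarter length, not something this task defines.
    """
    total_hours = days * 24
    return total_hours * (1 - target_pct / 100)


# Budget for the 99.5% goal over the assumed 90-day quarter.
print(round(downtime_budget_hours(99.5), 2))
```

Under that assumption the budget comes to roughly 10.8 hours per quarter.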
Plan being:
A Labs instance will host an HTTP server that does a very specific check to determine whether a 'service' (as defined in T93622) is up. Example: tools.wmflabs.org/canary/nfs/home. This will be hosted outside of the regular webservice workflow and have as few dependencies as possible.
We set up Catchpoint API / object checks to hit these endpoints and calculate metrics.
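A minimal sketch of what such a checker could look like, using only the Python standard library to keep dependencies low. Everything here is an assumption for illustration: the routing table, the placeholder check, the 200 / 503 status convention and the port are not part of the plan above, and the real service may be built quite differently.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_placeholder() -> bool:
    """Hypothetical stand-in; a real check would exercise NFS, Redis, etc."""
    return True


# Hypothetical routing table: URL path -> function returning True when the service is up.
CHECKS: Dict[str, Callable[[], bool]] = {
    "/canary/nfs/home": check_placeholder,
}


def run_check(path: str) -> Tuple[int, str]:
    """Map a request path to an (HTTP status, body) pair."""
    check = CHECKS.get(path)
    if check is None:
        return 404, "unknown check"
    try:
        ok = check()
    except Exception:
        # A crashing check is reported as a failed service, not a server error.
        ok = False
    return (200, "OK") if ok else (503, "FAIL")


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = run_check(self.path)
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port: int = 8080) -> HTTPServer:
    """Build the server without starting it; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), CheckHandler)
```

Returning a plain status code per endpoint keeps the external monitor's job simple: it only has to record whether each URL answered 200 within its timeout.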
Things that'll go through the tools checker service:
NFS is available and writeable / readable
Redis is available and writeable / readable
A new job submitted to the grid (precise) starts executing within 10s
A new job submitted to the grid (trusty) starts executing within 10s
Starting a webservice (lighttpd) takes less than 10s until it is able to serve requests
Starting a webservice on a generic host (nodejs / uwsgi-python) takes less than 10s until it is able to serve requests
Cron runs as it should
Writing to all three labsdb replicas works
Reading from all three labsdb replicas works
Reading from and writing to tools-db works
NFS dumps are readable
Self checks for the canary service itself.
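Two patterns recur in the list above: a write / read round trip (NFS, Redis, databases) and "X happens within 10s" (grid jobs, webservice start). The sketch below illustrates both with hypothetical helpers. The function names, the `.canary-` file prefix and the polling interval are assumptions, and the file-based check only stands in for whatever each real backend needs.

```python
import os
import time
import uuid
from typing import Callable


def check_read_write(directory: str) -> bool:
    """Write a random token into `directory`, read it back, then clean up."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, f".canary-{token}")
    try:
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(token)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the write to storage
        with open(path, "r", encoding="utf-8") as fh:
            return fh.read() == token
    except OSError:
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass  # nothing to clean up if the write never succeeded


def wait_until(predicate: Callable[[], bool],
               timeout: float = 10.0,
               interval: float = 0.5) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A grid or webservice check could then submit the job or start the service and pass a "has it started?" predicate to `wait_until` with the 10s limit, reporting failure if it returns False.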