The quarterly goal for the Labs team is 99.5% provable uptime for toollabs, so I'd like to have Catchpoint calculate that and use it as the single source of numbers.
Plan being:
A Labs instance will host an HTTP server that runs a very specific check to determine whether a 'service' (as defined in T93622) is up. Example: tools.wmflabs.org/canary/nfs/home. This will be hosted outside of the regular webservice workflow and have as few dependencies as possible.
We set up Catchpoint API / object checks to hit these endpoints and calculate metrics.
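To make the plan concrete, here is a minimal sketch of what such a check server could look like, using only the Python standard library to keep dependencies low. The `CHECKS` registry, the `check` decorator, the `/self` path, the port, and the 200/503 convention are all assumptions for illustration, not a description of the actual service.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical registry mapping URL paths to check functions.
CHECKS = {}


def check(path):
    """Register a function as the check served at `path`."""
    def decorator(func):
        CHECKS[path] = func
        return func
    return decorator


@check("/self")
def self_check():
    # Trivial check: if this runs at all, the checker process is up.
    return True


def run_check(path):
    """Return (HTTP status, body) for the check registered at `path`."""
    func = CHECKS.get(path)
    if func is None:
        return 404, "NOT FOUND"
    try:
        ok = bool(func())
    except Exception:
        # A check that raises is treated as the service being down.
        ok = False
    return (200, "OK") if ok else (503, "FAIL")


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = run_check(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8000):
    """Build the server; call .serve_forever() on the result to run it."""
    return HTTPServer(("", port), CheckHandler)
```

Running `make_server().serve_forever()` would expose each registered check as its own URL, so an external monitor only needs to look at the HTTP status code to decide up or down.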
Things that'll go through the tools checker service:
~~NFS Is available and writeable / readable~~
~~Redis is available and writeable / readable~~
~~Submitting a new job to the grid (precise) has it executing within 10s~~
~~Submitting a new job to the grid (trusty) has it executing within 10s~~
Starting a webservice (lighttpd) takes less than 10s until it is able to serve
Starting a webservice in generic host (nodejs / uwsgi-python) takes less than 10s until it is able to serve
~~Cron runs as it should~~
~~Writing to all three labsdb replicas works~~
~~Reading all three labsdb replicas works~~
~~Reading and Writing to tools-db works~~
~~NFS dumps are readable~~
~~Self checks for the canary service itself.~~
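As an illustration of what one of the checks above might do, here is a rough sketch of a "writeable / readable" test for a directory such as an NFS mount. The function name and the `canary-` file prefix are hypothetical. One caveat worth keeping in mind: on NFS the read-back may be served from the client's cache, so this probably proves less than a read from a second host would.

```python
import os
import uuid


def check_dir_read_write(directory):
    """Return True if a file can be written to and read back from `directory`."""
    # A random token avoids collisions between concurrent checks.
    token = uuid.uuid4().hex
    path = os.path.join(directory, "canary-" + token)
    try:
        with open(path, "w") as f:
            f.write(token)
        with open(path) as f:
            return f.read() == token
    except OSError:
        # Any I/O failure means the directory is not usable.
        return False
    finally:
        # Best-effort cleanup so check files do not accumulate.
        try:
            os.remove(path)
        except OSError:
            pass
```

A function like this returns a plain boolean, so it can be served directly as an up/down endpoint.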