Change Details

The quarterly goal for the Labs team is to have 99.5% provable uptime for toollabs. So I'd like to have Catchpoint calculate that and use it as the single source of numbers.

Plan: a Labs instance will host an HTTP server that does a very specific check to determine whether a 'service' (as defined in T93622) is up or not. Example: tools.wmflabs.org/canary/nfs/home. This will be hosted outside of the regular webservice workflow and have as few dependencies as possible. We set up Catchpoint API / object checks to hit these endpoints and calculate metrics.

Things that'll go through the tools checker service:

* ~~NFS is available and writeable / readable~~
* ~~Redis is available and writeable / readable~~
* ~~Submitting a new job to the grid (precise) has it executing within 10s~~
* ~~Submitting a new job to the grid (trusty) has it executing within 10s~~
* ~~Starting a webservice (lighttpd) takes less than 10s until it is able to serve~~
* ~~Starting a webservice in generic host (nodejs / uwsgi-python) takes less than 10s until it is able to serve~~
* ~~Cron runs as it should~~
* ~~Writing to all three labsdb replicas works~~
* ~~Reading all three labsdb replicas works~~
* ~~Reading and writing to tools-db works~~
* ~~NFS dumps are readable~~
* ~~Self checks for the canary service itself~~
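To make the plan above concrete, here is a rough sketch of what such a checker could look like, using only the Python standard library to keep dependencies minimal. It is not the actual implementation. The names `check_read_write`, `make_handler`, and `make_server` are hypothetical, as are the `200 OK` / `503 FAIL` response convention and the idea of mapping `/canary/nfs/home` to a write-then-read test on a directory.

```python
import os
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_read_write(directory):
    """Return True if a file can be written to and read back from `directory`."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, "canary-" + token)
    try:
        with open(path, "w") as handle:
            handle.write(token)
        with open(path) as handle:
            return handle.read() == token
    except OSError:
        return False
    finally:
        # Best-effort cleanup; the file may never have been created.
        try:
            os.remove(path)
        except OSError:
            pass


def make_handler(checks):
    """Build a request handler that maps URL paths to check functions."""

    class CanaryHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            check = checks.get(self.path)
            if check is None:
                status, body = 404, b"unknown check\n"
            else:
                try:
                    ok = check()
                except Exception:
                    # A crashing check counts as a failed check.
                    ok = False
                status, body = (200, b"OK\n") if ok else (503, b"FAIL\n")
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Keep the sketch quiet; a real service would likely log requests.
            pass

    return CanaryHandler


def make_server(checks, host="127.0.0.1", port=0):
    """Create (but do not start) an HTTP server; port 0 picks a free port."""
    return HTTPServer((host, port), make_handler(checks))


# Example wiring (the /home path is an assumption, not taken from the task):
# server = make_server({"/canary/nfs/home": lambda: check_read_write("/home")},
#                      host="0.0.0.0", port=8080)
# server.serve_forever()
```

With this shape, the external monitor only needs to look at the HTTP status code of each endpoint, and adding a new check (Redis, grid jobs, labsdb replicas) would mean adding one entry to the `checks` dictionary.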