
Set up a tools checker service that can check all internal services for availability
Closed, Resolved, Public

Description

The Labs team's quarterly goal is to have 99.5% provable uptime for toollabs. I'd like to have Catchpoint calculate that figure and use it as the single source of numbers.

Plan being:

A Labs instance will host an HTTP server that runs a very specific check to determine whether a 'service' (as defined in T93622) is up or not. Example: tools.wmflabs.org/canary/nfs/home. This will be hosted outside of the regular webservice workflow and have as few dependencies as possible.
We set up Catchpoint API / object checks to hit these endpoints and calculate metrics.
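A minimal sketch of what such an endpoint server might look like, using only the Python standard library to keep dependencies low. The registry, the `check` decorator, the `run_check` helper, the status codes, and the port are all assumptions for illustration, not the actual toolschecker code; the `/nfs/home` check here is only a placeholder.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical registry mapping URL paths to check functions.
CHECKS = {}


def check(path):
    """Register a function as the check served at `path`."""
    def decorator(func):
        CHECKS[path] = func
        return func
    return decorator


@check("/nfs/home")
def nfs_home_check():
    # Placeholder check: only confirms the home directory can be listed.
    os.listdir(os.path.expanduser("~"))
    return True


def run_check(path):
    """Return (status_code, body) for a request path."""
    func = CHECKS.get(path)
    if func is None:
        return 404, "unknown check\n"
    try:
        ok = bool(func())
    except Exception:
        # Any error inside a check counts as the service being down.
        ok = False
    return (200, "OK\n") if ok else (503, "FAIL\n")


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = run_check(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(host="127.0.0.1", port=8080):
    """Build the HTTP server; host and port are assumed values."""
    return HTTPServer((host, port), CheckHandler)
```

Calling `make_server().serve_forever()` would start it. Returning a non-200 status on failure lets an external monitor treat the HTTP status alone as the up/down signal.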

Things that'll go through the tools checker service

NFS is available and writeable / readable
Redis is available and writeable / readable
Submitting a new job to the grid (precise) has it executing within 10s
Submitting a new job to the grid (trusty) has it executing within 10s
Starting a webservice (lighttpd) takes less than 10s until it is able to serve
Starting a webservice in generic host (nodejs / uwsgi-python) takes less than 10s until it is able to serve
Cron runs as it should
Writing to all three labsdb replicas works
Reading all three labsdb replicas works
Reading and writing to tools-db works
NFS dumps are readable
Self checks for the canary service itself
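Two patterns recur in the list above: a write / read-back round trip, and "X happens within 10s". A rough sketch of both follows, assuming a scratch directory on the filesystem under test. The function names, the `.canary-` file prefix, and the polling interval are hypothetical and not taken from the actual checks.

```python
import os
import time
import uuid


def read_write_check(directory):
    """Write a unique token to a scratch file, read it back, then delete it."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, ".canary-" + token)
    try:
        with open(path, "w") as handle:
            handle.write(token)
        with open(path) as handle:
            return handle.read() == token
    except OSError:
        # Missing directory, read-only mount, stale handle, and so on.
        return False
    finally:
        # Best-effort cleanup; the file may never have been created.
        try:
            os.remove(path)
        except OSError:
            pass


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A grid or webservice check could then submit the job (or start the service) and call `wait_until` with a predicate that looks for the job's output file or a successful HTTP response.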

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda subscribed.

Change 208067 had a related patch set uploaded (by Yuvipanda):
tools: Add a toolschecker role / module / endpoint

https://gerrit.wikimedia.org/r/208067

Change 208067 merged by Yuvipanda:
tools: Add a toolschecker role / module / endpoint

https://gerrit.wikimedia.org/r/208067

Krinkle set Security to None.
Krinkle subscribed.

Metrics will be made available at http://p.catchpoint.com/ui/Entry/PD/V/A.RNP-Ov-jSUbDu8Jdg/ErLK as I add them. Note that this is all still WIP.

NFS, redis, lighttpd - precise, lighttpd - trusty, lighttpd uwsgi-python tests done :D

Change 208880 had a related patch set uploaded (by Yuvipanda):
tools: Add check for long running precise / trusty jobs

https://gerrit.wikimedia.org/r/208880

Change 208880 merged by Yuvipanda:
tools: Add check for long running precise / trusty jobs

https://gerrit.wikimedia.org/r/208880

valhallasw triaged this task as Medium priority. (May 10 2015, 8:03 PM)
valhallasw subscribed.

Change 238863 had a related patch set uploaded (by Andrew Bogott):
Added tests for grid job submission.

https://gerrit.wikimedia.org/r/238863

Change 238960 had a related patch set uploaded (by Andrew Bogott):
Add check for /public/dumps

https://gerrit.wikimedia.org/r/238960

Change 238960 merged by Andrew Bogott:
Add check for /public/dumps

https://gerrit.wikimedia.org/r/238960

Andrew updated the task description.

Change 239182 had a related patch set uploaded (by Andrew Bogott):
Add a read/write/delete check for tools-db

https://gerrit.wikimedia.org/r/239182

Change 238863 merged by Andrew Bogott:
Added tests for grid job submission.

https://gerrit.wikimedia.org/r/238863

Change 239182 merged by Andrew Bogott:
Add a read/write/delete check for tools-db

https://gerrit.wikimedia.org/r/239182

Change 239196 had a related patch set uploaded (by Andrew Bogott):
Toolschecker: Fix the test url for the toolsdb check

https://gerrit.wikimedia.org/r/239196

Change 239196 merged by Andrew Bogott:
Toolschecker: Fix the test url for the toolsdb check

https://gerrit.wikimedia.org/r/239196

I could use guidance for the remaining tasks:

  • Starting a webservice (lighttpd) takes less than 10s until it is able to serve
  • Starting a webservice in generic host (nodejs / uwsgi-python) takes less than 10s until it is able to serve

Can you point me to sample code for a trivial webservice I can use for this?

  • Cron runs as it should

I assume you don't mean normal system cron but our weird meta-cron. Is it enough to verify that something is running, or do we need to test the creation/replication across multiple hosts?

  • Writing to all three labsdb replicas works

Tools only write to tools-db, right? Do we need write checks on 1001-1005?
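For the cron question, the simpler "is something running" variant could be sketched as a freshness check: a cron entry touches a marker file periodically, and the checker confirms the file was modified recently. The function name, marker path, and age threshold are all hypothetical, and this does not cover creation/replication across hosts.

```python
import os
import time


def cron_marker_is_fresh(marker_path, max_age_seconds=300, now=None):
    """True if the marker file was modified within the last `max_age_seconds`."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(marker_path)
    except OSError:
        # A missing marker means cron has never run (or the path is wrong).
        return False
    return (now - mtime) <= max_age_seconds
```

The threshold would need to be somewhat larger than the cron interval so that a single delayed run does not register as an outage.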

Change 239438 had a related patch set uploaded (by Andrew Bogott):
Added a test for toollabs cron.

https://gerrit.wikimedia.org/r/239438

Change 239438 merged by Andrew Bogott:
Added a test for toollabs cron.

https://gerrit.wikimedia.org/r/239438

Change 239183 had a related patch set uploaded (by Andrew Bogott):
toolschecker: read/write test for labsdb1004

https://gerrit.wikimedia.org/r/239183

Change 239183 merged by Andrew Bogott:
toolschecker: read/write test for labsdb1004

https://gerrit.wikimedia.org/r/239183

Andrew updated the task description.