Setup a tools checker service that can check all internal services for availability
Closed, Resolved, Public

Description

The quarterly goal for the Labs team is 99.5% provable uptime for toollabs, so I'd like catchpoint to calculate that number and serve as the single source for it.
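To make the 99.5% target concrete, here is a rough sketch of the arithmetic. The 91-day quarter and the function names are assumptions for illustration only; this is not how catchpoint computes its numbers.

```python
def downtime_budget_hours(target, days):
    """Hours of downtime allowed over `days` while still meeting `target` uptime."""
    return (1.0 - target) * days * 24


def measured_uptime(results):
    """Fraction of successful probes, given a list of True/False check results."""
    return sum(1 for ok in results if ok) / len(results)


def meets_target(results, target=0.995):
    """True if the observed probe success ratio is at least the target."""
    return measured_uptime(results) >= target


# Assumed quarter length of 91 days: roughly 10.9 hours of allowed downtime.
budget = downtime_budget_hours(0.995, 91)
```

With probes every few minutes, a single failed probe out of 200 still sits exactly at the target, so the probe interval sets how finely the figure can be measured.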

Plan being:

A Labs instance will host an HTTP server that runs a very specific check to determine whether a 'service' (as defined in T93622) is up. Example: tools.wmflabs.org/canary/nfs/home. It will be hosted outside the regular webservice workflow and have as few dependencies as possible.
We set up catchpoint API / object checks to hit these endpoints and calculate metrics.
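A minimal sketch of what such an endpoint could look like, using only the Python standard library to keep dependencies low. The `/nfs/home` path mirrors the example above, but the checked directory (the system temp dir), the port, and all function names are placeholders, not the merged toolschecker code.

```python
import os
import tempfile
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dir_read_write(directory):
    """Write a unique token to a scratch file, read it back, then delete it."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, "canary-" + token)
    try:
        with open(path, "w") as f:
            f.write(token)
        with open(path) as f:
            return f.read() == token
    except OSError:
        return False
    finally:
        if os.path.exists(path):
            os.remove(path)


# Hypothetical mapping of URL paths to checks; the directory is a placeholder.
CHECKS = {
    "/nfs/home": lambda: check_dir_read_write(tempfile.gettempdir()),
}


def run_check(url_path, checks=CHECKS):
    """Return (HTTP status, body) so the monitor only has to look at the status."""
    check = checks.get(url_path)
    if check is None:
        return 404, "unknown check"
    try:
        ok = check()
    except Exception:
        ok = False  # a crashing check counts as a failed check
    return (200, "OK") if ok else (503, "FAIL")


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = run_check(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8080):
    """Run the checker until interrupted (host and port are assumptions)."""
    HTTPServer((host, port), CheckHandler).serve_forever()
```

Calling `serve()` starts the server; an external monitor would then request one path per check and treat any non-200 response as downtime.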

Things that'll go through the tools checker service

~~NFS Is available and writeable / readable~~
~~Redis is available and writeable / readable~~
~~A new job submitted to the grid (precise) starts executing within 10s~~
~~A new job submitted to the grid (trusty) starts executing within 10s~~
~~A newly started webservice (lighttpd) is able to serve within 10s~~
~~A newly started webservice in generic host (nodejs / uwsgi-python) is able to serve within 10s~~
~~Cron runs as it should~~
~~Writing to all three labsdb replicas works~~
~~Reading all three labsdb replicas works~~
~~Reading and Writing to tools-db works~~
~~NFS dumps are readable~~
~~Self checks for the canary service itself.~~
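Several of the checks above have the form "X happens within 10s". One way to express that is a small polling helper with a deadline, sketched below. The helper name, the polling interval, and the idea of passing the condition as a callable are assumptions for illustration; the merged patches may do this differently.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds have passed.

    Returns True if the condition was met in time, False otherwise.
    Uses a monotonic clock so wall-clock adjustments do not skew the deadline.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A grid-job check could, for example, submit a job that touches a marker file and then call `wait_until(lambda: os.path.exists(marker_path))`, where `marker_path` is a hypothetical path chosen by the check.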

Details

Subject	Repo	Branch	Lines +/-
toolschecker: read/write test for labsdb1004	operations/puppet	production	+30 -6
Added a test for toollabs cron.	operations/puppet	production	+12 -0
Toolschecker: Fix the test url for the toolsdb check	operations/puppet	production	+1 -1
Add a read/write/delete check for tools-db	operations/puppet	production	+27 -0
Added tests for grid job submission.	operations/puppet	production	+39 -0
Add check for /public/dumps	operations/puppet	production	+12 -0
tools: Add check for long running precise / trusty jobs	operations/puppet	production	+20 -0
tools: Add a toolschecker role / module / endpoint	operations/puppet	production	+117 -0


Related Objects

Status	Assigned	Task
Resolved	yuvipanda	T105720 Labs team reliability goal for Q1 2015/16
Resolved	Andrew	T107058 Have catchpoint checks for all labs services
Resolved	Andrew	T97748 Setup a tools checker service that can check all internal services for availability

Event Timeline

yuvipanda created this task.May 1 2015, 4:30 AM

yuvipanda raised the priority of this task from to Needs Triage.

yuvipanda updated the task description.

yuvipanda added projects: Toolforge, ToolLabs-Goals-Q4.

yuvipanda subscribed.

Restricted Application added a subscriber: Aklapper. May 1 2015, 4:30 AM

Change 208067 had a related patch set uploaded (by Yuvipanda):
tools: Add a toolschecker role / module / endpoint

https://gerrit.wikimedia.org/r/208067

gerritbot added a project: Patch-For-Review.May 1 2015, 4:30 AM

Change 208067 merged by Yuvipanda:
tools: Add a toolschecker role / module / endpoint

https://gerrit.wikimedia.org/r/208067

yuvipanda mentioned this in rOPUP7fcb8efc1f49: tools: Add a toolschecker role / module / endpoint.May 1 2015, 4:39 AM

Krinkle updated the task description. May 1 2015, 5:31 AM

Krinkle set Security to None.

Krinkle subscribed.

Metrics will be made available as I add them on http://p.catchpoint.com/ui/Entry/PD/V/A.RNP-Ov-jSUbDu8Jdg/ErLK. Note that this is all still WIP.

NFS, redis, lighttpd - precise, lighttpd - trusty, lighttpd uwsgi-python tests done :D

yuvipanda mentioned this in T93622: Explicitly define all the services that Tool Labs provides and their interfaces.May 4 2015, 10:44 PM

Change 208880 had a related patch set uploaded (by Yuvipanda):
tools: Add check for long running precise / trusty jobs

https://gerrit.wikimedia.org/r/208880

Change 208880 merged by Yuvipanda:
tools: Add check for long running precise / trusty jobs

https://gerrit.wikimedia.org/r/208880

yuvipanda mentioned this in rOPUPddc95e011c66: tools: Add check for long running precise / trusty jobs.May 5 2015, 10:33 PM

yuvipanda mentioned this in T97321: Add catchall tests for toollabs to catchpoint.May 8 2015, 12:41 AM

valhallasw triaged this task as Medium priority.May 10 2015, 8:03 PM

valhallasw subscribed.

valhallasw moved this task from Backlog to Ready to be worked on on the Toolforge board.May 10 2015, 8:43 PM

yuvipanda moved this task from Backlog to Measurement / Monitoring on the ToolLabs-Goals-Q4 board.May 10 2015, 9:52 PM

yuvipanda added a parent task: T107058: Have catchpoint checks for all labs services.Jul 27 2015, 10:40 PM

Ricordisamoa subscribed.Jul 27 2015, 11:08 PM

Restricted Application added a project: Cloud-Services. Jul 27 2015, 11:08 PM

yuvipanda updated the task description. Sep 14 2015, 9:24 PM

Change 238863 had a related patch set uploaded (by Andrew Bogott):
Added tests for grid job submission.

https://gerrit.wikimedia.org/r/238863

Change 238960 had a related patch set uploaded (by Andrew Bogott):
Add check for /public/dumps

https://gerrit.wikimedia.org/r/238960

Andrew added a project: Labs-Sprint-114.Sep 17 2015, 2:59 PM

Andrew subscribed.

Change 238960 merged by Andrew Bogott:
Add check for /public/dumps

https://gerrit.wikimedia.org/r/238960

Andrew mentioned this in rOPUP0a08fcea90a0: Add check for /public/dumps.Sep 17 2015, 3:10 PM

Andrew moved this task from To do to Doing on the Labs-Sprint-114 board.Sep 17 2015, 3:13 PM

Andrew updated the task description.

Change 239182 had a related patch set uploaded (by Andrew Bogott):
Add a read/write/delete check for tools-db

https://gerrit.wikimedia.org/r/239182

Change 238863 merged by Andrew Bogott:
Added tests for grid job submission.

https://gerrit.wikimedia.org/r/238863

Andrew mentioned this in rOPUP2c1b51d500df: Added tests for grid job submission..Sep 17 2015, 8:02 PM

Change 239182 merged by Andrew Bogott:
Add a read/write/delete check for tools-db

https://gerrit.wikimedia.org/r/239182

Andrew mentioned this in rOPUPe76acf2ca0ff: Add a read/write/delete check for tools-db.Sep 17 2015, 8:06 PM

Change 239196 had a related patch set uploaded (by Andrew Bogott):
Toolschecker: Fix the test url for the toolsdb check

https://gerrit.wikimedia.org/r/239196

Change 239196 merged by Andrew Bogott:
Toolschecker: Fix the test url for the toolsdb check

https://gerrit.wikimedia.org/r/239196

Andrew mentioned this in rOPUP45ce5f69afef: Toolschecker: Fix the test url for the toolsdb check.Sep 17 2015, 8:19 PM

Andrew updated the task description. Sep 17 2015, 8:29 PM

I could use guidance for the remaining tasks:

Starting a webservice (lighttpd) takes less than 10s to it being able to serve
Starting a webservice in generic host (nodejs / uwsgi-python) takes less than 10s to it being able to serve

Can you point me to sample code for a trivial webservice I can use for this?

Cron runs as it should

I assume you don't mean normal system cron but our weird meta-cron. Is it enough to verify that something is running, or do we need to test the creation/replication across multiple hosts?

~~Writing to all three labsdb replicas works~~

~~Tools only write to tools-db, right? Do we need write checks on 1001-1005?~~

Andrew updated the task description. Sep 17 2015, 10:15 PM

Andrew moved this task from Doing to Code Review/Blocked on the Labs-Sprint-114 board.

Change 239438 had a related patch set uploaded (by Andrew Bogott):
Added a test for toollabs cron.

https://gerrit.wikimedia.org/r/239438

Change 239438 merged by Andrew Bogott:
Added a test for toollabs cron.

https://gerrit.wikimedia.org/r/239438

Andrew mentioned this in rOPUP7ba44a095b2c: Added a test for toollabs cron..Sep 18 2015, 6:53 PM

Andrew updated the task description. Sep 18 2015, 6:58 PM

Andrew claimed this task.Sep 18 2015, 7:46 PM

Andrew added a project: Labs-Sprint-115.Sep 21 2015, 5:05 PM

Change 239183 had a related patch set uploaded (by Andrew Bogott):
toolschecker: read/write test for labsdb1004

https://gerrit.wikimedia.org/r/239183

Change 239183 merged by Andrew Bogott:
toolschecker: read/write test for labsdb1004

https://gerrit.wikimedia.org/r/239183

Andrew mentioned this in rOPUP3a37f7b88c74: toolschecker: read/write test for labsdb1004.Sep 21 2015, 6:49 PM

Andrew moved this task from To do to Code Review/Blocked on the Labs-Sprint-115 board.Sep 21 2015, 8:52 PM

Andrew closed this task as Resolved.Sep 22 2015, 7:39 PM

Andrew updated the task description.

Andrew moved this task from Code Review/Blocked to Done on the Labs-Sprint-114 board.Sep 24 2015, 4:24 PM

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:51 PM
