Explicitly define all the services that Tool Labs provides and their interfaces
Closed, Resolved · Public

Assigned To

Authored By

	yuvipanda
	Mar 23 2015, 6:23 PM

Description

Explicitly define each 'service' toollabs provides, and what the documented interfaces to them are. Once this is done, we'll codify them into scripts that emit boolean values (up / down), and use those to monitor uptime.

Bastion Hosts
1. SSH works
2. CPU / Memory usage within sane levels, so interactive use is possible
Grid engine
1. Job starts executing within X seconds
2. Continuous jobs execute continuously
3. Queues / Jobs in Errored state don't exceed threshold
4. Number of pending jobs stays under a threshold
Webservices
1. Webservice starts in X seconds
2. Webservice stays up as long as there is no developer error, executing continuously
3. Webservice available to the external world continuously
Cron services
1. Crons run when they should
Redis
1. Write an entry (appropriate timeout)
2. Read an entry just written (appropriate timeout)
Tools DB
1. Write some data to it (appropriate timeout)
2. Read some data from it (appropriate timeout)
LabsDB
1. Replag is 0
2. Read queries execute reasonably quickly (set up a stable set of 'reference' queries, and measure their output)
3. Write access on user databases (appropriate timeout)
NFS (Homes)
1. Read random file (appropriate timeout)
2. Write random file (appropriate timeout)
NFS (Dumps)
1. Read random file (appropriate timeout)
2. New dumps available within X minutes / hours of them appearing on dumps.wikimedia.org
Misc
1. Tools home page is available and showing sensible things
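
As an illustration of the "scripts that emit boolean values" idea, here is a minimal sketch of one such check: a write-then-read probe with a timeout, as listed under NFS (Homes). The names `run_check` and `write_then_read_file` are hypothetical, as is the choice of a temporary directory as a stand-in for an NFS mount. The same wrapper could presumably be applied to the Redis or Tools DB checks by swapping in a different probe function.

```python
import concurrent.futures
import os
import tempfile
import uuid


def run_check(check, timeout_seconds):
    """Return True only if check() returns a truthy value within the timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return bool(future.result(timeout=timeout_seconds))
    except Exception:
        # A timeout and any error raised by the check both count as "down".
        return False
    finally:
        # Do not block on a hung check; the worker thread finishes on its own.
        pool.shutdown(wait=False)


def write_then_read_file(directory):
    """Write a random token to a new file, read it back, and compare."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, "check-" + token)
    try:
        with open(path, "w") as handle:
            handle.write(token)
        with open(path) as handle:
            return handle.read() == token
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # Assumption: the system temp directory stands in for the NFS mount here.
    target = tempfile.gettempdir()
    print("up" if run_check(lambda: write_then_read_file(target), 5) else "down")
```

Treating every exception as "down" keeps the output strictly boolean, at the cost of hiding why a check failed; a real checker would probably log the reason separately.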

Related Objects

Mentioned In: T97748: Setup a tools checker service that can check all internal services for availability
T90535: Define expected service level agreement for tools
Mentioned Here: T97748: Setup a tools checker service that can check all internal services for availability

Event Timeline

yuvipanda created this task. Mar 23 2015, 6:23 PM

yuvipanda raised the priority of this task to Needs Triage.

yuvipanda updated the task description.

yuvipanda added a project: Toolforge.

yuvipanda subscribed.

Restricted Application added a subscriber: Aklapper. Mar 23 2015, 6:23 PM

(moved to description)

^ what I could think of. Others? please add in comments.

Job start delays are probably a poor metric to measure unless you mean a specific, test job with well-constrained requirements. One of the things gridengine does by design is delay jobs that request more resources unless there is plenty of spare capacity.

@coren Oh yeah, totally. The way to measure a lot of these would be with references - have a reference test job, reference sql queries, reference webservice, etc, and test them.
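
A rough sketch of the reference-based approach, assuming each reference (test job, SQL query, webservice request) can be wrapped in a Python callable. `measure_reference` and its parameters are hypothetical names, not an existing tool:

```python
import time


def measure_reference(reference, max_seconds, clock=time.monotonic):
    """Run a reference operation and report (is_up, elapsed_seconds).

    The service counts as up only if the reference completes without
    raising and takes no longer than max_seconds.
    """
    start = clock()
    try:
        reference()
    except Exception:
        return False, clock() - start
    elapsed = clock() - start
    return elapsed <= max_seconds, elapsed


if __name__ == "__main__":
    # Placeholder reference: a real one would submit the well-constrained
    # test job, run a reference query, or fetch the reference webservice.
    is_up, elapsed = measure_reference(lambda: time.sleep(0.1), max_seconds=1.0)
    print(is_up, round(elapsed, 2))
```

Because the reference is fixed and has well-constrained requirements, changes in the measured time should reflect the service rather than the workload.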

yuvipanda added a project: ToolLabs-Goals-Q4. Mar 25 2015, 8:01 PM

Plausible additions:
Grid Engine: check that the number of errored out jobs is under some threshold
Grid Engine: check that the number of pending jobs is under some threshold
Webservices: the admin service is running and the status and list page return sensible results

yuvipanda updated the task description. Mar 25 2015, 8:04 PM

yuvipanda updated the task description.

yuvipanda mentioned this in T90535: Define expected service level agreement for tools. Mar 25 2015, 9:25 PM

yuvipanda moved this task from Backlog to Measurement / Monitoring on the ToolLabs-Goals-Q4 board. Mar 25 2015, 9:36 PM

yuvipanda added a project: Labs-Q4-Sprint-1. Mar 28 2015, 8:26 PM

(I have updated description to match)

Next step would be to create tasks for each service and figure out how we are going to track historical data. I am considering just putting them on graphite to start with...
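
If the results do go to graphite, each up / down value could be recorded as a 1 or 0 using Carbon's plaintext protocol, where every line is `<metric path> <value> <timestamp>`. This is only a sketch: the metric path `tools.checks.redis` and the host name are made-up placeholders, and the function names are hypothetical.

```python
import socket
import time


def format_graphite_line(metric_path, is_up, timestamp=None):
    """Build one Carbon plaintext line: metric path, value, Unix timestamp."""
    if timestamp is None:
        timestamp = time.time()
    return "%s %d %d\n" % (metric_path, 1 if is_up else 0, int(timestamp))


def send_to_graphite(line, host, port=2003):
    """Send one line to a Carbon listener (2003 is its default plaintext port)."""
    with socket.create_connection((host, port), timeout=5) as connection:
        connection.sendall(line.encode("ascii"))


if __name__ == "__main__":
    line = format_graphite_line("tools.checks.redis", True)
    print(line, end="")
    # send_to_graphite(line, "graphite.example.org")  # placeholder host
```

Storing booleans as 0 / 1 lets uptime over a period be read off as the average of the series.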

yuvipanda moved this task from Backlog to Doing on the Labs-Q4-Sprint-1 board. Apr 1 2015, 12:15 AM

scfc triaged this task as Medium priority. Apr 6 2015, 8:20 AM

scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.

yuvipanda mentioned this in T97748: Setup a tools checker service that can check all internal services for availability. May 1 2015, 4:30 AM

T97748 and T97610 are the ones left to do. This is already 'defined'.

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:53 PM

Restricted Application added a project: Cloud-Services. Jun 7 2017, 6:53 PM
