Explicitly define all the services that Tool Labs provides and their interfaces
Closed, Resolved
Public

Description

Explicitly define each 'service' that Tool Labs provides and the documented interfaces to each. Once this is done, we'll codify them into scripts that emit boolean values (up / down) and use those to monitor uptime.

  1. Bastion Hosts
    1. SSH works
    2. CPU / Memory usage within sane levels, so interactive use is possible
  2. Grid engine
    1. Job starts executing within X seconds
    2. Continuous jobs execute continuously
    3. Queues / Jobs in Errored state don't exceed threshold
    4. Number of pending jobs is under some threshold
  3. Webservices
    1. Webservice starts in X seconds
    2. Webservice stays up as long as there is no developer error, executing continuously
    3. Webservice available to the external world continuously
  4. Cron services
    1. Crons run when they should
  5. Redis
    1. Write an entry (appropriate timeout)
    2. Read an entry just written (appropriate timeout)
  6. Tools DB
    1. Write some data to it (appropriate timeout)
    2. Read some data from it (appropriate timeout)
  7. LabsDB
    1. Replag is 0
    2. Read queries execute reasonably quickly (set up a stable set of 'reference' queries, and measure their output)
    3. Write access on user databases (appropriate timeout)
  8. NFS (Homes)
    1. Read random file (appropriate timeout)
    2. Write random file (appropriate timeout)
  9. NFS (Dumps)
    1. Read random file (appropriate timeout)
    2. New dumps available within X minutes / hours of them appearing on dumps.wikimedia.org
  10. Misc
    1. Tools home page is available and showing sensible things
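
Several of the checks above share one shape: do something with an appropriate timeout and report up / down. A minimal sketch of that shape, assuming Python and using a write-then-read file check as a stand-in for the NFS (Homes) item. The names `run_check` and `write_then_read` are hypothetical, and the scratch directory is a local temporary one rather than a real NFS mount.

```python
import concurrent.futures
import os
import tempfile
import uuid


def run_check(check, timeout_seconds):
    """Return True (up) if check() returns a truthy value in time, else False (down)."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        return bool(future.result(timeout=timeout_seconds))
    except Exception:
        # A timeout and any error raised by the check both count as "down".
        return False
    finally:
        # A hung check cannot be killed from here; we only stop waiting for it.
        executor.shutdown(wait=False)


def write_then_read(directory):
    """Write a random token to a new file in directory, read it back, compare."""
    token = uuid.uuid4().hex
    path = os.path.join(directory, "check-" + token)
    try:
        with open(path, "w") as handle:
            handle.write(token)
        with open(path) as handle:
            return handle.read() == token
    finally:
        # Clean up the probe file whether or not the check succeeded.
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the filesystem under test.
    with tempfile.TemporaryDirectory() as scratch:
        up = run_check(lambda: write_then_read(scratch), timeout_seconds=5)
    print("up" if up else "down")
```

The same `run_check` wrapper could in principle wrap a Redis or database round trip; those clients are left out here because their exact interfaces are not specified in this task.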

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Toolforge.
yuvipanda subscribed.

(moved to description)

yuvipanda set Security to None.
yuvipanda added subscribers: coren, scfc, Magnus.

^ what I could think of. Others? Please add in comments.

Job start delays are probably a poor metric to measure unless you mean a specific test job with well-constrained requirements. One of the things gridengine does by design is delay jobs that request more resources unless there is plenty of spare capacity.

@coren Oh yeah, totally. The way to measure a lot of these would be with references - have a reference test job, reference sql queries, reference webservice, etc, and test them.

Plausible additions:
Grid Engine: check that the number of errored out jobs is under some threshold
Grid Engine: check that the number of pending jobs is under some threshold
Webservices: the admin service is running and the status and list page return sensible results

yuvipanda updated the task description.

(I have updated description to match)

Next step would be to create tasks for each service and figure out how we are going to track historical data. I am considering just putting them on graphite to start with...
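
If graphite is used for the historical data, each up / down result could be recorded as a 1 or 0 through Carbon's plaintext protocol, which takes one `path value timestamp` line per data point. A rough sketch, assuming Python; the metric path `toollabs.checks.redis` and the host name are made-up placeholders, not existing names.

```python
import socket
import time


def format_metric(path, up, timestamp=None):
    """Render one check result as a Graphite plaintext-protocol line."""
    if timestamp is None:
        timestamp = time.time()
    # Value is 1 for up and 0 for down; the timestamp is whole seconds since the epoch.
    return "%s %d %d\n" % (path, 1 if up else 0, int(timestamp))


def send_metric(line, host, port=2003, timeout_seconds=5):
    """Send one line to a Carbon plaintext listener over TCP (2003 is the usual default port)."""
    with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
        conn.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical metric path; only formats the line, does not contact any server.
    print(format_metric("toollabs.checks.redis", True), end="")
    # To actually send it, something like:
    # send_metric(format_metric("toollabs.checks.redis", True), "graphite.example.org")
```

Storing 1 / 0 rather than a string keeps the series easy to average into an uptime percentage over any time window.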

scfc triaged this task as Medium priority. Apr 6 2015, 8:20 AM
scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.
yuvipanda claimed this task.

T97748 and T97610 are the ones left to do. This is already 'defined'.