
jenkins / zuul backing up due to jenkins slaves down
Closed, ResolvedPublic

Description

There are a few Jenkins slaves down at the moment, due, I believe, to one or two cloudvirt hosts being down. Among others, integration-castor03 is down, which seems to be causing all jobs that depend on castor-save-workspace-cache to wait indefinitely.

2019-02-13-151717_911x414_scrot.png (414×911 px, 51 KB)

See also: https://integration.wikimedia.org/ci/job/castor-save-workspace-cache/ and https://integration.wikimedia.org/ci/computer/

Event Timeline

Ugh. For the moment (while integration-castor03 is down) I've modified castor-save-workspace-cache to be a no-op (exit 0) and run on nodes labeled blubber (of which there are 13).

The queue seems to have cleared fairly quickly.

Currently castor-save-workspace-cache is used to archive the cache for jobs so that future jobs will be faster, so it's not 100% critical that it function. I've run into this before (attempting to safe-restart Jenkins and having jobs hang indefinitely since they are unable to start new jobs). Ideally, the castor-save-workspace job would have some timeout while waiting for job execution to be scheduled. Unsure how to implement that offhand. Adding @dduvall to see if he has thoughts or magic to offer.

thcipriani lowered the priority of this task from High to Medium. Feb 13 2019, 6:55 PM

Built a new integration-castor and undid my dirty hacks to the castor-save-* jobs. Lowering priority but leaving open: we need a way to ensure that non-essential parts of a build (like saving the cache) will alert ci-folks, but not stop ci from working.
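One way to picture "alert, but don't block" is a small wrapper around the non-essential step. The following is only a minimal sketch in Python, not how the castor jobs are implemented: `run_non_essential` and the `alert` callback are hypothetical names, and it assumes the step is a plain command run inside the build rather than a separately triggered Jenkins job.

```python
import subprocess
import sys


def run_non_essential(cmd, timeout_s, alert):
    """Run cmd as a best-effort step.

    On timeout or failure, report through alert(message) and return False
    instead of raising, so the calling build can carry on.
    """
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        alert(f"non-essential step timed out after {timeout_s}s: {cmd}")
        return False
    except (subprocess.CalledProcessError, OSError) as exc:
        alert(f"non-essential step failed: {exc}")
        return False
    return True


# Illustration only: a command that exits non-zero stands in for a failed
# cache save; the alert here just writes to stderr.
saved = run_non_essential(
    [sys.executable, "-c", "raise SystemExit(1)"],
    timeout_s=5,
    alert=lambda msg: print(msg, file=sys.stderr),
)
```

The caller decides what `alert` does (log line, IRC ping, email); the build result is left untouched either way.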

hashar assigned this task to thcipriani.
hashar added a subscriber: hashar.

We can resolve this task since Tyler did the emergency action. The castor-save job could not be triggered due to the lack of a Jenkins agent to run on, which in turn caused a lot of jobs to be blocked while waiting for the job to complete.

Indeed, most jobs in gate-and-submit end up triggering the castor-save job. That should be refactored; there are a few possibilities:

  • add a --contimeout to rsync when doing the save
  • find out whether we can timeout a job when there are no agents available (I doubt it is possible)
  • run the save as a post-merge job
  • refresh the cache by polling the SCM
  • by using a local caching proxy instead of our homemade rsync based storage ( T147635 )
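For the first possibility, a rough sketch of what the rsync invocation could look like, written in Python. `--contimeout` (daemon connection timeout) and `--timeout` (I/O timeout) are real rsync options, but the function name, host, module, and path layout below are made up for illustration and are not taken from the actual castor setup. Note that `--contimeout` only applies when talking to an rsync daemon, and it would not by itself help a job that is stuck waiting for an agent.

```python
def build_save_command(cache_dir, host, module, job_name,
                       connect_timeout_s=10, io_timeout_s=60):
    """Build an rsync command that gives up quickly if the daemon is unreachable.

    All arguments are hypothetical placeholders; the timeouts are arbitrary
    defaults chosen for the example.
    """
    source = cache_dir.rstrip("/") + "/"  # trailing slash: copy the contents
    dest = f"rsync://{host}/{module}/{job_name}/"
    return [
        "rsync",
        "--archive",
        "--compress",
        f"--contimeout={connect_timeout_s}",  # fail fast on connect
        f"--timeout={io_timeout_s}",          # abort on a stalled transfer
        source,
        dest,
    ]


cmd = build_save_command("/tmp/cache", "castor.example.org", "caches", "some-job")
```

The resulting list can be passed to `subprocess.run`, ideally with a non-fatal wrapper so that a failed save is reported without failing the build.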