jenkins / zuul backing up due to jenkins slaves down
Closed, ResolvedPublic

Assigned To

Authored By

	fgiunchedi
	Feb 13 2019, 2:20 PM

Description

There are a few Jenkins slaves down at the moment, due, I believe, to one or two cloudvirt hosts being down. Among others, integration-castor03 is down, which seems to be causing all jobs depending on castor-save-workspace-cache to wait indefinitely.

2019-02-13-151717_911x414_scrot.png (414×911 px, 51 KB)

Related Objects

Status	Assigned	Task
Resolved	thcipriani	T216039 jenkins / zuul backing up due to jenkins slaves down
Resolved	hashar	T216244 Don't hardcode castor url in castor docker container
Resolved	jeena	T213806 Migrate wikimedia-portals-build to Docker container
Resolved	Jdforrester-WMF	T237479 Update the wikimedia-portals repo's CI/linting code for various security issues
Resolved	Jdrewniak	T247996 Fix issues with Gulp 4 migration
Resolved	Addshore	T210286 Create docker based CI job to build the wikidata-query-gui
Declined	None	T192006 wdqs-frontend docker image should (BLUBBER) rebuild automatically when a new patch is pushed to master
Resolved	Addshore	T209206 Wikidata Query GUI (wikidata/query/gui) fails tests on initial clone on Debian
Declined	None	T209292 Create a wmf production ready nginx image

Event Timeline

fgiunchedi created this task. Feb 13 2019, 2:20 PM

Restricted Application added a subscriber: Aklapper. Feb 13 2019, 2:20 PM

WMDE-Fisch subscribed. Feb 13 2019, 2:23 PM

elukey subscribed. Feb 13 2019, 2:26 PM

fgiunchedi triaged this task as High priority. Feb 13 2019, 2:29 PM

Ugh. For the moment (while integration-castor03 is down) I've modified castor-save-workspace-cache to be a no-op (exit 0) and run on nodes labeled blubber (of which there are 13).

The queue seems to have cleared fairly quickly.

Currently castor-save-workspace-cache is used to archive the cache for jobs so that future jobs will be faster, so it's not 100% critical that it function. I've run into this before (attempting to safe-restart Jenkins and having jobs hang indefinitely since they are unable to start new jobs). Ideally, the castor-save-workspace job would have some timeout when waiting for job execution to be scheduled. Unsure how to implement that offhand. Adding @dduvall to see if he has thoughts or magic to offer.

This is T216030 https://lists.wikimedia.org/pipermail/cloud/2019-February/000538.html

xSavitar subscribed. Feb 13 2019, 4:39 PM

Built a new integration-castor and undid my dirty hacks to the castor-save-* jobs. Lowering priority but leaving open: we need a way to ensure that non-essential parts of a build (like saving the cache) will alert ci-folks, but not stop ci from working.
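One way to read that requirement, as a minimal sketch rather than anything CI actually runs: wrap the non-essential step so that a failure or a hang is logged for follow-up but never propagates to the build. The function name `run_best_effort`, the logger name, and the 60 second default are all hypothetical.

```python
import logging
import subprocess
import sys

log = logging.getLogger("ci.best_effort")  # hypothetical logger name


def run_best_effort(cmd, timeout_s=60):
    """Run a non-essential step; report problems but never raise.

    Returns True if the command exited 0 within timeout_s seconds,
    False otherwise.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        log.warning("step timed out after %ss: %s", timeout_s, cmd)
        return False
    except OSError as exc:
        # For example, the executable does not exist on this node.
        log.warning("step could not start (%s): %s", exc, cmd)
        return False
    if result.returncode != 0:
        log.warning("step exited %d: %s", result.returncode, cmd)
        return False
    return True


# Example: a failing step is reported and the caller carries on.
ok = run_best_effort([sys.executable, "-c", "raise SystemExit(3)"])
```

This only covers a step run inline by the build. It does not by itself bound how long a separately triggered Jenkins job waits in the queue for an agent.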

We can resolve this task since Tyler did the emergency action. The castor-save job could not be triggered due to the lack of a Jenkins agent to run on, which in turn caused a lot of jobs to be blocked while waiting for that job to complete.

Indeed, most jobs in gate-and-submit end up triggering the castor-save job. That should be refactored; there are a few possibilities:

add a --contimeout to rsync when doing the save
find out whether we can timeout a job when there are no agents available (I doubt it is possible)
do the save as a post-merge job
refresh the cache by polling the SCM
use a local caching proxy instead of our homemade rsync-based storage (T147635)
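The first option could look roughly like the sketch below. `--contimeout` is an rsync option that bounds the connection attempt to an rsync daemon, and `--timeout` bounds I/O stalls once connected. The function name, host, module path, and timeout values are placeholders, not the real castor configuration, and the sketch assumes the save goes to an rsync daemon URL.

```python
def build_cache_save_cmd(src_dir, dest_url, connect_timeout_s=10, io_timeout_s=60):
    """Build an rsync command that gives up instead of hanging.

    connect_timeout_s maps to --contimeout (daemon connection timeout),
    io_timeout_s maps to --timeout (I/O timeout during the transfer).
    """
    return [
        "rsync",
        "--archive",
        f"--contimeout={connect_timeout_s}",
        f"--timeout={io_timeout_s}",
        # Trailing slash: copy the directory's contents, not the directory itself.
        src_dir.rstrip("/") + "/",
        dest_url,
    ]


# Hypothetical destination; the real castor URL is not given in this task.
cmd = build_cache_save_cmd("/tmp/cache", "rsync://castor.example/caches/some-job/")
```

These options bound the transfer itself, which is separate from how long the save job waits for an agent to run on.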

hashar mentioned this in T216053: Jenkins failing everything due to npm being screwed up. Feb 20 2019, 2:32 PM

hashar closed subtask T216244: Don't hardcode castor url in castor docker container as Resolved. Mar 17 2022, 2:31 PM
