There are a few Jenkins agents down at the moment, due to (I believe) one or two cloudvirt hosts being down. Among others, integration-castor03 is down, which seems to be causing all jobs that depend on castor-save-workspace-cache to wait indefinitely.
| Status | Assignee | Task |
|---|---|---|
| Resolved | thcipriani | T216039 jenkins / zuul backing up due to jenkins slaves down |
| Open | None | T216244 Don't hardcode castor url in castor docker container |
| Resolved | jeena | T213806 Migrate wikimedia-portals-build to Docker container |
| Resolved | Jdforrester-WMF | T237479 Update the wikimedia-portals repo's CI/linting code for various security issues |
| Resolved | Jdrewniak | T247996 Fix issues with Gulp 4 migration |
| Resolved | Addshore | T210286 Create docker based CI job to build the wikidata-query-gui |
| Declined | None | T192006 wdqs-frontend docker image should (BLUBBER) rebuild automatically when a new patch is pushed to master |
| Resolved | Addshore | T209206 Wikidata Query GUI (wikidata/query/gui) fails tests on initial clone on Debian |
| Declined | None | T209292 Create a wmf production ready nginx image |
Ugh. For the moment (while integration-castor03 is down) I've modified castor-save-workspace-cache to be a no-op (exit 0) and to run on nodes labeled blubber (of which there are 13).
The queue seems to have cleared fairly quickly.
Currently castor-save-workspace-cache archives the cache for jobs so that future jobs will be faster, so it's not 100% critical that it function. I've run into this before (attempting to safe-restart Jenkins and having jobs hang indefinitely because they are unable to start new jobs). Ideally, the castor-save-workspace job would have some timeout while waiting to be scheduled for execution. I'm unsure how to implement that offhand. Adding @dduvall to see if he has thoughts or magic to offer.
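One workaround would be to check, before triggering the save job, whether any online agent actually carries the required label, and skip the save otherwise. A minimal sketch against the Jenkins REST API — the `/computer/api/json` endpoint and its `tree` filter are standard Jenkins, but the helper names, the label, and the URL are hypothetical:

```python
import json
import urllib.request


def label_has_online_agent(computers, label):
    """Return True if any non-offline agent carries the given label.

    `computers` is the "computer" list from Jenkins' /computer/api/json.
    """
    for node in computers:
        if node.get("offline"):
            continue
        if any(l.get("name") == label for l in node.get("assignedLabels", [])):
            return True
    return False


def fetch_computers(jenkins_url):
    # Ask only for the fields we need; tree-filter syntax is standard Jenkins API.
    url = (jenkins_url + "/computer/api/json"
           "?tree=computer[displayName,offline,assignedLabels[name]]")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["computer"]


# Usage sketch (hypothetical URL and label):
# if not label_has_online_agent(fetch_computers("https://ci.example.org"), "castor"):
#     print("no online castor agent; skipping cache save")
```

This only prevents new hangs at trigger time; it does not unblock jobs already queued behind an offline agent.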
Built a new integration-castor and undid my dirty hacks to the castor-save-* jobs. Lowering priority but leaving open: we need a way to ensure that non-essential parts of a build (like saving the cache) will alert ci-folks, but not stop ci from working.
We can resolve this task since Tyler took the emergency action. The castor-save job could not be triggered due to the lack of a Jenkins agent to run on, which in turn caused a lot of jobs to be blocked while waiting for it to complete.
Indeed, most jobs in gate-and-submit end up triggering the castor-save job. That should be refactored; there are a few possibilities:
- add a --contimeout to rsync when doing the save
- find out whether we can time out a job when no agents are available (I doubt it is possible)
- save the cache in a post-merge job instead
- refresh the cache by polling the SCM
- use a local caching proxy instead of our homemade rsync-based storage (T147635)
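The first bullet can also be combined with a hard wall-clock timeout, so a down castor host makes the save a no-op instead of blocking the build. A hedged sketch — the host, paths, durations, and function name are made up, and note that rsync's `--contimeout` only applies when talking to an rsync daemon:

```python
import subprocess


def save_cache(src="cache/",
               dest="rsync://castor.example.org/caches/some-job",  # hypothetical
               conn_timeout=10, overall_timeout=300):
    """Best-effort cache save: never fail the build over a down castor host."""
    cmd = ["rsync", "-a", f"--contimeout={conn_timeout}", src, dest]
    try:
        # subprocess.run kills rsync and raises TimeoutExpired past the deadline.
        subprocess.run(cmd, timeout=overall_timeout, check=True,
                       stderr=subprocess.DEVNULL)
        return True
    except (subprocess.TimeoutExpired,
            subprocess.CalledProcessError,
            FileNotFoundError):
        # Saving the cache is an optimization, so log-and-continue
        # rather than letting a dead storage host wedge CI.
        return False
```

The design choice here is that the caller always gets a boolean back; the Jenkins job itself can then `exit 0` regardless and emit an alert when the save was skipped.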