
Create a controlled and ongoing CI pipeline test job that we can alert on
Open, Needs Triage · Public

Description

Something in the spirit of https://gerrit.wikimedia.org/r/#/c/336413/ that would allow us to reason about outages/issues that are not obvious from watching markers of ongoing ad hoc operations such as https://gerrit.wikimedia.org/r/#/c/335373/.

I would think something like a scheduled submission to the pipeline, where the "vote" is a response back to the initiating mechanism reporting success or failure of some playbook of automated test conditions. Possibly run directly from the Jenkins server itself.

Event Timeline

chasemp created this task. Feb 14 2017, 1:43 PM
Restricted Application added a subscriber: Aklapper. Feb 14 2017, 1:43 PM
hashar added a subscriber: hashar. Feb 16 2017, 7:34 PM

OK, here is the random crap idea.

In Nodepool, define a new label attached to an image, with the number of ready instances set to zero.

Craft a job on Jenkins that runs on some non-Nodepool slave. Make it spawn a child job that runs on the label defined above. Have the parent time out after X minutes and report back to IRC, email, or whatnot.
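To make the parent/child idea concrete, here is a minimal Python sketch of the parent's watchdog loop. It is only an illustration under assumptions: `watch_child_build`, `trigger` and `poll` are hypothetical names, and how the child build is actually started and queried (Jenkins API, plugin, shell step) is left to the caller.

```python
import time
from typing import Callable, Optional


def watch_child_build(
    trigger: Callable[[], None],
    poll: Callable[[], Optional[str]],
    timeout_s: float,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Trigger the child build, then poll until it finishes or the deadline passes.

    `trigger` and `poll` are hypothetical callables supplied by the caller:
    `trigger` starts the child job, `poll` returns None while the child is
    still queued or running and a result string once it has completed.
    """
    trigger()
    deadline = clock() + timeout_s
    while True:
        result = poll()
        if result is not None:
            return result  # e.g. "SUCCESS" or "FAILURE", as reported by poll
        if clock() >= deadline:
            return "TIMEOUT"  # no instance showed up or the build hung
        sleep(interval_s)
```

Whatever string comes back is what the parent would report to IRC or email; "TIMEOUT" is the interesting case, since it means no instance was provided or the child never finished within X minutes.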

The gotcha: if the job is triggered by the Jenkins scheduler, I don't think Nodepool will notice there is demand for the associated label. Nodepool checks the demand by asking the Zuul gearman server. Gotta verify that part.

Tested. A job scheduled directly by Jenkins bypasses the Zuul gearman server as expected, and thus Nodepool cannot see that there is demand for it. The build completion does emit an event over ZeroMQ, and Nodepool garbage collects the instance as expected.
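A rough way to picture why the demand goes unnoticed: the sketch below is a simplified model, not Nodepool's actual code. It assumes demand is derived from the gearman admin `status` output (tab-separated function name, total, running, available workers) and that label-bound builds appear as functions named `build:<job>:<label>`. The job and label names are made up.

```python
from collections import Counter


def label_demand(gearman_status: str) -> Counter:
    """Count queued (not yet running) builds per label from gearman status text.

    Simplified model: only functions shaped like build:<job>:<label> count.
    """
    demand = Counter()
    for line in gearman_status.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skips blank lines and the terminating "." line
        function, total, running, _workers = fields
        parts = function.split(":")
        if len(parts) == 3 and parts[0] == "build":
            queued = int(total) - int(running)
            if queued > 0:
                demand[parts[2]] += queued
    return demand


# Hypothetical names: a build requested through Zuul shows up in gearman ...
via_zuul = "build:canary-job:ci-canary\t1\t0\t0\n.\n"
# ... while one started by the Jenkins scheduler never reaches gearman.
via_jenkins_timer = ".\n"

print(label_demand(via_zuul)["ci-canary"])           # 1
print(label_demand(via_jenkins_timer)["ci-canary"])  # 0
```

With the ready count at zero and no queued function for the label, nothing in this model would ever ask for an instance, which matches what the test showed.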

Plan B: get Zuul to run the job on a timer. But that is a bit scarier.

Change 338179 had a related patch set uploaded (by Hashar):
(WIP) Timed build from Zuul

https://gerrit.wikimedia.org/r/338179

https://gerrit.wikimedia.org/r/338179 is an implementation of plan B. The commit message explains it all, but I don't think it is going to work properly.

Change 338179 abandoned by Hashar:
(WIP) Timed build from Zuul

https://gerrit.wikimedia.org/r/338179

hashar moved this task from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.