
Create a controlled and ongoing CI pipeline test job that we can alert on
Closed, Declined (Public)

Description

Something in the spirit of https://gerrit.wikimedia.org/r/#/c/336413/ that would allow us to reason about outages/issues that are not obvious from watching markers of ongoing ad hoc operations such as https://gerrit.wikimedia.org/r/#/c/335373/.

I would think something like a scheduled submission to the pipeline, where the "vote" is a response back to the initiating mechanism reporting success or failure of some playbook of automated test conditions. Possibly run directly from the Jenkins server itself.

Event Timeline

OK, here is the random crap idea.

In Nodepool, define a new label attached to an image, with the number of ready instances set to zero.

Craft a job on Jenkins that runs on some non-Nodepool slave. Make it spawn a child job that runs on the label defined above. Have the parent time out after X minutes and report back to IRC, email, or whatnot.
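The parent job's "wait, then time out and report" step could be sketched roughly as below. This is only an illustration under assumptions: `watch_child_build`, `poll_result` and `format_report` are hypothetical names, and how the child build's status is actually fetched from Jenkins, or how the message reaches IRC or email, is deliberately left out.

```python
import time
from typing import Callable, Optional


def watch_child_build(
    poll_result: Callable[[], Optional[str]],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a child build until it finishes or the deadline passes.

    poll_result is a hypothetical callable that returns None while the
    child build is still queued or running, and a status string such as
    "SUCCESS" or "FAILURE" once it has completed.
    """
    deadline = clock() + timeout_s
    while True:
        result = poll_result()
        if result is not None:
            return result
        if clock() >= deadline:
            # The child never completed in time, e.g. because no
            # instance was ever spawned for its label.
            return "TIMEOUT"
        sleep(interval_s)


def format_report(status: str) -> str:
    """Build the one-line message the parent would send out (hypothetical)."""
    return f"CI canary build finished with status: {status}"
```

The clock and sleep functions are parameters only so the loop can be exercised without real waiting; a parent job would call it with the defaults and X minutes converted to seconds.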

The gotcha: if the job is triggered by the Jenkins scheduler, I don't think Nodepool will notice there is demand for the associated label. Nodepool checks the demand by asking the Zuul gearman server. Gotta verify that part.

Tested. A job scheduled directly by Jenkins bypasses the Zuul gearman server as expected, and thus Nodepool can't see that there is demand for it. The build completion does emit an event over ZeroMQ, and Nodepool garbage collects the instance as expected.

Plan B: get Zuul to run the job on a timer. But that is a bit scarier.

Change 338179 had a related patch set uploaded (by Hashar):
(WIP) Timed build from Zuul

https://gerrit.wikimedia.org/r/338179

https://gerrit.wikimedia.org/r/338179 is an implementation of plan B. The commit message explains it all, but I don't think it is going to work properly.

Change 338179 abandoned by Hashar:
(WIP) Timed build from Zuul

https://gerrit.wikimedia.org/r/338179

This task was notably meant to ensure Nodepool was working reliably. At the time, Chase went with an end-to-end test on the WMCS infra to ensure that images were bootable, plus some monitoring on that side, which addressed the concern.

The infra has changed since then, and we at least have some alerts when the queue grows. That is sufficient.