Create a controlled and ongoing CI pipeline test job that we can alert on
Closed, Declined · Public

Assigned To

None

Authored By

	• chasemp
	Feb 14 2017, 1:43 PM

Description

Something in the spirit of https://gerrit.wikimedia.org/r/#/c/336413/ that would allow us to reason about outages/issues that are not obvious from watching markers of ongoing ad hoc operations such as https://gerrit.wikimedia.org/r/#/c/335373/.

I would think something like a scheduled submission to the pipeline, where the "vote" is a response back to the initiating mechanism reporting success or failure of some playbook of automated test conditions. Possibly run directly from the Jenkins server itself.

Details

	Subject	Repo	Branch	Lines +/-
	(WIP) Timed build from Zuul	integration/config	master	+19 -0

Related Objects

Mentioned In: T70113: Alert when Zuul/Gearman queue is stalled

Event Timeline

• chasemp created this task. · Feb 14 2017, 1:43 PM

Restricted Application added a subscriber: Aklapper. · Feb 14 2017, 1:43 PM

• chasemp mentioned this in T70113: Alert when Zuul/Gearman queue is stalled. · Feb 14 2017, 1:43 PM

Paladox added a project: Continuous-Integration-Infrastructure. · Feb 14 2017, 3:02 PM

Paladox subscribed.

hashar subscribed. · Feb 16 2017, 7:34 PM

OK, here is the random crap idea.

In Nodepool, define a new label attached to an image, with the number of ready instances set to zero.

Craft a job on Jenkins that runs on some non-Nodepool slave. Make it spawn a child job that runs on the label defined above. Have the parent time out after X minutes and report back to IRC / email / whatnot.

The gotcha: if the job is triggered by the Jenkins scheduler, I don't think Nodepool will notice there is demand for the associated label. Nodepool checks the demand by asking the Zuul gearman server. Gotta verify that part.
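The parent job idea above (spawn a child build, give up after X minutes, report the outcome) could be sketched roughly as follows. This is only an illustration in Python, not anything that was implemented: `watch_child_build`, `trigger` and `poll` are hypothetical names, and how the child build is actually started or queried on Jenkins is deliberately left to the caller.

```python
import time
from typing import Callable, Optional


def watch_child_build(
    trigger: Callable[[], None],
    poll: Callable[[], Optional[bool]],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start a child build and wait for it, giving up after timeout_s.

    trigger() starts the child build (hypothetical, supplied by the caller).
    poll() returns None while the build is still queued or running,
    True once it succeeded and False once it failed.
    Returns "success", "failure" or "timeout".
    """
    trigger()
    deadline = clock() + timeout_s
    while True:
        result = poll()
        if result is not None:
            return "success" if result else "failure"
        if clock() >= deadline:
            # The child never completed in time, e.g. no instance
            # was ever spawned for its label.
            return "timeout"
        sleep(interval_s)
```

The returned string is what the parent would relay to IRC or email. A "timeout" result is the interesting one here, since it is the symptom of a label that never gets an instance.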

Tested. A job scheduled directly by Jenkins bypasses the Zuul gearman server as expected, and thus Nodepool can't see that there is demand for it. The build completion does emit an event over ZeroMQ, and Nodepool garbage collects the instance as expected.

Plan B: get Zuul to run the job on a timer. But that is a bit scarier.

Change 338179 had a related patch set uploaded (by Hashar):
(WIP) Timed build from Zuul

https://gerrit.wikimedia.org/r/338179

gerritbot added a project: Patch-For-Review. · Feb 16 2017, 9:29 PM

https://gerrit.wikimedia.org/r/338179 is an implementation of plan B. The commit message explains it all, but I don't think it is going to work properly.

Change 338179 abandoned by Hashar:
(WIP) Timed build from Zuul

https://gerrit.wikimedia.org/r/338179

hashar edited projects, added Wikimedia-Incident; removed Patch-For-Review. · May 6 2019, 6:12 PM

hashar moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. · Apr 28 2020, 9:50 PM

This was notably to ensure Nodepool was working reliably. At the time, Chase went with an end-to-end test on WMCS infra to ensure that images were bootable, plus some monitoring on that side, which addressed the concern.

The infra has changed since then, and we at least have some alerts when the queue grows. That is sufficient.
