Implement "Jenkins uptime" KPI
Closed, Declined (Public)


Goal: none of that

Purpose: Jenkins being down is bad

How: not sure...

Event Timeline

greg created this task. Aug 11 2015, 11:52 PM
greg raised the priority of this task from to Needs Triage.
greg updated the task description.
greg added a project: Release-Engineering-Team.
greg added subscribers: greg, Aklapper.
hashar triaged this task as Low priority. Aug 26 2015, 10:33 AM
hashar added a subscriber: hashar.

Jenkins is only one piece of the toolchain and it is rarely down (though I have no metric to back up that claim).

The most troubling issues we encountered over a year were:

  • Gerrit lagging on some DB reads. This causes miscellaneous failures, though it is pretty rare.
  • Beta cluster jobs being deadlocked, which got fixed upstream.
  • Zuul not retriggering enqueued jobs when the Gearman server died (I fixed it back in October 2014).

I think this KPI should be redefined to measure whether the whole CI toolchain works properly. Zuul supports triggering a job on a cron schedule (say, every 5 minutes); we could then emit a stat reporting how long the dummy job took and whether it succeeded. I am not sure how Zuul behaves when something is down. I suspect it will enqueue a job every 5 minutes and run them all once the service is back, which might clutter the stats.
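As a rough sketch of the stat-emitting part, assuming a statsd-style collector listening on UDP: the dummy job could be wrapped so that each run produces one timing metric and one success or failure counter. The metric prefix `ci.dummy_job`, the helper names, and the `localhost:8125` address are all made up for illustration and are not taken from our setup.

```python
import socket
import time
from typing import Callable, List


def build_statsd_lines(prefix: str, succeeded: bool, elapsed_ms: float) -> List[str]:
    """Format one timing metric and one outcome counter in statsd line protocol."""
    outcome = "success" if succeeded else "failure"
    return [
        f"{prefix}.duration:{elapsed_ms:.0f}|ms",
        f"{prefix}.{outcome}:1|c",
    ]


def run_probe(job: Callable[[], None], prefix: str = "ci.dummy_job") -> List[str]:
    """Run the dummy job once and return the metric lines describing that run."""
    start = time.monotonic()
    try:
        job()
        succeeded = True
    except Exception:
        # Any exception counts as a failed run of the toolchain probe.
        succeeded = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return build_statsd_lines(prefix, succeeded, elapsed_ms)


def send_lines(lines: List[str], host: str = "localhost", port: int = 8125) -> None:
    """Send each metric line as a UDP datagram (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
```

Something like `send_lines(run_probe(some_dummy_job))` would then run on each scheduled trigger. This does nothing about the backlog concern: if queued runs all execute at once after an outage, they would still show up as a burst of data points.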

From our team meeting yesterday, I guess this is a low priority for now.

greg moved this task from INBOX to Backlog on the Release-Engineering-Team board. Sep 10 2015, 4:36 AM

Maybe later? I don't think Jenkins uptime by itself is very important. It has been pretty robust over the years, apart from a few incidents, including the host machine being dead for a day. Jenkins is not the most crucial part, and we would need some kind of full-stack test to assert that the whole stack works properly.

greg closed this task as Declined. May 3 2017, 10:55 PM

We already have monitoring for Zuul, and that covers the "business purpose" behind this proposal. Not needed anymore.