
Alert when Zuul/Gearman queue is stalled
Open, Medium, Public

Description

Exact limit we can bikeshed/adjust as needed, but, alerting on the fact that merges are taking "too long" is needed. We currently just use the "someone needs to complain on IRC" method, which isn't good customer service.

Icinga is probably the right place for the alert.

A representation of the Zuul Gearman queue:

Details

Reference
bz68113
Related Gerrit Patches:

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:37 AM
bzimport set Reference to bz68113.
bzimport added a subscriber: Unknown Object (MLST).
greg created this task. Jul 16 2014, 5:12 PM

Created attachment 15993
Gearman queue metrics

The Zuul/Jenkins interaction is handled over Gearman. The Gearman server is embedded in Zuul and reports metrics to Graphite via statsd.

The attached graph shows over a month:

  • blue: overall size of Gearman functions pending/running
  • green: functions running (i.e. a Jenkins job is proceeding)
  • red: functions waiting

The graph URL is:
http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1405950653.273&from=-1month&target=zuul.geard.queue.total.value&target=zuul.geard.queue.running.value&target=zuul.geard.queue.waiting.value

Attached:

Created attachment 15994
Gearman queue on July 15th 2014

Attached are the states of the Gearman queues on July 15th:

  • blue: overall size of Gearman functions pending/running
  • green: functions running (i.e. a Jenkins job is proceeding)
  • red: functions waiting

The huge spike on July 15th seems to have been some kind of incident, which is probably what caused this bug to be filed.

We could use the Icinga check_graphite plugin or some confidence band to detect that there are too many jobs in the queue and alert whenever that is the case.

I am not sure what went wrong there. Most probably Jenkins jobs were no longer properly registered/runnable in the Gearman server, which caused Gearman to enqueue them and wait for a worker (i.e. Jenkins) to execute the job.

I guess we could alarm whenever zuul.geard.queue.waiting.value stays above 5 for more than 15 minutes.
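For illustration, a minimal sketch of such a check, polling Graphite's render API directly rather than going through the Icinga plugin. The metric path and the "above 5 for 15 minutes" threshold come from the suggestion above; everything else is illustrative and not what Icinga would actually run.

# Sketch only: alert when zuul.geard.queue.waiting has stayed above the threshold
# for the whole window.
import json
import urllib.request

GRAPHITE_RENDER = "https://graphite.wikimedia.org/render"
TARGET = "zuul.geard.queue.waiting.value"   # metric path used in the graph above
THRESHOLD = 5                               # waiting jobs
WINDOW = "-15min"                           # sustained for the whole window

url = "{}?format=json&from={}&target={}".format(GRAPHITE_RENDER, WINDOW, TARGET)
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

# Graphite returns [[value, timestamp], ...]; value is null/None when nothing was recorded.
points = [v for v, _ in series[0]["datapoints"] if v is not None]

if points and all(v > THRESHOLD for v in points):
    print("CRITICAL: Gearman waiting queue above {} for 15 minutes".format(THRESHOLD))
else:
    print("OK")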

Attached:

Note: a similar graph is available on our Zuul homepage at https://integration.wikimedia.org/zuul/ . That is the graph titled "Zuul Geard job queue (8 hours)".

It shows exactly the same pattern as the graphs I attached previously: a huge spike of waiting jobs.

demon removed a subscriber: demon. Dec 16 2014, 6:05 PM
Krinkle renamed this task from "Alert when time to merge (from +2 in Gerrit to merged in git) exceeds a known bad limit" to "Alert when Zuul/Gearman queue is stalled". Apr 14 2015, 2:47 PM
Krinkle updated the task description. (Show Details)
Krinkle added a project: observability.
Krinkle set Security to None.
Krinkle moved this task from Untriaged to Backlog on the Continuous-Integration-Infrastructure board.
Krinkle removed a subscriber: Unknown Object (MLST).

Per @hashar, once we determine a good metric and threshold, we should be able to use Shinken to monitor the Graphite query.

Paladox added a subscriber: Paladox. Jun 3 2016, 8:31 PM

@greg or @hashar, can we set this as high priority, please? It is annoying for jobs to be left there for hours before anyone notices that the queue froze.

hashar changed the task status from Open to Stalled. Jul 20 2016, 11:30 AM

@Paladox the problem is determining a metric to detect that the processing of jobs has stalled. One can query the zuul-server RPC:

$ zuul show running-jobs
+-------------------------------------+---------+----------------+---------+
|               Job Name              |  Worker |   Start Time   |  Result |
+-------------------------------------+---------+----------------+---------+
|     operations-puppet-tox-jessie    | Unknown | 34 seconds ago | SUCCESS |
|             rake-jessie             | Unknown | 34 seconds ago | SUCCESS |
| operations-puppet-puppetlint-strict | Unknown | 34 seconds ago |   None  |
|       operations-puppet-typos       | Unknown | 34 seconds ago | SUCCESS |
|        operations-puppet-doc        | Unknown | 15 seconds ago |   None  |
+-------------------------------------+---------+----------------+---------+

Analyze the public Zuul status page at https://integration.wikimedia.org/zuul/status.json, which has, for each job, the time it started and its actual launch time.
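As a rough sketch of that approach: the field names used here (pipelines, change_queues, heads, enqueue_time, jobs, start_time) are assumptions about the Zuul status format and would need checking against the real payload.

# Sketch only: walk status.json and report changes whose jobs have not launched
# after some grace period.
import json
import time
import urllib.request

STATUS_URL = "https://integration.wikimedia.org/zuul/status.json"
GRACE_SECONDS = 15 * 60

with urllib.request.urlopen(STATUS_URL) as resp:
    status = json.load(resp)

now = time.time()
for pipeline in status.get("pipelines", []):
    for queue in pipeline.get("change_queues", []):
        for head in queue.get("heads", []):
            for change in head:
                enqueue_ms = change.get("enqueue_time")  # assumed: milliseconds since epoch
                if enqueue_ms is None:
                    continue
                waited = now - enqueue_ms / 1000.0
                launched = any(job.get("start_time") for job in change.get("jobs", []))
                if not launched and waited > GRACE_SECONDS:
                    print("STALLED: {} in {} for {:.0f}s".format(
                        change.get("id"), pipeline["name"], waited))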

Maybe also look at the median time a change stays in the Zuul queues. In Graphite that would be, for example, sumSeries(zuul.pipeline.*.resident_time.median). Over the past 5 weeks:

Looks like determining proper metrics is T139771: Identify metric (or metrics) that gives a useful indication of user-perceived (Wikimedia developer) service of CI. Once we have them, it is fairly trivial to add monitoring probes for each of them.

greg added a subscriber: thcipriani. Nov 4 2016, 5:21 PM

@hashar / @thcipriani: All the blockers are closed; do we have the data we need to alert on now?

hashar changed the task status from Stalled to Open. Nov 4 2016, 5:53 PM

An alarm against sumSeries(zuul.pipeline.*.resident_time.median) might work, though the metric is reported once the change is unblocked.

https://graphite.wikimedia.org/render/?width=974&height=585&_salt=1478281873.483&target=sum(zuul.pipeline.*.resident_time.median)&from=-7days

The gearman queue and gearman workers on https://grafana.wikimedia.org/dashboard/db/releng-zuul are good candidates as well. The first indicates the queue is not being processed fast enough; the second would indicate a lack of slaves to run on.

greg added a subscriber: chasemp. Feb 14 2017, 12:11 AM

The gearman queue and gearman workers on https://grafana.wikimedia.org/dashboard/db/releng-zuul are good candidates as well. The first indicates the queue is not being processed fast enough; the second would indicate a lack of slaves to run on.

Looking at the last 60 days for the queue:

the big outlier was on 2017-01-31:


(SAL Entry): This was an OpenStack / WMF Labs issue. @chasemp wrote a monitoring script based on the outage. @chasemp what was it that you ended up monitoring, exactly?

@chasemp has added monitoring of Nodepool instance state age (https://gerrit.wikimedia.org/r/#/c/335373/), e.g. alarming when an instance has been in the delete state for more than 6 minutes, which usually indicates the OpenStack API is unable to honor the deletion requests. There is also an end-to-end test / monitor for Nova, https://gerrit.wikimedia.org/r/#/c/336413/ :-}


An alarm against sumSeries(zuul.pipeline.*.resident_time.median) might work, though the metric is reported once the change is unblocked.

That represents how long a change has been in Zuul, from the time the event is received until the change exits Zuul (i.e. it is reported back in Gerrit). But the time is reported upon completion of the event, so that is only useful after the fact.

The gearman queue and gearman workers on https://grafana.wikimedia.org/dashboard/db/releng-zuul are good candidates as well. The first indicates the queue is not being processed fast enough; the second would indicate a lack of slaves to run on.

That is shown on the spike Greg captured a screenshot for:

The big red surge is alias(consolidateBy(zuul.geard.queue.waiting, 'sum'), 'Waiting'), which is the number of Gearman work requests that are waiting in the queue.

Change 337552 had a related patch set uploaded (by Hashar):
zuul: monitor Gearman queue growing out of control

https://gerrit.wikimedia.org/r/337552

chasemp added a comment (edited). Feb 14 2017, 1:39 PM

(SAL Entry): This was an OpenStack / WMF Labs issue. @chasemp wrote a monitoring script based on the outage. @chasemp what was it that you ended up monitoring, exactly?

@chasemp has added monitoring of Nodepool instance state age (https://gerrit.wikimedia.org/r/#/c/335373/), e.g. alarming when an instance has been in the delete state for more than 6 minutes, which usually indicates the OpenStack API is unable to honor the deletion requests. There is also an end-to-end test / monitor for Nova, https://gerrit.wikimedia.org/r/#/c/336413/ :-}

Pretty much as @hashar describes, with the note that this is a poor proxy for the health of the pipeline upstream from nodepool (which is where I see zuul/gearman). When nodepool 'loses' a node, or some other state oddity occurs, I'm trying to surface it. That is my best approximation at the moment for the health of the nodepool<->openstack API interaction. I did extend a bit beyond that by adding the alarm for the used state, which is really just throwing a fail on 'test taking too long'. https://gerrit.wikimedia.org/r/#/c/335373/ is still ad hoc at the moment and I have been trying to find time to make it an ongoing service. Once this is stable we have plans to make it page us (Labs or WMCS).

One thing I want to note is that when making the nodepool alert actionable, it seemed no one is paged (directly) on CI issues or outages. I think historically (the last 2 years or so) one of the Labs folks often responds off hours based on IRC, but if we are serious about this we should consider paging Release-Engineering-Team folks (and/or both). I know that's a bigger statement than this task, and outside of my purview, but it's been an ongoing narrative I think. I only bring it up because this task has 'alert' in the title but I don't see the who or how being discussed.

As a precursor to my full stack nova-api test I did chat with @thcipriani a bit about the possibility of a similar mechanism that continually validates the CI pipeline as a controlled test (or tests). Usually our problem is: we know things are broken, but not since when or at what stage. A huge hurdle here is that, since all this processing is ad hoc, there is no definite expected activity, and we end up farming for test cases and the last working state. This is really similar to the nova-api (general instance management) case.

edit: I made T158054, as I had previously promised to do (and failed to until now) :)

Krinkle removed a subscriber: Krinkle.
greg added a comment. Feb 14 2017, 7:29 PM

One thing I want to note is that when making the nodepool alert actionable, it seemed no one is paged (directly) on CI issues or outages. I think historically (the last 2 years or so) one of the Labs folks often responds off hours based on IRC, but if we are serious about this we should consider paging Release-Engineering-Team folks (and/or both). I know that's a bigger statement than this task, and outside of my purview, but it's been an ongoing narrative I think. I only bring it up because this task has 'alert' in the title but I don't see the who or how being discussed.

We (RelEng) have not historically been an "on-call" team. We do wake Antoine up, or call him during dinner, at times when things fall over that the rest of the team can't figure out, but we don't have any paging system in use, at least for CI.

In production we have Chad getting notices for Gerrit, but I believe there is a limitation with our Icinga config that does not allow him to get only Gerrit-related SMSes rather than all SMSes (there was a task about this...).

I am in no way trying to push over a first domino here, I swear. The logistics of becoming an on-call team would be pretty big I think. So I hope you know I'm not trying to take a cheap shot there or anything of the kind. My main thinking is: what good are alerts that no one sees, and if someone is meant to see them how is that meant to happen now? If that rabbit hole leads to a serious on-call conversation that has to be OK, but I feel like I have to ask in good faith. It's cool if that's IRC and CI support is best effort or something of the kind, but we (labs people?) are trying to be really deliberate here in the best sense.

greg added a comment. Feb 14 2017, 8:01 PM

Right right :) For us I think a good first pass would be email+IRC alerts and a best effort tier SLA.

Change 337552 merged by Dzahn:
zuul: monitor Gearman queue growing out of control

https://gerrit.wikimedia.org/r/337552

After a night of monitoring, here are the alarms reported on IRC, alongside the associated graph (https://grafana-admin.wikimedia.org/dashboard/db/zuul-gearman?panelId=20&fullscreen):

[01:03:55] PROBLEM Anomaly detected: 10 data above and 0 below the confidence bounds
[01:10:55] PROBLEM Anomaly detected: 10 data above and 7 below the confidence bounds
[01:13:55] RECOVERY No anomaly detected
[02:31:56] PROBLEM Anomaly detected: 10 data above and 0 below the confidence bounds
[02:37:55] RECOVERY No anomaly detected
[02:40:55] PROBLEM Anomaly detected: 16 data above and 0 below the confidence bounds
[02:46:55] PROBLEM Anomaly detected: 16 data above and 0 below the confidence bounds
[03:06:55] RECOVERY No anomaly detected
[03:57:55] PROBLEM Anomaly detected: 14 data above and 0 below the confidence bounds
[04:00:55] PROBLEM Anomaly detected: 14 data above and 0 below the confidence bounds
[04:15:55] PROBLEM Anomaly detected: 29 data above and 0 below the confidence bounds
[05:03:55] PROBLEM Anomaly detected: 10 data above and 0 below the confidence bounds
[05:11:55] PROBLEM Anomaly detected: 11 data above and 0 below the confidence bounds
[05:16:55] PROBLEM Anomaly detected: 11 data above and 0 below the confidence bounds
[06:10:55] RECOVERY No anomaly detected
[06:28:55] PROBLEM Anomaly detected: 13 data above and 0 below the confidence bounds
[06:32:55] PROBLEM Anomaly detected: 18 data above and 0 below the confidence bounds
[07:11:55] PROBLEM Anomaly detected: 28 data above and 2 below the confidence bounds
[07:12:55] RECOVERY No anomaly detected

A zoom from 5:00 to 7:30:

View from midnight:

Something is off. Maybe it is because the confidence band is null while the waiting function count is 0.

The code is faulty. Whenever the JSON returns a null value, Python turns it into None, and in Python 2, 0 is greater than None:

>>> json.loads('[null]')
[None]
>>> 0 > None
True

Hence, if the upper band is null, a data point at 0 is considered above the upper band and triggers a false alarm.


Given we invoke:

./check_graphite -U https://graphite.wikimedia.org check_anomaly --check_window 30 -W 5 -C 10 zuul.geard.queue.waiting --over

That invokes https://graphite.wikimedia.org/render?format=json&target=zuul.geard.queue.waiting&target=holtWintersConfidenceBands%28zuul.geard.queue.waiting%2C+5%29

The code looks like:

modules/nagios_common/files/check_commands/check_graphite
class Anomaly(GraphiteCheck):
    def parse_result(self, result):
        my_slice = self.check_window * -1
        # 30 data points for zuul.geard.queue.waiting
        measures = result[0]['datapoints'][my_slice:]
        # 30 data points for the Holt-Winters upper confidence band
        upperbound = result[2]['datapoints'][my_slice:]

        for i in xrange(self.check_window):
            data, time = measures[i]
            u = upperbound[i][0]
            # Given:
            # - data point is 0
            # - upper band is None
            # data >= u  ==> True
            # And thus the data point at 0 is considered above the upper band!
            if data >= u:
                datapoints['higher'].append((time, data))
            else:
                datapoints['ok'].append((time, data))
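
For illustration, a minimal sketch of how that loop could skip data points with no usable upper band; variable names mirror the excerpt above, and this is not the fix that was eventually pursued.

        for i in range(self.check_window):
            data, timestamp = measures[i]
            upper = upperbound[i][0]
            if data is None or upper is None:
                # No measurement or no confidence band for this minute: treat as OK
                # rather than letting a 0-vs-None comparison flag a false anomaly.
                datapoints['ok'].append((timestamp, data))
            elif data >= upper:
                datapoints['higher'].append((timestamp, data))
            else:
                datapoints['ok'].append((timestamp, data))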

Hacking the script a bit:

+         for upperconf in result[2]['datapoints']:
+             print time.strftime("| %Y-%m-%d %H:%M:%S | ", time.gmtime(upperconf[1])), upperconf[0]

The upper bound is actually NEGATIVE sometimes:

| 2017-02-16 06:16:00 | 0.00313054844844
| 2017-02-16 06:17:00 | -0.00130196224777
| 2017-02-16 06:18:00 | -0.00527630827897
| 2017-02-16 06:19:00 | -0.0088387136914
| 2017-02-16 06:20:00 | -0.0120307142524
| 2017-02-16 06:21:00 | -0.0148896353821
| 2017-02-16 06:22:00 | -0.017449021169
| 2017-02-16 06:23:00 | -0.0197390194916
| 2017-02-16 06:24:00 | -0.0217867277515
| 2017-02-16 06:25:00 | -0.0236165032579
| 2017-02-16 06:26:00 | -0.0252502418915
| 2017-02-16 06:27:00 | -0.0267076282988
| 2017-02-16 06:28:00 | -0.0280063605374
| 2017-02-16 06:29:00 | -0.0291623517888
| 2017-02-16 06:30:00 | -0.0301899114897
| 2017-02-16 06:31:00 | -0.0311019079894
| 2017-02-16 06:32:00 | -0.0319099146247
| 2017-02-16 06:33:00 | -0.03262434091
| 2017-02-16 06:34:00 | -0.0332545503647
| 2017-02-16 06:35:00 | -0.0326098453158
| 2017-02-16 06:36:00 | -0.0317809855706
| 2017-02-16 06:37:00 | -0.031021737407
| 2017-02-16 06:38:00 | -0.0303256433479
| 2017-02-16 06:39:00 | 22.9703131552

Change 338095 had a related patch set uploaded (by Hashar):
check_graphite anomaly option to set minimum upper band

https://gerrit.wikimedia.org/r/338095

https://gerrit.wikimedia.org/r/#/c/338095/ lets one specify to check_graphite a minimum value for the Holt-Winters upper confidence band. The patch sets it to 1, which will prevent the flapping. It can even be used as a minimum threshold under which NOT to alert :}
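The idea, as a rough sketch (the option name and plumbing in the actual patch may differ):

def effective_upper_band(upper, minimum=1.0):
    # Floor the Holt-Winters upper band at a configurable minimum (1 per the
    # patch description) so a near-zero or negative band on an idle queue
    # cannot make a handful of waiting jobs look anomalous.
    if upper is None:
        return minimum
    return max(upper, minimum)

print(effective_upper_band(-0.0332545503647))  # -> 1.0
print(effective_upper_band(22.9703131552))     # -> 22.9703131552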

A few hours ago we had an outage and the queue piled up quickly from roughly 17:30 until 19:00. The Icinga alarms were:

18:27:10 UTC problem
18:33:10 UTC recovery
19:12:10 UTC problem
19:25:10 UTC recovery

The queue overflowed the confidence band from 18:17 to 18:23 and again from 18:54 to 19:09. Looking at the confidence band graph, one can see the upper bound closely follows the rise in the queue, so the Holt-Winters forecast stays happy:

Looking at the actual queue it is obvious that it is overloaded:

I guess I don't know how to use the Holt-Winters forecast used by monitoring::graphite_anomaly.

Yesterday @Dzahn proposed to just use a threshold of waiting jobs being over some arbitrary value. That is straightforward and much more reliable.

Maybe just doing a ratio of jobs waiting / jobs running would be sufficient. https://grafana.wikimedia.org/dashboard/db/zuul-gearman?from=now-6h&to=now-5m&panelId=21&fullscreen

The blue bar is waiting/running on the right y-axis. A ratio of 10 or so would have triggered the alarm and recovered when the queue started processing again. It is not perfect though: if we have a queue of 300 jobs but properly run 15 of them, the ratio is 20 although the service is processing again.
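As a sketch of that idea, with an extra guard for the caveat above: the ratio of 10 comes from the comment, while the minimum-running-jobs cutoff is a hypothetical value.

# Sketch only: alert on the waiting/running ratio, but only when few jobs are
# actually running, so a big backlog that is already being drained
# (e.g. 300 waiting / 15 running, ratio 20) does not keep alerting.
RATIO_LIMIT = 10      # from the comment above
MIN_RUNNING = 5       # hypothetical: below this, the ratio is considered meaningful

def queue_is_stalled(waiting, running):
    ratio = waiting / running if running else float("inf")
    return ratio >= RATIO_LIMIT and running < MIN_RUNNING

print(queue_is_stalled(waiting=50, running=1))    # True: queue piling up, almost nothing runs
print(queue_is_stalled(waiting=300, running=15))  # False: large backlog but jobs are being processed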

Scheduled a downtime in Icinga for one week to disable notifications while we are still working on it. It will start to notify again on Feb 23rd.

Change 341402 had a related patch set uploaded (by hashar):
[operations/puppet] Change Zuul Gearman alarm to a simple threshold

https://gerrit.wikimedia.org/r/341402

Change 341402 merged by Dzahn:
[operations/puppet] Change Zuul Gearman alarm to a simple threshold

https://gerrit.wikimedia.org/r/341402

hashar closed this task as Resolved. Mar 7 2017, 10:36 AM
hashar claimed this task.

There is now a basic threshold alert that fires when there are more than a hundred functions waiting for 5+ minutes.

Change 338095 abandoned by Hashar:
check_graphite anomaly option to set minimum upper band

Reason:
Would probably still be worth having, but it is no longer needed for Zuul/Gearman monitoring.

https://gerrit.wikimedia.org/r/338095

Change 551347 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] Revert "zuul: Remove zuul Gearman queue alert"

https://gerrit.wikimedia.org/r/551347

Krinkle reopened this task as Open. Nov 17 2019, 10:54 PM

Change 551347 merged by Dzahn:
[operations/puppet@production] Revert "zuul: Remove zuul Gearman queue alert"

https://gerrit.wikimedia.org/r/551347

The alarm got removed via https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550943/ for some reason ...

It wasn't really actionable (or if it was, there was no runbook on wikitech for anyone to react to it). It's been re-added, but we would love a runbook so that people can react to it.