
Identify metric (or metrics) that gives a useful indication of user-perceived (Wikimedia developer) service of CI
Closed, Resolved · Public

Description

Purpose: We need a single metric (or a couple of metrics) that will let us know whether, generally, our developers are experiencing longer wait times before merges/test completion.

Where it will be used: This will hopefully inform the size of our pool of Nodepool instances. We'll know when we need to increase it, or whether we can safely decrease it to save wmflabs capacity. It should be able to tell us something like "If we reduce the pool by 5, there will be no user-noticeable impact except during our 3 busiest hours of the day." Conversely, we'll be able to tell how much developer time we saved by increasing the pool size (by comparing weekly data before and after the change).
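As a back-of-the-envelope illustration of that pre/post comparison (all numbers below are made up, not measurements):

jobs_per_week = 20000      # hypothetical weekly job volume
wait_before_s = 180        # hypothetical median wait before a pool change
wait_after_s = 60          # hypothetical median wait after the change

hours_saved = jobs_per_week * (wait_before_s - wait_after_s) / 3600.0
print("roughly %.0f developer-hours of waiting avoided per week" % hours_saved)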



Event Timeline

greg created this task. Jul 8 2016, 5:11 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 8 2016, 5:11 PM
greg updated the task description. Jul 8 2016, 5:18 PM
hashar added a subscriber: hashar. Jul 18 2016, 4:44 PM
Andrew added a subscriber: Andrew.

From T70113#2479567:

Maybe also look at the median time a change stays in the Zuul queues. In Graphite that would be, for example, sumSeries(zuul.pipeline.*.resident_time.median). Over the past 5 weeks:
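For anyone wanting to poke at that series outside Grafana, here is a minimal sketch against the Graphite render API; the graphite.wikimedia.org endpoint is an assumption, and the five-week window mirrors the quote above:

import statistics
import requests

GRAPHITE = "https://graphite.wikimedia.org/render"   # assumed endpoint
TARGET = "sumSeries(zuul.pipeline.*.resident_time.median)"

resp = requests.get(
    GRAPHITE,
    params={"target": TARGET, "from": "-5weeks", "format": "json"},
    timeout=30,
)
resp.raise_for_status()

# The render API returns [{"target": ..., "datapoints": [[value, timestamp], ...]}]
values = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
print("resident time over 5 weeks: median=%.0f, max=%.0f"
      % (statistics.median(values), max(values)))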

greg moved this task from INBOX to Next on the Release-Engineering-Team board. Jul 21 2016, 8:55 PM

@hashar where on that graph is the reduction in concurrent jobs for Nodepool? I don't see anything there that screams 'turn me back up!'

Inspired by the graph above, I created a Graph panel at https://grafana.wikimedia.org/dashboard/db/releng-kpis?panelId=5&fullscreen

It reports the time a job waits from when it is known to Zuul until it starts running. The filter restricts it to Jenkins labels ci-*, which are the ones used for Nodepool instances.

The spikes are the conjunction of San Francisco waking up, Europe being active, and the SWAT and deploy slots. That should correlate with the number of jobs waiting in Gearman (though that covers any job, including Zuul internal jobs) and the max time to launch an instance: https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen

The first graph, based on a one-hour moving average, might be a good high-level representation of the perceived response time. I have asked the Release-Engineering-Team internal mailing list for more feedback / tuning.

The second graph, https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen, is on a per-label basis.

For reference, the Graphite query for the first graph is:

alias(movingAverage(consolidateBy(maxSeries(zuul.pipeline.*.label.ci-*.wait_time.upper), 'max'), '1hour'), 'One hour moving average')

Or, expanded for readability:

alias(
     movingAverage(
          consolidateBy(
               maxSeries(zuul.pipeline.*.label.ci-*.wait_time.upper),
          'max'),
     '1hour'),
'One hour moving average'
)
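To make the semantics of that query concrete, here is a small sketch of a one-hour moving average over per-minute wait-time samples; the data below is hypothetical:

from collections import deque

def one_hour_moving_average(samples, window=60):
    """Yield (timestamp, mean of the last `window` values), one sample per minute."""
    recent = deque(maxlen=window)
    for ts, value in samples:
        recent.append(value)
        yield ts, sum(recent) / len(recent)

# Hypothetical data: wait times jump from ~30s to ~300s when the pool starves.
samples = [(i * 60, 30 if i < 90 else 300) for i in range(180)]
for ts, avg in list(one_hour_moving_average(samples))[::30]:
    print("t=%5ds  1h avg wait=%6.1fs" % (ts, avg))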

@hashar where on that graph is the reduction in concurrent jobs for Nodepool? I don't see anything there that screams 'turn me back up!'

Any spike on the red graphs shows a job waiting minutes for a node to spawn and be pooled in Jenkins. Some jobs start nearly instantly because a node is already available in the pool and consumed right away; others have to wait for an instance to spawn, which takes roughly a minute all included. Anything above that minute means the pool is starved.
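A rough sketch of that heuristic (the one-minute spawn time comes from the comment above; the margin is an arbitrary assumption, not a tuned value):

SPAWN_TIME_S = 60        # roughly one minute to spawn an instance and pool it in Jenkins
STARVATION_MARGIN = 1.5  # hypothetical safety factor

def is_pool_starved(wait_time_s):
    """True when a job waited noticeably longer than a single instance spawn."""
    return wait_time_s > SPAWN_TIME_S * STARVATION_MARGIN

assert not is_pool_starved(5)    # a ready node was consumed right away
assert not is_pool_starved(70)   # waited for one instance to spawn
assert is_pool_starved(240)      # queued behind other jobs: the pool is starved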

We have stopped migrating jobs from the permanent slaves to the Nodepool ones pending additional capacity.

Another thing we are missing is having Nodepool report metrics to statsd (T111496), especially the number of instances per state (e.g. 5 ready, 3 building, 2 deleting). That would give a better view of the state of the pool.
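Purely for illustration (this is not Nodepool's actual reporting code, which is what T111496 tracks), per-state gauges could look roughly like this with the Python statsd client; the host, port, metric prefix and state names are assumptions:

from collections import Counter
import statsd

client = statsd.StatsClient("localhost", 8125, prefix="nodepool")  # assumed host/prefix

def report_pool_state(instances):
    """Emit one gauge per instance state, e.g. nodepool.state.ready = 5."""
    counts = Counter(instance["state"] for instance in instances)
    for state in ("building", "ready", "used", "deleting"):
        client.gauge("state." + state, counts.get(state, 0))

report_pool_state([{"state": "ready"}, {"state": "ready"}, {"state": "building"}])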

chasemp added a subscriber: thcipriani. Edited Aug 12 2016, 10:35 PM

We had an outage for CI 2 nights ago, and during it we discovered that nodepool seems to wait only 1s before declaring the build of a VM faulty, then issuing a delete, and eventually churning on its own quota limitations. This happened because we upped the timeout allowance for instance creation, as we have larger and larger projects with relative rule sets.

During debugging we also discovered issues with quota tracking and nodepool: Nova seems to have no clear idea of the instance count for the project, displaying greater than 32k instances, and we were also fighting DNS leaks all over the place, making it unclear what is and is not an actual CI instance. I have a suspicion this DNS leak issue is related to the rate and tolerance of instance creation and lost messaging on the part of nodepool/rabbitmq.

I talked to @thcipriani and @greg briefly post-incident due to the difficult nature of debugging through this. I do not believe we can go any further with nodepool without addressing these issues, and I believe we are in agreement about this overall. The DNS leak cleanup we are doing periodically is getting way out of hand and seems like a canary for deeper issues.

edit: yuvi made a task for increasing headroom here: T142877

hashar closed this task as Resolved. Nov 3 2016, 9:08 PM
hashar claimed this task.

We now have a lot more metrics to assess the behavior of the Zuul/Nodepool stack.

https://grafana.wikimedia.org/dashboard/db/nodepool

  • shows the state of the pool
  • shows the rate of queries to the OpenStack API (tasks)
  • shows the time it takes to launch an instance

https://grafana.wikimedia.org/dashboard/db/releng-zuul

  • has the max wait time before a change's jobs start running. That covers Zuul internal overhead as well as the availability of executors to run a build.

We are well covered now.