We should track the mean time to merge, probably per repo, so we can see trends and the effects of changes (e.g. to test coverage or hardware).
Version: wmf-deployment
Severity: enhancement
greg | Jul 16 2014, 5:14 PM

Attachments:
- F2362837: render-5.png (Aug 25 2015, 8:54 PM)
- F2362908: render-6.png (Aug 25 2015, 8:54 PM)
- F2332350: render-4.png (Aug 24 2015, 8:21 PM)
- F14483: mwcore-postmerge-zuul-mean-time.png (Nov 22 2014, 3:37 AM)
| Status | Assigned | Task |
|---|---|---|
| Resolved | greg | T107905 Create KPIs for #releng-201516-Q2 |
| Resolved | hashar | T108750 Implement "Jenkins/Zuul queue wait" KPI |
| Resolved | hashar | T70113 Alert when Zuul/Gearman queue is stalled |
| Resolved | hashar | T70114 Track and graph mean time to merge |
Created attachment 15992
Mean time spent in Zuul/Jenkins by Mediawiki core changes after they are merged
Zuul reports a bunch of metrics to Graphite over statsd. The metrics are described at http://ci.openstack.org/zuul/statsd.html#metrics
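For context, statsd timers travel over UDP as plain-text packets in the `<name>:<value>|ms` format. A minimal sketch of what such a packet looks like (the helper name is hypothetical; the metric name is the one from the Zuul docs):

```python
def statsd_timer_packet(metric, elapsed_ms):
    """Format a statsd timer sample in the standard "<name>:<value>|ms"
    wire format that statsd aggregates and forwards to Graphite."""
    return f"{metric}:{int(elapsed_ms)}|ms"

packet = statsd_timer_packet(
    "zuul.pipeline.postmerge.mediawiki.core.resident_time", 734512)
```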
One of them is the time spent by a change in the queue (which is Zuul overhead plus the time to build all the jobs).
We could graph the mean time of jobs triggered after a merge, though I am not sure how helpful it is going to be. The metric is zuul.pipeline.postmerge.mediawiki.core.resident_time.mean
The attached graph represents the last three months. The URL is http://graphite.wikimedia.org/render/?width=856&height=600&_salt=1405949992.208&from=-3months&target=zuul.pipeline.postmerge.mediawiki.core.resident_time.mean
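The same render endpoint can also return raw datapoints instead of a PNG by adding `format=json`, which is handy for scripting. A small sketch of building such a URL (the helper name is illustrative; `target`, `from`, and `format` are standard Graphite render API parameters):

```python
import urllib.parse

GRAPHITE = "http://graphite.wikimedia.org"  # base URL from this task

def render_url(target, frm="-3months", fmt="json"):
    """Build a Graphite render API URL for one metric target.
    Use fmt="png" (the default on the server side) for the graph image."""
    qs = urllib.parse.urlencode({"target": target, "from": frm, "format": fmt})
    return f"{GRAPHITE}/render/?{qs}"

url = render_url("zuul.pipeline.postmerge.mediawiki.core.resident_time.mean")
```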
The time to run jobs depends on the number of jobs executing on the server; Zuul/Jenkins holds them until a slot frees up. So that is not very representative.
The resident_time metric reported by Zuul is the time delta between the change entering a pipeline and it leaving it (regardless of success). So that is related.
We have years of history! A couple of examples over the last two years for the gate-and-submit pipeline:
The same solely for mediawiki/core https://graphite.wikimedia.org/render/?width=960&height=563&_salt=1440446658.234&from=-2years&connectedLimit=&drawNullAsZero=true&lineMode=connected&target=zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper
I am not very confident in graphite/statsd aggregation, but that gives a rough idea.
You can even see that something happened in early April 2015 that slowed down the mediawiki-core gate, which went from less than 200 seconds to 500 seconds:
That is due to the switch from a real server + sqlite to labs instance + mysql.
I am pretty sure the mean/lower/upper aggregates are wrong in Graphite. They seem to always use average for consolidation :-(
Additional note: I am not sure what resident_time is. Potentially it is the difference between enqueue and dequeue time. If a change is re-enqueued in the gate-and-submit pipeline, it might well be a new item, and thus the enqueue time would be reset ...
I looked a bit more at it. When a change enters a Zuul pipeline, the time is saved. When the change leaves the pipeline (the jobs are completed and the result reported, the change is dequeued, ...), Zuul emits a timer metric to statsd in milliseconds.
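The behaviour described above can be sketched as follows (class and method names are hypothetical, not Zuul's actual internals): the enqueue timestamp is saved, and on dequeue the delta is reported in milliseconds.

```python
import time

class PipelineItem:
    """Sketch of how a resident_time timer value could be derived:
    the delta between a change entering and leaving a pipeline,
    reported in milliseconds."""

    def __init__(self):
        # Timestamp saved when the change enters the pipeline.
        self.enqueue_time = time.time()

    def resident_time_ms(self, dequeue_time=None):
        # Timer value emitted to statsd when the change is dequeued.
        if dequeue_time is None:
            dequeue_time = time.time()
        return int((dequeue_time - self.enqueue_time) * 1000)
```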
For the gate-and-submit pipeline, the timer is available in Graphite under zuul.pipeline.gate-and-submit.resident_time. Statsd buffers the metrics over a minute and reports back the lower and upper values (potentially the mean, stddev, percentiles, and median as well, but that apparently does not work).
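My understanding of that per-flush aggregation, sketched below (assumed behaviour mirroring the `.lower`/`.upper`/`.mean` series Graphite ends up with; the function name is illustrative):

```python
def flush_timer(samples_ms):
    """Aggregate one flush interval's worth of timer samples the way
    statsd reports them: lower = min, upper = max, plus mean and count."""
    s = sorted(samples_ms)
    return {
        "lower": s[0],
        "upper": s[-1],
        "mean": sum(s) / len(s),
        "count": len(s),
    }

# e.g. three changes dequeued within one minute (values in ms)
stats = flush_timer([480000, 510000, 2480000])
```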
By scaling the value by 0.001 we can thus get the min/max time in seconds that changes take over a period of time.
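Graphite can do that conversion server-side with its scale() function; for example (target expression assembled from the metric names above):

```
scale(zuul.pipeline.gate-and-submit.resident_time.upper, 0.001)
```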
A lame experiment over 3 months:
We might care about how fast the gate processes jobs, but for the sake of the exercise let's look at the slowness, which is what people usually complain about. The next graph covers one month:
The worst is 2480 seconds, or roughly 40 minutes, around August 20th; I can probably track down that specific change. Maybe a percentile would be more useful.
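A percentile would indeed dampen such outliers. A minimal sketch using the nearest-rank convention (one common definition; statsd's own percentile handling may differ):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are less than or equal to it."""
    s = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

# e.g. resident times in seconds for four changes
p95 = percentile([200, 300, 400, 2480], 95)
```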
Conclusion
It seems Zuul provides potentially useful metrics. I would need help to figure out what we really want to measure / track / report.
A change staying 40 minutes could be because it had to be rerun a few times when changes ahead of it in the gate queue failed for some reason: Zuul dequeues the failing change and rebuilds all the changes behind it again.
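That dependent-pipeline reset can be sketched as a list operation (function and variable names are illustrative, not Zuul's code): the failing change is removed, and everything behind it has its jobs restarted, which is what inflates resident_time.

```python
def reset_behind(queue, failed_change):
    """Simulate a gate queue reset: drop the failing change and
    return (remaining queue, changes whose jobs must be rebuilt)."""
    idx = queue.index(failed_change)
    remaining = queue[:idx] + queue[idx + 1:]
    restarted = queue[idx + 1:]  # everything behind the failure reruns
    return remaining, restarted

remaining, restarted = reset_behind(["A", "B", "C", "D"], "B")
```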
The mediawiki core job that runs PHPUnit under Zend takes 7-10 minutes IIRC (that is available in Zuul's statsd metrics); making it faster would definitely improve this metric.
I assigned to Antoine to run point, but now I'm unassigning. I'll let @hashar, @dduvall, and @zeljkofilipin figure it out :)