
Track and graph mean time to merge
Closed, Resolved · Public

Description

We should track the mean time to merge, probably per repo, so we can see trends and the effects of changes to e.g. test coverage or hardware.


Version: wmf-deployment
Severity: enhancement

Details

Reference
bz68114

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:37 AM
bzimport set Reference to bz68114.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 15992
Mean time spent in Zuul/Jenkins by Mediawiki core changes after they are merged

Zuul reports a bunch of metrics to Graphite over statsd. The metrics are described at http://ci.openstack.org/zuul/statsd.html#metrics

One of them is the time spent by a change in the queue (which is Zuul overhead plus the time to build all the jobs).

We could graph somewhere the mean time of jobs triggered after a merge, though I am not sure how helpful that is going to be. The metric is zuul.pipeline.postmerge.mediawiki.core.resident_time.mean

The attached graph represents the last three months. The URL is http://graphite.wikimedia.org/render/?width=856&height=600&_salt=1405949992.208&from=-3months&target=zuul.pipeline.postmerge.mediawiki.core.resident_time.mean
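
If we want numbers rather than a PNG, the same render endpoint can return JSON. A minimal sketch (assuming network access to graphite.wikimedia.org, the `requests` library, and the metric name mentioned above):

```python
# Minimal sketch: fetch the post-merge resident_time mean as JSON from the
# Graphite render API instead of a rendered PNG.
import requests

GRAPHITE = "http://graphite.wikimedia.org/render/"
METRIC = "zuul.pipeline.postmerge.mediawiki.core.resident_time.mean"

resp = requests.get(GRAPHITE, params={
    "target": METRIC,
    "from": "-3months",
    "format": "json",  # list of {"target": ..., "datapoints": [[value, ts], ...]}
})
resp.raise_for_status()

for series in resp.json():
    # Datapoints are [value, unix_timestamp] pairs; value is None for gaps.
    values = [v for v, _ts in series["datapoints"] if v is not None]
    if values:
        print(series["target"], "average over period:", sum(values) / len(values))
```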

The time to run jobs depends on the number of jobs executing on the server: Zuul/Jenkins holds them until a slot frees up, so the metric is not very representative.

Attached:

mwcore-postmerge-zuul-mean-time.png (600×856 px, 31 KB)

This is probably less of a priority than these two other (not yet implemented) KPIs:

The resident_time metric reported by Zuul is the time delta between the change entering a pipeline and it leaving it (regardless of success). So that is related.

We have years of history! A couple of examples over the last two years for the gate-and-submit pipeline:

https://graphite.wikimedia.org/render/?width=960&height=563&_salt=1440446561.166&from=-2years&connectedLimit=&drawNullAsZero=true&lineMode=connected&target=zuul.pipeline.gate-and-submit.resident_time.upper

The same, solely for mediawiki/core: https://graphite.wikimedia.org/render/?width=960&height=563&_salt=1440446658.234&from=-2years&connectedLimit=&drawNullAsZero=true&lineMode=connected&target=zuul.pipeline.gate-and-submit.mediawiki.core.resident_time.upper

I am not very confident in graphite/statsd aggregation, but that gives a rough idea.

You can even see that something happened in early April 2015 that slowed down the mediawiki-core gate, which went from less than 200 seconds to 500 seconds:

render-4.png (371×960 px, 56 KB)

That is due to the switch from a real server + SQLite to a labs instance + MySQL.

I am pretty sure the mean/lower/upper aggregates are wrong in Graphite. They seem to always use average for consolidation :-(

Additional note: I am not sure what resident_time is. Potentially it is the difference between enqueue and dequeue time. If a change is re-enqueued in the gate-and-submit pipeline, that might well be a new item and thus the enqueue time would be reset ...

I looked a bit more at it. When a change enters a Zuul pipeline, the time is saved. When the change leaves the pipeline (the jobs are completed and the result reported, the change is dequeued ...), Zuul emits a timer metric to statsd in milliseconds.
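
To make the mechanism concrete, here is an illustrative sketch; this is not Zuul's actual code, just the standard statsd "|ms" timer protocol over UDP, with hypothetical on_enqueue/on_dequeue hooks:

```python
# Illustrative only (not Zuul's implementation): record the enqueue time for
# a change, and on dequeue emit the elapsed milliseconds to statsd as a
# "|ms" timer over UDP.
import socket
import time

enqueue_times = {}  # change id -> monotonic enqueue time

def on_enqueue(change_id):
    enqueue_times[change_id] = time.monotonic()

def on_dequeue(change_id, pipeline="gate-and-submit",
               statsd_host="localhost", statsd_port=8125):
    elapsed_ms = int((time.monotonic() - enqueue_times.pop(change_id)) * 1000)
    metric = "zuul.pipeline.%s.resident_time:%d|ms" % (pipeline, elapsed_ms)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(metric.encode(), (statsd_host, statsd_port))
```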

For the gate-and-submit pipeline, the timer is available in Graphite under zuul.pipeline.gate-and-submit.resident_time. Statsd buffers the metrics over a minute and reports back the lower and upper values (potentially mean, stddev, percentiles and median as well, but that apparently does not work).

By scaling the value by 0.001 we can thus get the min/max time in seconds that changes take over a period of time.
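
For the record, a sketch of the statsd-style aggregation assumed here (toy numbers, not the real statsd source): timers buffered during one flush interval are reduced to lower/upper/mean, and Graphite's scale(metric, 0.001) then converts milliseconds to seconds.

```python
# Toy sketch of statsd-style timer aggregation: all timer values received
# during one flush interval are reduced to lower/upper/mean, which is what
# Graphite stores under resident_time.*.
buffered_ms = [91_000, 143_500, 188_000, 207_250]  # timers seen this minute

flush = {
    "zuul.pipeline.gate-and-submit.resident_time.lower": min(buffered_ms),
    "zuul.pipeline.gate-and-submit.resident_time.upper": max(buffered_ms),
    "zuul.pipeline.gate-and-submit.resident_time.mean": sum(buffered_ms) / len(buffered_ms),
}

for key, value_ms in flush.items():
    # scale(metric, 0.001) in Graphite does this same ms -> seconds conversion
    print(key, value_ms, "ms =", value_ms / 1000.0, "s")
```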

A lame experiment over 3 months:

render-5.png (600×1 px, 129 KB)

https://graphite.wikimedia.org/render/?width=1000&height=600&_salt=1440534394.791&from=-3months&drawNullAsZero=true&tz=UTC&xFormat=%25Y-%25m-%25d&minXStep=&vtitle=Seconds&vtitleRight=Count&yMinLeft=0&yUnitSystem=si&yDivisors=&logBase=&title=Zuul%20gate%20processing%20speed&target=cactiStyle(alias(secondYAxis(zuul.pipeline.gate-and-submit.total_changes.count)%2C%22Changes%20in%20gate%22))&target=cactiStyle(alias(scale(zuul.pipeline.gate-and-submit.resident_time.lower%2C0.001)%2C%22Fastest%20(seconds)%22))&target=cactiStyle(alias(scale(zuul.pipeline.gate-and-submit.resident_time.upper%2C0.001)%2C%22Slowest%20(seconds)%22))&target=summarize(scale(zuul.pipeline.gate-and-submit.resident_time.upper%2C0.001)%2C%221d%22%2C%22max%22)&target=summarize(scale(zuul.pipeline.gate-and-submit.resident_time.lower%2C0.001)%2C%221d%22%2C%22min%22)
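
Since the URL is hard to read, here is a sketch that rebuilds essentially the same render request from the decoded target expressions (taken from the URL above; display parameters trimmed, so it will not match byte for byte):

```python
# Rebuild the render request above from readable target expressions.
from urllib.parse import urlencode

PIPELINE = "zuul.pipeline.gate-and-submit"
targets = [
    'cactiStyle(alias(secondYAxis(%s.total_changes.count),"Changes in gate"))' % PIPELINE,
    'cactiStyle(alias(scale(%s.resident_time.lower,0.001),"Fastest (seconds)"))' % PIPELINE,
    'cactiStyle(alias(scale(%s.resident_time.upper,0.001),"Slowest (seconds)"))' % PIPELINE,
    'summarize(scale(%s.resident_time.upper,0.001),"1d","max")' % PIPELINE,
    'summarize(scale(%s.resident_time.lower,0.001),"1d","min")' % PIPELINE,
]

params = [("width", 1000), ("height", 600), ("from", "-3months"),
          ("drawNullAsZero", "true"), ("vtitle", "Seconds"),
          ("title", "Zuul gate processing speed")]
params += [("target", t) for t in targets]

print("https://graphite.wikimedia.org/render/?" + urlencode(params))
```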

We might care about how fast the gate processes jobs, but for the sake of the exercise let's look at slowness, which is what people usually complain about. The next graph shows, over a month:

  • in blue, the slowest time, roughly per hour
  • in green, the daily slowest times

The worst is 2480 seconds, roughly 40 minutes, around August 20th; I can probably track down that specific change. Maybe a percentile would be more useful (a rough sketch follows the graph below).

render-6.png (438×1 px, 86 KB)

https://graphite.wikimedia.org/render/?width=1045&height=438&_salt=1440535633.748&from=-1months&drawNullAsZero=true&tz=UTC&xFormat=%25Y-%25m-%25d&minXStep=&vtitle=Seconds&vtitleRight=Count&yMinLeft=0&yUnitSystem=si&yDivisors=&logBase=&title=Zuul%20gate%20processing%20speed&target=cactiStyle(alias(scale(zuul.pipeline.gate-and-submit.resident_time.upper%2C0.001)%2C%22Slowest%20(seconds)%22))&target=cactiStyle(alias(summarize(scale(zuul.pipeline.gate-and-submit.resident_time.upper%2C0.001)%2C%221d%22%2C%22max%22)%2C%20%22Daily%20max%20(seconds)%22))
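
As for the percentile idea, a rough sketch with a nearest-rank percentile over resident_time values in milliseconds (the sample values below are made-up toy data, not actual measurements):

```python
# Rough sketch: nearest-rank percentile over resident_time values (in ms).
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

resident_ms = [180_000, 240_000, 300_000, 420_000, 2_480_000]  # toy data
print("p50:", percentile(resident_ms, 50) / 1000.0, "s")
print("p95:", percentile(resident_ms, 95) / 1000.0, "s")  # equals the max with so few samples
print("max:", max(resident_ms) / 1000.0, "s")
```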

Conclusion

It seems Zuul provides potentially useful metrics. I would need help to figure out what we really want to measure / track / report.

A change staying 40 minutes in the gate could be because it had to be rerun a few times when changes ahead of it in the gate queue failed for some reason: Zuul dequeues the failing change and rebuilds all the changes behind it again.

The mediawiki core job that runs PHPUnit under Zend takes 7-10 minutes IIRC (that is available in Zuul statsd); making it faster would definitely improve this metric.

greg removed hashar as the assignee of this task.
greg set Security to None.
greg added subscribers: dduvall, zeljkofilipin.

I assigned Antoine to run point, but now I'm unassigning. I'll let @hashar, @dduvall, and @zeljkofilipin figure it out :)