
CI monitoring to detect flapping tests, especially in unrelated gated extensions
Closed, Duplicate (Public)


How to do this is an open question. Some thoughts:

  • Detect a gate-and-submit failure coming after a V+2
  • Report the number of "recheck" change comments. We might also compare the test results before and after a recheck; any difference counts as flapping.
  • Somehow track individual failing test names, and show a heat map of the most frequent failures.
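As a rough illustration of the recheck-counting idea, one could scan the change messages that Gerrit's REST API returns (GET /changes/{id}/messages) for "recheck" requests. This is a hypothetical sketch; the message data below is fabricated:

```python
# Hypothetical sketch: count "recheck" comments on a change, given the
# message list from Gerrit's REST API (GET /changes/{id}/messages).
def count_rechecks(messages):
    """Return how many change messages ask CI to re-run the tests."""
    return sum(
        1
        for m in messages
        if "recheck" in m.get("message", "").lower()
    )

# Fabricated example data:
messages = [
    {"message": "Patch Set 1: Verified-1\n\nBuild failed."},
    {"message": "recheck"},
    {"message": "Patch Set 1: Verified+2\n\nBuild succeeded."},
]
print(count_rechecks(messages))  # 1
```

A change with a high recheck count relative to its patchset count is a good candidate for the "flapping" bucket.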

Event Timeline

OpenStack had a similar need and wrote a reporter which collects and analyzes test results and creates a nice report.

The rough flow from 2014 (by Sean Dague):

ER_ELK_flow.png (499×972 px, 63 KB)

So what they do is send all the INFO logs and test results to an ElasticSearch cluster, where they are analyzed by an ad hoc tool. We also have a very old task about collecting logs/tests into ElasticSearch (T78705); RelEng talked about it recently, but I am not sure we made any progress on that front (others might know better).

I am not suggesting we adopt exactly that ElasticSearch setup, but we should at least take inspiration from it.
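To make the idea concrete: if each test result were indexed into ElasticSearch as a document like {"test_name": ..., "status": ..., "timestamp": ...} (an assumed schema, not anything we have today), the "heat map of most frequent failures" would fall out of a standard terms aggregation. A sketch of the query body:

```python
# Hypothetical sketch: build an ElasticSearch query body that ranks the
# most frequently failing test names over a recent time window.
# The index schema ("test_name", "status", "timestamp") is assumed.
def most_frequent_failures_query(days=30, size=20):
    """Return an ES aggregation body for the top failing tests."""
    return {
        "size": 0,  # we only want the aggregation, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": "FAIL"}},
                    {"range": {"timestamp": {"gte": f"now-{days}d"}}},
                ]
            }
        },
        "aggs": {
            "top_failures": {"terms": {"field": "test_name", "size": size}}
        },
    }

query = most_frequent_failures_query(days=7)
print(query["aggs"]["top_failures"]["terms"]["size"])  # 20
```

The resulting bucket counts are exactly the heat-map data from the third bullet above.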

Another thought: Zuul pipelines can have several reporters. The only one we use for now is the Gerrit reporter, which sends a review back to Gerrit (eventually with label votes, and possibly submitting the change). There is another one that reports over SMTP.

Zuul 2.5.2 (we have 2.5.1) comes with a MySQL reporter.

We can probably borrow it or write an ElasticSearch reporter of some sort.
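Whatever the backend, such a reporter boils down to emitting one document per finished job. A hypothetical sketch of that document (the field names and the call shapes are made up for illustration; a real implementation would hook into Zuul's reporter interface):

```python
# Hypothetical sketch of what an ElasticSearch reporter could send per build.
# Field names are invented; nothing here is Zuul's actual reporter API.
import json
import time

def build_result_doc(change, pipeline, job, result, duration_s):
    """Assemble one JSON document describing a finished CI job."""
    return {
        "change": change,          # e.g. Gerrit change number
        "pipeline": pipeline,      # e.g. "gate-and-submit"
        "job": job,                # e.g. "mwext-phpunit" (made-up job name)
        "result": result,          # "SUCCESS" / "FAILURE"
        "duration_s": duration_s,
        "@timestamp": int(time.time()),
    }

doc = build_result_doc(12345, "gate-and-submit", "mwext-phpunit", "FAILURE", 312)
print(json.dumps(doc, sort_keys=True))
# The reporter would then ship this document to the backend, e.g. with
# elasticsearch-py: es.index(index="ci-builds", body=doc)
```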

The "recheck bot" is an interesting twist: I like the idea of automatically rechecking when there was an external error (e.g. a network glitch), but rechecking due to flapping tests makes me uncomfortable. Nicer would be a fine-grained mask for interpreting test results: basically a way to quickly flag certain tests as broken at the CI level, without having to edit and merge code. This would let us provisionally V+2 patches affected only by flappers.
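A minimal sketch of such a mask, assuming the flaky list is maintained at the CI level (the test names below are made up): failures on flagged tests are set aside, and only the remaining failures veto the change.

```python
# Hypothetical sketch of a "fine-grained mask" over test results.
# FLAKY_TESTS would be maintained in CI config, not in the code under test.
FLAKY_TESTS = {"WikiPageTest::testRandomFlap"}  # made-up test name

def effective_verdict(failed_tests, flaky=FLAKY_TESTS):
    """Split a run's failures into (blocking, masked-as-flaky) lists."""
    blocking = sorted(set(failed_tests) - flaky)
    masked = sorted(set(failed_tests) & flaky)
    return blocking, masked

blocking, masked = effective_verdict(
    ["WikiPageTest::testRandomFlap", "TitleTest::testExists"]
)
print(blocking)  # ['TitleTest::testExists'] -> still vetoes the change
print(masked)    # ['WikiPageTest::testRandomFlap'] -> provisionally ignored
```

A change whose only failures are masked would get the provisional V+2; any blocking failure keeps the normal veto.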

A database of flapping tests sounds awesome, and is probably the appropriate technology, considering that we have to correlate failures over long periods of time in order to see the patterns.
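The core query such a database needs is simple. As a hypothetical sketch (fabricated schema and data, using SQLite in memory): a test that reports more than one distinct status for the same change/patchset, e.g. across a recheck, is flapping.

```python
# Hypothetical sketch: a tiny results database in which a test that both
# passed and failed on the same patchset counts as flapping.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE results (
           change_id INTEGER, patchset INTEGER,
           test_name TEXT, status TEXT)"""
)
rows = [  # fabricated data: the same patchset tested twice (a recheck)
    (101, 1, "TitleTest::testExists", "PASS"),
    (101, 1, "WikiPageTest::testRandomFlap", "FAIL"),
    (101, 1, "TitleTest::testExists", "PASS"),
    (101, 1, "WikiPageTest::testRandomFlap", "PASS"),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# A test flaps if it has more than one distinct status on one patchset.
flappers = [
    name
    for (name,) in conn.execute(
        """SELECT test_name FROM results
           GROUP BY change_id, patchset, test_name
           HAVING COUNT(DISTINCT status) > 1"""
    )
]
print(flappers)  # ['WikiPageTest::testRandomFlap']
```

Aggregating the same query over months of results would surface the long-period patterns mentioned above.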

hashar closed this task as a duplicate of T224673: Automate identifying flaky tests.

I have merged this task into the very similar T224673, which has some basic analysis based on the number of rechecks.