
CI monitoring to detect flapping tests, especially in unrelated gated extensions
Open, Needs Triage, Public

Description

How to do this is an open question. Some thoughts:

  • Detect a gate-and-submit failure coming after a V+2
  • Report the number of "recheck" change comments. We might also compare the test results before and after recheck, and any difference counts as flapping.
  • Somehow track individual failing test names, and show a heat map of the most frequent failures.
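The second and third ideas could be sketched roughly as follows. This is only an illustration under assumptions: the function names and the input format (a dict mapping test name to "pass" or "fail" for each CI run) are hypothetical and not an existing Zuul or Jenkins API.

```python
from collections import Counter


def flapping_tests(before, after):
    """Return names of tests whose outcome differs between two runs
    of the same patch set (e.g. before and after a "recheck").

    Both arguments are hypothetical dicts: test name -> "pass" / "fail".
    Tests present in only one of the runs are ignored.
    """
    shared = before.keys() & after.keys()
    return sorted(name for name in shared if before[name] != after[name])


def failure_counts(runs):
    """Count how often each test failed across many runs.

    The resulting Counter could feed a heat map of the most
    frequent failures.
    """
    counts = Counter()
    for results in runs:
        counts.update(name for name, outcome in results.items()
                      if outcome == "fail")
    return counts


# Example: testB fails on the first run and passes after a recheck.
first_run = {"testA": "pass", "testB": "fail", "testC": "pass"}
after_recheck = {"testA": "pass", "testB": "pass", "testC": "pass"}

print(flapping_tests(first_run, after_recheck))
print(failure_counts([first_run, after_recheck]).most_common(5))
```

Comparing only runs of the same patch set matters here: a different outcome on a different patch set may simply mean the code changed.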

Event Timeline

awight created this task. Jun 6 2019, 10:33 AM
Restricted Application added a subscriber: Aklapper. Jun 6 2019, 10:33 AM
kostajh added a subscriber: kostajh. Jun 7 2019, 2:17 AM
hashar added a subscriber: hashar. Jun 7 2019, 8:01 AM

OpenStack had a similar need, and they wrote a reporter which collects and analyzes test results and creates a nice report.

https://www.elastic.co/blog/openstack-elastic-recheck-powered-elk-stack
https://docs.openstack.org/infra/elastic-recheck/readme.html#idea

The rough flow from 2014 (by Sean Dague)

So what they do is send all the INFO logs and test results to an ElasticSearch cluster, where they are analyzed by an ad hoc tool. We also have a very old task about collecting logs and test results into ElasticSearch, T78705. Releng talked about it recently, but I am not sure we made any progress on that front (others might know better).

I am not suggesting we adopt exactly that elastic-recheck thing, but we should at least take inspiration from it?

hashar added a comment. Jun 7 2019, 8:18 AM

Another thought: Zuul pipelines can have several reporters. The only one we use for now is the Gerrit reporter, which sends a review back to Gerrit (possibly with label votes and submitting the change). There is another one to report over SMTP.

Zuul 2.5.2 (we have 2.5.1) comes with a MySQL reporter:

https://opendev.org/zuul/zuul/src/tag/2.5.2/doc/source/reporters.rst#sql
https://opendev.org/zuul/zuul/src/tag/2.5.2/zuul/reporter/sql.py

We can probably borrow it or write an ElasticSearch reporter of some sort.

The "recheck bot" is an interesting twist. I like the idea of automatically rechecking if there was an external error, e.g. a network glitch, but it makes me uncomfortable to think about rechecking due to flapping tests. What would be nicer is a fine-grained mask for interpreting test results: basically a way to quickly flag certain tests as broken at the CI level, without having to edit and merge code. This would let us provisionally V+2 patches that are only affected by flappers.
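To make the mask idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name, the set of known flappers (which would live in some CI-level configuration rather than in the repository), and the dict-based result format are all hypothetical.

```python
def effective_verdict(results, known_flappers):
    """Interpret raw test results through a mask of known flappers.

    results        -- hypothetical dict: test name -> "pass" / "fail"
    known_flappers -- hypothetical set of test names flagged as broken
                      at the CI level

    Returns (verdict, masked_failures, real_failures). The verdict is
    "fail" only if at least one failing test is NOT a known flapper.
    """
    failures = {name for name, outcome in results.items()
                if outcome == "fail"}
    masked = failures & known_flappers
    real = failures - known_flappers
    verdict = "fail" if real else "pass"
    return verdict, sorted(masked), sorted(real)


# Example: one flagged flapper fails, plus one genuine failure.
results = {"testA": "fail", "testB": "fail", "testC": "pass"}

print(effective_verdict(results, {"testA"}))
print(effective_verdict(results, {"testA", "testB"}))
```

Returning the masked failures alongside the verdict keeps them visible in the report, so a provisional V+2 never hides which flappers were ignored.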

A database of flapping tests sounds awesome, and probably an appropriate technology considering that we have to correlate failures over long periods of time in order to see the patterns.
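As a rough sketch of what such a database might look like, the following uses Python's built-in sqlite3 module. The schema, table name, and helper functions are assumptions made up for illustration. They are not the schema of the Zuul SQL reporter. A test counts as flapping here when the same patch set has recorded more than one distinct outcome for it.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) results table."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS test_result (
            change_id TEXT    NOT NULL,
            patchset  INTEGER NOT NULL,
            run       INTEGER NOT NULL,  -- 1 = first run, 2 = first recheck, ...
            test_name TEXT    NOT NULL,
            outcome   TEXT    NOT NULL   -- "pass" or "fail"
        )""")
    return conn


def record(conn, change_id, patchset, run, results):
    """Store one CI run; results maps test name -> outcome."""
    conn.executemany(
        "INSERT INTO test_result VALUES (?, ?, ?, ?, ?)",
        [(change_id, patchset, run, name, outcome)
         for name, outcome in results.items()])
    conn.commit()


def flappers(conn):
    """Return (test_name, number of patch sets it flapped on),
    most frequent first."""
    return conn.execute("""
        SELECT test_name, COUNT(*) AS n
        FROM (
            SELECT test_name, change_id, patchset
            FROM test_result
            GROUP BY test_name, change_id, patchset
            HAVING COUNT(DISTINCT outcome) > 1
        )
        GROUP BY test_name
        ORDER BY n DESC, test_name""").fetchall()


# Example: testA fails, then passes on recheck of the same patch set.
conn = open_db()
record(conn, "I123", 1, 1, {"testA": "fail", "testB": "pass"})
record(conn, "I123", 1, 2, {"testA": "pass", "testB": "pass"})
print(flappers(conn))
```

Keeping one row per test per run is what lets failures be correlated over long periods. Aggregates such as the most frequent flappers can then be computed with plain queries.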