Problem: Flaky integration and Selenium tests harm developer productivity. While we don't, as far as I know, have a consistent strategy for dealing with them, a good starting point would be to automatically identify these tests and track how often they fail.
One idea would be to query the Gerrit API for comments containing the word "recheck". A recheck isn't a 100% indicator of a flaky test, since it is also used for 1) checking a patch whose submitter isn't in the CI whitelist, and 2) checking a patch again after its Depends-On state changed. But we could start by having a bot look at "recheck" comments on patches which don't have "Depends-On" and whose author is in the CI whitelist; for those patches the bot could create a phab task (if one doesn't exist) and tag it with the not-yet-created #flaky-test tag. The bot could also keep track of the frequency/volume of rechecks, so we could see whether our integration / Selenium tests are getting less or more flaky over time.
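The filtering described above could be sketched roughly as follows. This is only an illustration, not an existing bot: the dictionaries loosely mirror what the Gerrit REST API returns for a change with its messages, but the flattened `commit_message` key, the `ci_whitelist` set and both function names are assumptions made for the example. Unlike a plain substring search, the regular expression also matches a comment that starts with "recheck".

```python
import re
from collections import Counter

# A comment counts as a recheck if one of its lines starts with "recheck".
RECHECK_RE = re.compile(r"^recheck\b", re.IGNORECASE | re.MULTILINE)
DEPENDS_ON_RE = re.compile(r"^Depends-On:", re.MULTILINE)


def flaky_recheck_months(change, ci_whitelist):
    """Return 'YYYY-MM' for each recheck comment that hints at a flaky test."""
    # Exclusion 1: the submitter is not in the CI whitelist, so "recheck"
    # was probably needed just to get the patch tested at all.
    if change["owner"]["email"] not in ci_whitelist:
        return []
    # Exclusion 2: the patch has a Depends-On footer, so "recheck" may
    # only reflect a change in the state of its dependency.
    if DEPENDS_ON_RE.search(change["commit_message"]):
        return []
    return [
        msg["date"][:7]  # Gerrit timestamps start with "YYYY-MM-DD"
        for msg in change["messages"]
        if RECHECK_RE.search(msg["message"])
    ]


def count_rechecks(changes, ci_whitelist):
    """Aggregate candidate flaky-test rechecks per month over many changes."""
    counts = Counter()
    for change in changes:
        counts.update(flaky_recheck_months(change, ci_whitelist))
    return counts
```

Creating or updating the phab task would hang off the same filter; it is left out here because it depends on how the #flaky-test tag ends up being set up.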
Number of "recheck" comments per month in the Gerrit database, up to June 2018:
SELECT YEAR(written_on) as 'year', MONTH(written_on) as 'month', count(*) as 'rechecks' FROM change_messages WHERE message LIKE '%\nrecheck%' GROUP BY YEAR(written_on), MONTH(written_on);
| year | month | rechecks |
|---|---|---|
| 2012 | 11 | 15 |
| 2012 | 12 | 115 |
| 2013 | 1 | 73 |
| 2013 | 2 | 42 |
| 2013 | 3 | 53 |
| 2013 | 4 | 158 |
| 2013 | 5 | 41 |
| 2013 | 6 | 64 |
| 2013 | 7 | 12 |
| 2013 | 8 | 12 |
| 2013 | 9 | 31 |
| 2013 | 10 | 28 |
| 2013 | 11 | 19 |
| 2013 | 12 | 30 |
| 2014 | 1 | 40 |
| 2014 | 2 | 46 |
| 2014 | 3 | 25 |
| 2014 | 4 | 82 |
| 2014 | 5 | 141 |
| 2014 | 6 | 147 |
| 2014 | 7 | 151 |
| 2014 | 8 | 174 |
| 2014 | 9 | 144 |
| 2014 | 10 | 202 |
| 2014 | 11 | 190 |
| 2014 | 12 | 215 |
| 2015 | 1 | 338 |
| 2015 | 2 | 399 |
| 2015 | 3 | 442 |
| 2015 | 4 | 354 |
| 2015 | 5 | 238 |
| 2015 | 6 | 340 |
| 2015 | 7 | 277 |
| 2015 | 8 | 258 |
| 2015 | 9 | 526 |
| 2015 | 10 | 425 |
| 2015 | 11 | 336 |
| 2015 | 12 | 509 |
| 2016 | 1 | 444 |
| 2016 | 2 | 1058 |
| 2016 | 3 | 643 |
| 2016 | 4 | 328 |
| 2016 | 5 | 419 |
| 2016 | 6 | 360 |
| 2016 | 7 | 402 |
| 2016 | 8 | 426 |
| 2016 | 9 | 384 |
| 2016 | 10 | 228 |
| 2016 | 11 | 400 |
| 2016 | 12 | 312 |
| 2017 | 1 | 329 |
| 2017 | 2 | 314 |
| 2017 | 3 | 319 |
| 2017 | 4 | 259 |
| 2017 | 5 | 273 |
| 2017 | 6 | 289 |
| 2017 | 7 | 297 |
| 2017 | 8 | 306 |
| 2017 | 9 | 279 |
| 2017 | 10 | 315 |
| 2017 | 11 | 297 |
| 2017 | 12 | 407 |
| 2018 | 1 | 680 |
| 2018 | 2 | 600 |
| 2018 | 3 | 562 |
| 2018 | 4 | 429 |
| 2018 | 5 | 410 |
| 2018 | 6 | 101 |
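
To judge whether things are getting more or less flaky over time, the monthly counts could be smoothed before comparing them. A minimal sketch with a trailing three-month average, using the January to May 2018 rows from the table; June is left out on the assumption that the data for that month is incomplete, and the function name is made up for the example.

```python
def moving_average(values, window=3):
    """Trailing moving average; yields one value per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# January to May 2018, copied from the table above.
rechecks_2018 = [680, 600, 562, 429, 410]
print(moving_average(rechecks_2018))
```

Raw recheck counts also grow with overall patch volume, so dividing by the number of patch sets uploaded per month would probably give a fairer picture than the absolute numbers.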
In T225193#5242462 @hashar wrote:
OpenStack had a similar need and they wrote a reporter which collects and analyzes test results and creates a nice report.
https://www.elastic.co/blog/openstack-elastic-recheck-powered-elk-stack
https://docs.openstack.org/infra/elastic-recheck/readme.html#idea
The rough flow from 2014 (by Sean Dague)
So what they do is that all the INFO logs and test results are sent to an ElasticSearch cluster and then analyzed by an ad hoc tool. We also have a very old task about collecting logs/tests into ElasticSearch (T78705). Releng talked about it recently, but I am not sure we made any progress on that front (others might know better).
I am not suggesting we adopt exactly that elastic-recheck thing, but we should at least get inspiration from it?
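
The core idea there, matching failed build logs against a list of known failure signatures, can be illustrated without ElasticSearch. The sketch below is an in-memory stand-in, not how elastic-recheck itself is implemented; the signature IDs, the patterns and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical signatures: tracking ID -> pattern identifying a known flaky failure.
KNOWN_FAILURES = {
    "EXAMPLE-1": re.compile(r"Timed out waiting for element"),
    "EXAMPLE-2": re.compile(r"Connection refused .* port 4444"),
}


def classify_failure(log_lines, signatures=KNOWN_FAILURES):
    """Return the ID of the first matching signature, or None if unclassified."""
    for task_id, pattern in signatures.items():
        if any(pattern.search(line) for line in log_lines):
            return task_id
    return None


def summarize(failed_builds, signatures=KNOWN_FAILURES):
    """Count failed builds per signature; the None key collects unknown failures."""
    return Counter(classify_failure(lines, signatures) for lines in failed_builds)
```

A large `None` bucket would mean many failures are not yet tied to a known flaky test, which is roughly the signal such a report is meant to surface.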
