
Automate identifying flaky tests
Closed, Declined · Public

Description

Problem: Flaky integration and selenium tests harm developer productivity. While we don't, as far as I know, have a consistent strategy for how to deal with them, a good starting point would be to automatically identify these tests and track the frequency of their occurrence.

One idea would be to query the Gerrit API for occurrences of the word "recheck". A "recheck" comment isn't a 100% reliable indicator of a flaky test, since it is also used to 1) trigger CI for a patch whose submitter isn't in the CI whitelist, and 2) re-check a patch whose Depends-On state has changed. But we could start by having a bot look at "recheck" comments on patches that don't have "Depends-On" and whose author is in the CI whitelist; for those patches the bot could create a Phabricator task (if one doesn't exist) and tag it with the not-yet-created #flaky-test tag. The bot could also track the frequency/volume of rechecks so we could see whether our integration / selenium tests are getting more or less flaky over time.
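As a rough illustration of that flow (not existing tooling: the whitelist, the Phabricator step, and the exact Gerrit query below are placeholders), such a bot could be sketched along these lines:

"""Rough sketch of the proposed recheck-tracking bot; names, the whitelist,
and the Phabricator step are placeholders, not existing tooling."""
import json
from collections import Counter

import requests

GERRIT = "https://gerrit.wikimedia.org/r"
# Placeholder: the real bot would read the actual CI whitelist used by Zuul.
CI_WHITELIST = {"alice", "bob"}


def gerrit_get(path, **params):
    # Gerrit REST responses start with a ")]}'" XSSI-protection line.
    resp = requests.get(GERRIT + path, params=params)
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


def recheck_candidates(window="1d"):
    """Recently rechecked changes that are neither Depends-On re-checks
    nor patches from authors outside the CI whitelist."""
    changes = gerrit_get(
        "/changes/",
        q=f"comment:recheck -message:Depends-On -age:{window}",
        o=["MESSAGES", "DETAILED_ACCOUNTS"],
    )
    for change in changes:
        if change.get("owner", {}).get("username") not in CI_WHITELIST:
            continue
        rechecks = [
            m for m in change.get("messages", [])
            if any(line.strip().lower().startswith("recheck")
                   for line in m.get("message", "").splitlines())
        ]
        if rechecks:
            yield change, rechecks


def run_once():
    counts = Counter()
    for change, rechecks in recheck_candidates():
        counts[change["project"]] += len(rechecks)
        # file_or_update_phab_task(change)  # stub: would call the Phabricator
        # conduit API to create/update a task tagged #flaky-test.
    for project, n in counts.most_common():
        print(f"{n:4d}  {project}")


if __name__ == "__main__":
    run_once()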

Count of "recheck" comments in the Gerrit database, up to June 2018:

SELECT YEAR(written_on) as 'year', MONTH(written_on) as 'month', count(*) as 'rechecks' 
FROM change_messages
WHERE message LIKE '%\nrecheck%'
GROUP BY YEAR(written_on), MONTH(written_on);
year  month  rechecks
2012     11        15
2012     12       115
2013      1        73
2013      2        42
2013      3        53
2013      4       158
2013      5        41
2013      6        64
2013      7        12
2013      8        12
2013      9        31
2013     10        28
2013     11        19
2013     12        30
2014      1        40
2014      2        46
2014      3        25
2014      4        82
2014      5       141
2014      6       147
2014      7       151
2014      8       174
2014      9       144
2014     10       202
2014     11       190
2014     12       215
2015      1       338
2015      2       399
2015      3       442
2015      4       354
2015      5       238
2015      6       340
2015      7       277
2015      8       258
2015      9       526
2015     10       425
2015     11       336
2015     12       509
2016      1       444
2016      2      1058
2016      3       643
2016      4       328
2016      5       419
2016      6       360
2016      7       402
2016      8       426
2016      9       384
2016     10       228
2016     11       400
2016     12       312
2017      1       329
2017      2       314
2017      3       319
2017      4       259
2017      5       273
2017      6       289
2017      7       297
2017      8       306
2017      9       279
2017     10       315
2017     11       297
2017     12       407
2018      1       680
2018      2       600
2018      3       562
2018      4       429
2018      5       410
2018      6       101

In T225193#5242462 @hashar wrote:

OpenStack had a similar need and wrote a reporter which collects and analyzes test results and creates a nice report.

https://www.elastic.co/blog/openstack-elastic-recheck-powered-elk-stack
https://docs.openstack.org/infra/elastic-recheck/readme.html#idea

The rough flow from 2014 (by Sean Dague)

ER_ELK_flow.png (499×972 px, 63 KB)

So what they do is send all the INFO logs and test results to an Elasticsearch cluster, which is then analyzed by an ad hoc tool. We also have a very old task about collecting logs/tests into Elasticsearch (T78705); RelEng talked about it recently but I am not sure we made any progress on that front (others might know better).

I am not suggesting we adopt that elastic-recheck setup exactly, but we should at least take inspiration from it.
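For reference, the core of the elastic-recheck idea is a library of known failure "signatures" that are run as queries against the indexed CI logs. A minimal sketch of that matching step, assuming logs are indexed in an Elasticsearch index called ci-logs (the index name, field names and example signature are illustrative only; the real tool keeps its signatures in per-bug YAML files):

"""Minimal sketch of elastic-recheck-style signature matching.
Assumes CI console logs are indexed in Elasticsearch under "ci-logs";
index name, field names and the signature below are illustrative only."""
import requests

ES = "http://elasticsearch.example.org:9200"

# Each known flaky failure gets a query linked to a bug report.
SIGNATURES = {
    "T123456-session-timeout": 'message:"SessionTimeoutError" AND build_status:FAILURE',
}


def match_signatures(build_uuid):
    """Return the signatures whose query matches the logs of a given CI build."""
    hits = []
    for name, query in SIGNATURES.items():
        body = {
            "query": {
                "bool": {
                    "must": [
                        {"query_string": {"query": query}},
                        {"term": {"build_uuid": build_uuid}},
                    ]
                }
            },
            "size": 0,
        }
        resp = requests.post(f"{ES}/ci-logs/_search", json=body)
        resp.raise_for_status()
        # Elasticsearch 7+ response shape: hits.total is an object.
        if resp.json()["hits"]["total"]["value"] > 0:
            hits.append(name)
    return hits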

Event Timeline

It looks like we have more than 16,500 patches with "recheck" in the comments since January 2014: https://gerrit.wikimedia.org/r/q/recheck,16500
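For anyone who wants to reproduce that number programmatically rather than through the UI, one way is to page through the Gerrit changes endpoint with the comment: operator (sketch only; it assumes the standard n/S paging parameters and the after: operator are available on this Gerrit version):

import json
import requests

GERRIT = "https://gerrit.wikimedia.org/r"


def count_changes(query, page_size=500):
    """Count changes matching a Gerrit search query by paging through results."""
    total, start = 0, 0
    while True:
        resp = requests.get(
            f"{GERRIT}/changes/",
            params={"q": query, "n": page_size, "S": start},
        )
        resp.raise_for_status()
        batch = json.loads(resp.text.split("\n", 1)[1])  # drop the ")]}'" prefix
        total += len(batch)
        if not batch or not batch[-1].get("_more_changes"):
            return total
        start += len(batch)


print(count_changes("comment:recheck after:2014-01-01"))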

Do we have a sense of what is causing these rechecks? @zeljkofilipin, do you have any insight into this? I know we've discussed it a little in the past.

I have edited the task description to show comments matching \nrecheck in the database, though comments are no longer stored in the database as of June 2018.

One can retrieve all comments straight from the git repositories. It is easier to do it directly on the Gerrit server though.

BEWARE: the following downloads every single patchset ever made to the repo:

$ git fetch origin '+refs/changes/*:refs/changes/*'
$ git log --oneline --glob='refs/changes/*/*/meta' -E --grep='^recheck' | wc -l
3215

That is for mediawiki/core. Formatting the git log by date and grouping by year/month:

mediawiki/core.git$ git log --oneline --glob='refs/changes/*/*/meta' -E --grep='^recheck' --pretty=%ci | cut -d- -f-2 | uniq -c
     13 2019-06
     58 2019-05
     75 2019-04
     60 2019-03
     38 2019-02
     45 2019-01
     27 2018-12
     35 2018-11
     66 2018-10
     36 2018-09
     67 2018-08
     61 2018-07
     65 2018-06
     49 2018-05
     63 2018-04
     76 2018-03
     60 2018-02
     81 2018-01
     50 2017-12
     51 2017-11
     43 2017-10
     17 2017-09
     36 2017-08
     31 2017-07
     27 2017-06
     53 2017-05
     54 2017-04
     33 2017-03
     47 2017-02
     23 2017-01
     67 2016-12
     63 2016-11
     17 2016-10
     48 2016-09
     25 2016-08
     49 2016-07
     50 2016-06
     91 2016-05
     57 2016-04
     58 2016-03
    144 2016-02
     82 2016-01
     99 2015-12
     40 2015-11
     41 2015-10
     72 2015-09
     28 2015-08
     39 2015-07
     39 2015-06
     29 2015-05
     59 2015-04
     49 2015-03
     53 2015-02
     37 2015-01
     20 2014-12
     25 2014-11
     18 2014-10
     17 2014-09
     61 2014-08
     26 2014-07
      9 2014-06
     10 2014-05
     27 2014-04
      5 2014-03
     13 2014-02
     14 2014-01
      5 2013-12
      6 2013-11
      4 2013-10
      4 2013-09
      2 2013-08
      7 2013-06
     10 2013-05
     36 2013-04
     19 2013-03
     21 2013-02
     40 2013-01
     30 2012-12
     10 2012-11

Just a small note, the comment operator can be used to exclude those (few) results that match the commit messages instead.

As for the numerical growth, I too feel like it's getting worse. But it'd be interesting to see the growth relative to the number of patch sets submitted, which I assume is growing as well, thus causing some amount of bias.
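One way to get that relative view would be to divide the monthly recheck counts by some measure of overall activity (e.g. patchsets uploaded or changes created per month, obtained from the Gerrit API or a similar git query). A rough sketch, assuming both tallies are saved in `uniq -c`-style files:

"""Sketch: normalize recheck counts by overall activity per month.
Both input files are assumed to be in `uniq -c` format ("  58 2019-05");
how the denominator (e.g. patchsets uploaded per month) is produced is
left open -- it could come from the Gerrit API or a similar git query."""
import sys


def read_counts(path):
    counts = {}
    with open(path) as f:
        for line in f:
            count, month = line.split()
            counts[month] = int(count)
    return counts


def main(recheck_file, activity_file):
    rechecks = read_counts(recheck_file)
    activity = read_counts(activity_file)
    for month in sorted(rechecks):
        if activity.get(month):
            rate = 100.0 * rechecks[month] / activity[month]
            print(f"{month}  {rechecks[month]:4d} / {activity[month]:5d}  = {rate:5.1f} rechecks per 100")


if __name__ == "__main__":
    main(*sys.argv[1:3])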

Just a small note, the comment operator can be used to exclude those (few) results that match the commit messages instead.

Thanks. Yes it's not scientific by any means.

Another category of "recheck" type operations are the times when a developer gives a +2, the build fails due to a flaky test, and the same developer removes the +2 and adds a +2 again, or another developer gives a +2.
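Those cases would not show up in a "recheck" search at all. They could in principle be spotted from the change messages too; a heuristic sketch, assuming Gerrit's conventional vote wording in change messages ("Code-Review+2", "Verified-1"), which would need verifying against real data before relying on it:

"""Heuristic sketch: find changes where a +2 was re-applied after CI failed.
Assumes Gerrit's conventional vote wording in change messages; the exact
strings should be checked against real data."""


def reapplied_plus_two(messages):
    """messages: list of change message strings for one change, oldest first."""
    saw_plus_two = saw_ci_failure = False
    for text in messages:
        if "Code-Review+2" in text:
            if saw_plus_two and saw_ci_failure:
                return True          # a second +2 after a CI failure
            saw_plus_two, saw_ci_failure = True, False
        elif saw_plus_two and "Verified-1" in text:
            saw_ci_failure = True    # the gate run failed after the +2
    return False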

hashar added a subscriber: awight.

Edited to incorporate a comment I made on another task ( /T225193#5242462 ) which points at OpenStack's recheck bot, which analyzes recheck / test failures to find common patterns. I have merged that task here ;)

Not sure there's much interest in pursuing this, so I am declining. Worth noting that rMW4ee71ea8096f: Allow a retry of flaky selenium test goes in the opposite direction of this task anyway, in that it retries a failed test once; that would make it much harder to rely on "recheck" comments to surface which tests are problematic.