Automate identifying flaky tests
Open, Normal, Public

Description

Problem: Flaky integration and Selenium tests harm developer productivity. As far as I know, we don't have a consistent strategy for dealing with them, so a good starting point would be to identify these tests automatically and track how frequently they occur.

One idea would be to query the Gerrit API for comments containing the word "recheck". A recheck isn't a 100% indicator of a flaky test, since it can also be used for 1) checking a patch when the submitter isn't in the CI whitelist, or 2) checking a patch when the Depends-On state has changed. But we could start by having a bot look at "recheck" comments on patches that don't have "Depends-On" and whose author is in the CI whitelist; for those patches the bot could create a Phabricator task (if one doesn't exist) and tag it with the not-yet-created #flaky-test tag. The bot could also keep track of the frequency/volume of rechecks so we could see whether our integration / Selenium tests are getting more or less flaky over time.
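The filtering heuristic above could be sketched roughly as follows. This is only an illustration: the dictionary layout (`owner`, `commit_message`, `comments`), the `CI_WHITELIST` set, and the function names are assumptions made for this example, not the actual format of a Gerrit API response.

```python
import re

# Hypothetical set of accounts allowed to trigger CI; the real whitelist
# lives in the CI configuration and is not reproduced here.
CI_WHITELIST = {"alice", "bob"}

# A comment counts as a recheck when one of its lines starts with "recheck".
RECHECK_RE = re.compile(r"^recheck\b", re.IGNORECASE | re.MULTILINE)
# A Depends-On footer in the commit message ties a change to another change.
DEPENDS_ON_RE = re.compile(r"^Depends-On:", re.MULTILINE)


def count_rechecks(change):
    """Return how many comments on the change look like a recheck."""
    return sum(1 for c in change["comments"] if RECHECK_RE.search(c))


def is_flaky_candidate(change, whitelist=CI_WHITELIST):
    """Heuristic: whitelisted author, no Depends-On, at least one recheck."""
    if change["owner"] not in whitelist:
        return False  # the recheck was probably needed just to run CI at all
    if DEPENDS_ON_RE.search(change["commit_message"]):
        return False  # the recheck may follow a change in the dependency
    return count_rechecks(change) > 0


# Made-up changes, one per case discussed above.
changes = [
    {"owner": "alice", "commit_message": "Fix bug\n", "comments": ["recheck"]},
    {"owner": "carol", "commit_message": "Fix bug\n", "comments": ["recheck"]},
    {"owner": "bob", "commit_message": "Fix\n\nDepends-On: I123\n", "comments": ["recheck"]},
]
candidates = [c["owner"] for c in changes if is_flaky_candidate(c)]
print(candidates)  # ['alice']
```

Only the first change survives the filter; the other two are rechecks that the heuristic attributes to the whitelist or to Depends-On rather than to a flaky test.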

Monthly count of "recheck" comments in the Gerrit database, up to June 2018:

SELECT YEAR(written_on) as 'year', MONTH(written_on) as 'month', count(*) as 'rechecks' 
FROM change_messages
WHERE message LIKE '%\nrecheck%'
GROUP BY YEAR(written_on), MONTH(written_on);
year  month  rechecks
2012     11        15
2012     12       115
2013      1        73
2013      2        42
2013      3        53
2013      4       158
2013      5        41
2013      6        64
2013      7        12
2013      8        12
2013      9        31
2013     10        28
2013     11        19
2013     12        30
2014      1        40
2014      2        46
2014      3        25
2014      4        82
2014      5       141
2014      6       147
2014      7       151
2014      8       174
2014      9       144
2014     10       202
2014     11       190
2014     12       215
2015      1       338
2015      2       399
2015      3       442
2015      4       354
2015      5       238
2015      6       340
2015      7       277
2015      8       258
2015      9       526
2015     10       425
2015     11       336
2015     12       509
2016      1       444
2016      2      1058
2016      3       643
2016      4       328
2016      5       419
2016      6       360
2016      7       402
2016      8       426
2016      9       384
2016     10       228
2016     11       400
2016     12       312
2017      1       329
2017      2       314
2017      3       319
2017      4       259
2017      5       273
2017      6       289
2017      7       297
2017      8       306
2017      9       279
2017     10       315
2017     11       297
2017     12       407
2018      1       680
2018      2       600
2018      3       562
2018      4       429
2018      5       410
2018      6       101

Event Timeline

kostajh created this task. May 30 2019, 2:38 PM

It looks like we have more than 16,500 patches with "recheck" in the comments since January 2014: https://gerrit.wikimedia.org/r/q/recheck,16500

Do we have a sense of what is causing these rechecks? @zeljkofilipin, do you have any insight into this? I know we've discussed it a little in the past.

hashar updated the task description. Jun 7 2019, 10:46 AM

I have edited the task description to show the comments matching \nrecheck in the database, though comments have no longer been stored in the database since June 2018.

One can retrieve all comments straight from the git repositories. It is easier to do it directly on the Gerrit server though.

BEWARE: the following would download EVERY single patchset ever made to the repo:

$ git fetch origin '+refs/changes/*:refs/changes/*'
$ git log --oneline --glob='refs/changes/*/*/meta' -E --grep '^recheck' | wc -l
3215

That is for mediawiki/core. Formatting the git log by date and grouping by year/month:

mediawiki/core.git$ git log --oneline --glob=refs/changes/*/*/meta -E --grep '^recheck' --pretty=%ci|cut -d- -f-2|uniq -c
     13 2019-06
     58 2019-05
     75 2019-04
     60 2019-03
     38 2019-02
     45 2019-01
     27 2018-12
     35 2018-11
     66 2018-10
     36 2018-09
     67 2018-08
     61 2018-07
     65 2018-06
     49 2018-05
     63 2018-04
     76 2018-03
     60 2018-02
     81 2018-01
     50 2017-12
     51 2017-11
     43 2017-10
     17 2017-09
     36 2017-08
     31 2017-07
     27 2017-06
     53 2017-05
     54 2017-04
     33 2017-03
     47 2017-02
     23 2017-01
     67 2016-12
     63 2016-11
     17 2016-10
     48 2016-09
     25 2016-08
     49 2016-07
     50 2016-06
     91 2016-05
     57 2016-04
     58 2016-03
    144 2016-02
     82 2016-01
     99 2015-12
     40 2015-11
     41 2015-10
     72 2015-09
     28 2015-08
     39 2015-07
     39 2015-06
     29 2015-05
     59 2015-04
     49 2015-03
     53 2015-02
     37 2015-01
     20 2014-12
     25 2014-11
     18 2014-10
     17 2014-09
     61 2014-08
     26 2014-07
      9 2014-06
     10 2014-05
     27 2014-04
      5 2014-03
     13 2014-02
     14 2014-01
      5 2013-12
      6 2013-11
      4 2013-10
      4 2013-09
      2 2013-08
      7 2013-06
     10 2013-05
     36 2013-04
     19 2013-03
     21 2013-02
     40 2013-01
     30 2012-12
     10 2012-11
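
The same year/month grouping could also be done in Python, for example when the counts need to feed a bot or a dashboard. This is a minimal sketch; the function name and the sample dates are made up for illustration, and the input is assumed to be lines in the `%ci` format used above.

```python
from collections import Counter


def rechecks_per_month(commit_dates):
    """Count dates per 'YYYY-MM' prefix, regardless of input order."""
    return Counter(line.strip()[:7] for line in commit_dates if line.strip())


# Made-up sample of `git log --pretty=%ci` output lines.
sample = [
    "2019-06-03 10:15:00 +0000",
    "2019-06-01 08:00:00 +0200",
    "2019-05-30 14:38:00 +0000",
]
for month, count in sorted(rechecks_per_month(sample).items(), reverse=True):
    print(f"{count:7d} {month}")
```

Unlike `uniq -c`, which only merges adjacent identical lines, a `Counter` does not depend on the log being sorted by date.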

Krinkle added a comment. Edited Jun 11 2019, 2:16 PM

Just a small note, the comment operator can be used to exclude those (few) results that match the commit messages instead.

As for the numerical growth, I too feel like it's getting worse. But it'd be interesting to see the growth relative to the number of patch sets submitted, which I assume is growing as well, thus causing some amount of bias.
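
As a rough sketch of that normalization, the recheck counts could be expressed per 100 patch sets. The function name is made up, and the patch set totals below are HYPOTHETICAL placeholders, since nobody has measured them in this task; only the two recheck counts come from the mediawiki/core numbers above.

```python
def rechecks_per_100_patchsets(rechecks, patchsets):
    """Return rechecks per 100 patch sets for each month present in both inputs."""
    rates = {}
    for month, n_rechecks in rechecks.items():
        n_patchsets = patchsets.get(month)
        if not n_patchsets:
            continue  # skip months with missing or zero patch set data
        rates[month] = round(100.0 * n_rechecks / n_patchsets, 2)
    return rates


# Recheck counts taken from the mediawiki/core numbers in this task.
rechecks = {"2019-04": 75, "2019-05": 58}
# HYPOTHETICAL patch set totals, for illustration only.
patchsets = {"2019-04": 1500, "2019-05": 1450}
print(rechecks_per_100_patchsets(rechecks, patchsets))
# {'2019-04': 5.0, '2019-05': 4.0}
```

A flat or falling rate alongside rising absolute counts would suggest the growth is mostly due to volume rather than to tests getting flakier.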

> Just a small note, the comment operator can be used to exclude those (few) results that match the commit messages instead.

Thanks. Yes it's not scientific by any means.

Another category of "recheck"-type operations is when a developer gives a +2, the build fails due to a flaky test, and then either the same developer removes the +2 and adds it again, or another developer gives a +2.
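
A bot could try to spot this pattern from the vote history of a change. The sketch below is only an illustration: the `(kind, value)` event tuples are a hypothetical simplification, not Gerrit's actual event schema, and the function does not check whether a new patch set was uploaded between the two +2 votes.

```python
def count_plus_two_retries(events):
    """Count Code-Review +2 votes that follow an earlier +2 and a failed build.

    `events` is a chronological list of (kind, value) tuples; this layout is
    a hypothetical simplification, not Gerrit's actual event schema.
    """
    retries = 0
    seen_plus_two = False
    failed_since_plus_two = False
    for kind, value in events:
        if kind == "verified" and value < 0 and seen_plus_two:
            failed_since_plus_two = True
        elif kind == "code-review" and value == 2:
            if failed_since_plus_two:
                retries += 1
                failed_since_plus_two = False
            seen_plus_two = True
    return retries


events = [
    ("code-review", 2),   # first +2, triggers the gate build
    ("verified", -1),     # build fails
    ("code-review", 0),   # +2 removed
    ("code-review", 2),   # +2 added again: an implicit recheck
    ("verified", 2),      # build passes this time
]
print(count_plus_two_retries(events))  # 1
```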