# Use Mann-Whitney U test to compare sampled times in Fresnel

Assigned To
 Krinkle
Authored By
 Tchanders, May 21 2019, 10:53 AM (UTC+0)
Referenced Files
 F29183224: image.png, May 21 2019, 10:53 AM (UTC+0)
 F29183170: image.png, May 21 2019, 10:53 AM (UTC+0)
# Description

Fresnel times certain tasks before and after a commit, then estimates whether the commit caused the times to increase.

In other words, given two sets of numbers, it asks: are the numbers in the second set higher than the numbers in the first set? A commonly used statistical test for this type of question is the Mann-Whitney U test, a simple non-parametric test that makes no assumption about how the data are distributed.
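As a sketch, the one-sided test can be implemented in a few lines of Python (Fresnel itself is a Node.js tool, so this is illustrative only; it uses the normal approximation rather than exact tables, which would be slightly more accurate at n = 7):

```python
import math

def ranks_with_ties(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(before, after):
    """One-sided test of 'after tends to be larger than before'.
    Returns (U, p) via the normal approximation with continuity
    correction."""
    n1, n2 = len(before), len(after)
    ranks = ranks_with_ties(list(before) + list(after))
    r2 = sum(ranks[n1:])                 # rank sum of the "after" sample
    u2 = r2 - n2 * (n2 + 1) / 2          # U statistic for "after"
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u2 - mu - 0.5) / sigma
    p = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z >= z)
    return u2, p
```

For a clear regression such as `mann_whitney_u([1,2,3,4,5,6,7], [8,9,10,11,12,13,14])`, U is 49 (the maximum for two samples of 7) and p is well below 0.005.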

We compared the Mann-Whitney U test (MWU) to the current test, which checks whether the standard deviations of the two sets overlap (SD). MWU seems to be better at detecting small time increases; SD mostly detects the extreme cases. (See analysis below.)
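The SD test is not spelled out in detail here; a plausible reconstruction of an interval-overlap check (Fresnel's actual `diffStdev()` may differ) looks like:

```python
import statistics

def sd_regressed(before, after):
    """Flag a regression only when the mean +/- stdev intervals of the
    two samples do not overlap and 'after' is slower.  NOTE: this is a
    hypothetical reconstruction of the SD check, not Fresnel's exact
    logic."""
    mb, sb = statistics.mean(before), statistics.pstdev(before)
    ma, sa = statistics.mean(after), statistics.pstdev(after)
    return (ma - sa) > (mb + sb)
```

Because the whole intervals must separate, a small but real shift (e.g. `[10, 11, 12]` vs `[11, 12, 13]`) is never flagged, which matches the observation that SD mostly detects extreme cases.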

Currently, the data are often biased, showing an increase in the second set that isn't really caused by the new commit. T223839 will try to fix this.

Until the data are improved, it is arguably better to report only the extreme cases, to avoid inundating people with incorrect reports. This means we will continue to miss smaller performance hits, but in return people won't lose faith in the reports.

Recommendations

• Replace the current SD test with MWU.
• Until the data are improved, require a very small p-value (e.g. 0.005) to report a failure, so that only the extreme cases are reported. With better data, we can make the threshold less stringent.

Analysis on simulated data

Method:

1. generate random numbers, representing "before" and "after" data
2. add a constant k to the "after" data to represent a change in performance (negative k represents an improvement, positive k a deterioration, and 0 no change)
3. apply the SD and the MWU tests to the data, reporting a failure for MWU if the p-value is < 0.05
4. repeat 1000 times for each k
5. repeat for different sample sizes (n = the number of times each task is timed; currently 7 in production)
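The procedure above can be sketched end to end. The N(100, 5) timing distribution and both test implementations below are illustrative assumptions, not the analysis's actual code:

```python
import math
import random

def mwu_p(before, after):
    """One-sided Mann-Whitney U p-value (normal approximation, no tie
    correction; ties are negligible for continuous simulated data)."""
    n1, n2 = len(before), len(after)
    combined = sorted(before + after)
    r2 = sum(combined.index(x) + 1 for x in after)  # rank sum of "after"
    u2 = r2 - n2 * (n2 + 1) / 2
    mu, sigma = n1 * n2 / 2, math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u2 - mu - 0.5) / sigma
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sd_flags(before, after):
    """Hypothetical SD-overlap check: intervals must fully separate."""
    mb, ma = sum(before) / len(before), sum(after) / len(after)
    sb = math.sqrt(sum((x - mb) ** 2 for x in before) / len(before))
    sa = math.sqrt(sum((x - ma) ** 2 for x in after) / len(after))
    return (ma - sa) > (mb + sb)

def simulate(k, n=7, trials=1000, alpha=0.05):
    """Fraction of trials in which each test flags a regression when the
    'after' sample is shifted by k."""
    sd = mwu = 0
    for _ in range(trials):
        before = [random.gauss(100, 5) for _ in range(n)]
        after = [random.gauss(100, 5) + k for _ in range(n)]
        sd += sd_flags(before, after)
        mwu += mwu_p(before, after) < alpha
    return sd / trials, mwu / trials
```

At k = 0 the MWU failure rate should sit near the nominal alpha, while a large positive k should drive both tests toward 100%.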

Results:
Ideally a test should report 0% failures for k <= 0 and 100% failures for k > 0. MWU comes closer to this than SD (although it also reports a few more spurious failures). The higher sample size shows that SD is more biased against detecting small differences.

Analysis on real data

Method:

1. manually label 154 real commits according to whether they are expected to harm performance
2. for each commit, time how long it takes to perform 8 tasks, before and after the commit (time 7 repetitions of each task)
3. apply the SD and MWU tests; if they find that one or more tasks take longer after the commit, report a failure for the commit
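The commit-level decision in step 3 reduces to "fail if any task regresses". A minimal sketch, where `p_value` stands for any one-sided test (such as MWU) and the data layout is an assumption:

```python
def commit_fails(task_timings, p_value, alpha=0.005):
    """task_timings: list of (before, after) timing pairs, one per task
    (8 tasks, 7 repetitions each in the analysis above).  p_value: any
    one-sided test function.  The commit fails if any single task shows
    a significant slowdown; alpha=0.005 is the stricter threshold
    recommended above."""
    return any(p_value(before, after) < alpha
               for before, after in task_timings)
```

Note that testing 8 tasks per commit multiplies the chance of a spurious failure, which is another reason a strict per-task threshold helps.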

Results:
9 commits were manually labelled as harmful. SD correctly identified 2 of the 9 harmful commits and reported 16 harmless commits as harmful. With a p-value threshold of 0.05, MWU correctly identified 6 of the 9 harmful commits but reported 81 harmless commits as harmful. With a threshold of 0.005, MWU correctly identified 2 of the 9 harmful commits and reported 15 harmless commits as harmful.

A visualisation of the timings for one of the harmless commits that was misclassified demonstrates the problems with the data. (E.g. no statistical test could classify the 7th task correctly!)

(orange dash = timing after the commit; black dash = timing before the commit)

# Related Objects

### Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 511687 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[performance/fresnel@master] Add Mann-Whitney U test to compute module

https://gerrit.wikimedia.org/r/511687

Change 511687 merged by jenkins-bot:
[performance/fresnel@master] Add Mann-Whitney U test to compute module

https://gerrit.wikimedia.org/r/511687

Change 541564 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] compute: Rename compareStdev() to diffStdev()

https://gerrit.wikimedia.org/r/541564

Change 541565 had a related patch set uploaded (by Krinkle; owner: Krinkle):

https://gerrit.wikimedia.org/r/541565

Change 541566 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] reports: Switch paint timing judgement from Stdev to Mann-Whitney

https://gerrit.wikimedia.org/r/541566

Change 541635 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] test: Update paint timing compare fixture to be more realistic

https://gerrit.wikimedia.org/r/541635

Change 541564 merged by jenkins-bot:
[performance/fresnel@master] compute: Rename compareStdev() to diffStdev()

https://gerrit.wikimedia.org/r/541564

Change 541565 merged by jenkins-bot: