Page MenuHomePhabricator

Use Mann-Whitney U test to compare sampled times in Fresnel
Closed, ResolvedPublic

Description

Fresnel times certain tasks before and after a commit, then estimates whether the commit caused the times to increase.

In other words, given two sets of numbers, it asks: are the numbers from the second set higher than the numbers in the first set? A commonly used statistical test for this type of question is the Mann-Whitney U test, a simple test that doesn't assume the data are distributed in a certain way.

We compared the Mann-Whitney U test (MWU) to the current test, which checks whether the standard deviations of the two sets overlap (SD). MWU seems to be better at detecting small time increases; SD mostly detects the extreme cases. (See analysis below.)

Currently, the data are often biased, showing an increase in the second set that isn't really caused by the new commit. T223839 will try to fix this.

Until the data are improved, it is arguably a good thing only to report the extreme cases, to avoid inundating people with incorrect reports. This means we will continue to miss smaller performance hits, but in return people won't lose faith in the reports.

Recommendations

  • Replace the current SD test with MWU.
  • Until the data are improved, require a very small p-value (e.g. 0.005) to report a failure, so that only the extreme cases are reported. With better data, we can make the threshold less stringent.

Analysis on simulated data

Method:

  1. generate random numbers, representing "before" and "after" data
  2. add a constant k to the "after" data to represent a change in performance (increasing negative k representing increasing improvement, increasing positive k representing increasing deterioration, 0 representing no change)
  3. apply the SD and the MWU tests to the data, reporting a failure for MWU if the p-value is < 0.05
  4. repeat 1000 times for each k
  5. repeat for different sample sizes (n = number of times each task is performed. Currently we use 7 in production.)

Results:
Ideally a test should report 0% failures for k <= 0 and 100% failures for k > 0. MWU comes closer to this than SD (although it also reports a few more spurious failures). The higher sample size shows that SD is more biased against detecting small differences.

image.png (420×1 px, 42 KB)

Analysis on real data

Method:

  1. manually label 154 real commits according to whether they are expected to harm performance
  2. for each commit, time how long it takes to perform 8 tasks, before and after the commit (time 7 repetitions of each task)
  3. apply the SD and MWU tests; if they find that one or more tasks take longer after the commit, report a failure for the commit

Results:
9 commits were manually labelled as harmful. SD correctly identified 2 of the harmful commits, and reported 16 harmless commits as harmful. Requiring a p-value of < 0.05, MWU correctly identified 6 of the 9 harmful commits, but reported 81 harmless commits as harmful. Requiring a p-value of < 0.005, MWU correctly identified 2 of the 9 harmful commits and reported 15 harmless commits as harmful.

A visualisation of the timings for one of the harmless commits that was misclassified demonstrates the problems with the data. (E.g. no statistical test could classify the 7th task correctly!)

(orange dash = timing after the commit; black dash = timing before the commit)

image.png (494×632 px, 28 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 511687 had a related patch set uploaded (by Tchanders; owner: Tchanders):
[performance/fresnel@master] Add Mann-Whitney U test to compute module

https://gerrit.wikimedia.org/r/511687

Change 511687 merged by jenkins-bot:
[performance/fresnel@master] Add Mann-Whitney U test to compute module

https://gerrit.wikimedia.org/r/511687

Change 541564 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] compute: Rename compareStdev() to diffStdev()

https://gerrit.wikimedia.org/r/541564

Change 541565 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] compute: Add diffMannWhitney()

https://gerrit.wikimedia.org/r/541565

Change 541566 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] reports: Switch paint timing judgement from Stdev to Mann-Whitney

https://gerrit.wikimedia.org/r/541566

Change 541635 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[performance/fresnel@master] test: Update paint timing compare fixture to be more realistic

https://gerrit.wikimedia.org/r/541635

Change 541564 merged by jenkins-bot:
[performance/fresnel@master] compute: Rename compareStdev() to diffStdev()

https://gerrit.wikimedia.org/r/541564

Change 541565 merged by jenkins-bot:
[performance/fresnel@master] compute: Add diffMannWhitney()

https://gerrit.wikimedia.org/r/541565

Change 541566 merged by jenkins-bot:
[performance/fresnel@master] reports: Switch paint timing judgement from Stdev to Mann-Whitney

https://gerrit.wikimedia.org/r/541566

Change 541635 merged by jenkins-bot:
[performance/fresnel@master] test: Update paint timing compare fixture to be more realistic

https://gerrit.wikimedia.org/r/541635