Deploy Mann Whitney U tests for WebPageReplay tests
Closed, ResolvedPublic

Assigned To
Authored By
Peter
Nov 24 2023, 12:42 PM

Description

I'm done coding the Mann Whitney U implementation and it's ready to deploy:

  1. Deploy the new version
  2. Increase the number of runs to 21 to make sure we have enough data to reach statistical significance.
  3. Create new alerts that fire when we have a regression on a couple of metrics.

If it turns out well, change the other alerts and document the new setup.
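
For readers unfamiliar with the test, here is a minimal sketch of a two-sided Mann Whitney U test on two sets of 21 runs. It is not the code deployed in this task: the function names and the sample timings are made up for illustration, it uses only the Python standard library, and it uses the normal approximation with tie and continuity correction rather than an exact p-value.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(baseline, current):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(baseline), len(current)
    combined = list(baseline) + list(current)
    ranks = rank_with_ties(combined)

    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # Variance of U, corrected for ties.
    n = n1 + n2
    counts = {}
    for value in combined:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # every value is identical, nothing to distinguish

    # Continuity correction; clamp at 0 so the p-value never exceeds 1.
    z = max(abs(u - n1 * n2 / 2) - 0.5, 0.0) / math.sqrt(variance)
    return u, math.erfc(z / math.sqrt(2))


# Hypothetical First Visual Change timings (ms), 21 runs each.
baseline_runs = list(range(300, 321))
current_runs = list(range(340, 361))
u_stat, p_value = mann_whitney_u(baseline_runs, current_runs)
print(u_stat, p_value < 0.05)
```

With samples this small, more runs per test make it easier for a real shift to produce a small p-value, which is the reason for step 2. The 0.05 cut-off in the example is a common convention and an assumption here, not a value taken from this task.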

Event Timeline

Change 983723 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Fix bug for running on instant 2.

https://gerrit.wikimedia.org/r/983723

Change 983723 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Fix bug for running on instant 2.

https://gerrit.wikimedia.org/r/983723

Change 983944 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Remove storing baseline tests for the new webpagereplay test.

https://gerrit.wikimedia.org/r/983944

Change 983944 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Remove storing baseline tests for the new webpagereplay test.

https://gerrit.wikimedia.org/r/983944

Change 983966 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Save a new baseline for desktop tests.

https://gerrit.wikimedia.org/r/983966

Change 983966 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Save a new baseline for desktop tests.

https://gerrit.wikimedia.org/r/983966

Change 983967 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Do not collect baseline.

https://gerrit.wikimedia.org/r/983967

Change 983967 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Do not collect baseline.

https://gerrit.wikimedia.org/r/983967

I've turned on the first alarm. It will fire if First Visual Change has a significant change for more than 3 URLs (we test 8).

Let's look at some data. This is for the last 30 days. If something is graphed, it's a regression. We can see that we currently have one for the Taylor Swift page:

Screenshot 2024-01-18 at 10.22.26.png (1×2 px, 1 MB)

Screenshot 2024-01-18 at 10.22.35.png (1×2 px, 1 MB)

But we can also see that we have some regressions coming and going. I've been looking at the data and the code is correct: these are regressions. I wonder though if this is caused by the server running the test?

Screenshot 2024-01-18 at 10.22.45.png (1×2 px, 1 MB)

The actual alert is on another graph for now (it could have been just a number if Grafana supported alerting on that). It looks at the current number of tests that have a regression value higher than 0 (meaning it's a regression), and if more than three do, it will alert.
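
The alert rule described above can be sketched as a simple count. This is only an illustration of the logic, not the Grafana configuration: the function names, the URL keys and the values are hypothetical.

```python
def regressing_urls(regression_by_url):
    """Return the URLs whose regression value is above 0 (a regression)."""
    return sorted(url for url, value in regression_by_url.items() if value > 0)


def should_alert(regression_by_url, max_regressing_urls=3):
    """Fire only when more than `max_regressing_urls` URLs regress at once."""
    return len(regressing_urls(regression_by_url)) > max_regressing_urls


# Hypothetical snapshot for the 8 tested URLs; 0 means no significant regression.
latest = {
    "page-1": 0, "page-2": 12.5, "page-3": 0, "page-4": 0,
    "page-5": 8.0, "page-6": 0, "page-7": 0, "page-8": 3.1,
}
print(regressing_urls(latest), should_alert(latest))
```

Requiring several URLs to regress at the same time means a single noisy page does not trigger the alert on its own.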

The next step to finish this is to deploy tests for Firefox and then also deploy a new version that includes more regression data in the HTML result. I plan to push the Firefox tests on Friday and then roll out the new HTML change on Monday.

The last step is to:

  1. Create the same alert for desktop for Largest Contentful Paint (Chrome).
  2. Set up the same alerts for Firefox.
  3. Set up the same alerts for emulated mobile.

Remove the old alerts.

Change 991695 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Update container with more baseline information.

https://gerrit.wikimedia.org/r/991695

Change 991695 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Update container with more baseline information.

https://gerrit.wikimedia.org/r/991695

I've rearranged the alerts and will turn them on on Monday and then update the documentation. The last thing to do for now would then be to make sure that we baseline automatically once a week (Sundays?), so that throughout the week we run against a baseline that was created that Sunday. At least that is better than now, where I manually make sure we run a baseline test.

I've added the missing alerts today and updated the documentation. Would like to summarise the work in a blog post and then close this task as finished.

I've reached out to Greg at Mozilla today to ask for feedback on a couple of things. For some URLs we get changes going up and down, for example the ChatGPT page using Firefox:

Screenshot 2024-01-25 at 13.57.33.png (1×2 px, 405 KB)

Ok, this has been running for over a month now. I'm having a meeting with Gregory Mierzwinski tomorrow for some feedback. Since I upgraded to the latest WebPageReplay we have had no instability in metrics for Chrome. For Firefox we have had one page that goes back and forth. Feedback from Greg is that if only one URL fails, maybe we should skip that URL.

Another thing to discuss is when to baseline. When I deployed it the first time I re-baselined for every run (the run before the last run is the baseline). That should work well if we can "trust" the system. One problem is that we need to make sure that we act on the one alert that will fire, because the next time we run, we will test against the regression. Right now we re-baseline on Sundays. At the moment we re-baseline all Sunday (every run that day creates a new baseline). It seems though that Sundays create noise (this is the first Sunday we re-baselined). I can see that the alert is jumping up and down.
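
The trade-off between the two baseline strategies can be illustrated with a small simulation. This is a simplified sketch: it compares per-run medians against a fixed millisecond threshold instead of running the Mann Whitney U test, and the function name, threshold and numbers are invented for the example.

```python
def flagged_runs(medians, rolling, threshold_ms=20):
    """Return the indexes of runs flagged as slower than their baseline.

    medians[0] acts as the initial baseline. With rolling=True every run
    becomes the baseline for the next one; with rolling=False the initial
    baseline is kept for all later runs (as with a weekly baseline).
    """
    baseline = medians[0]
    flagged = []
    for index, value in enumerate(medians[1:], start=1):
        if value - baseline > threshold_ms:
            flagged.append(index)
        if rolling:
            baseline = value
    return flagged


# Hypothetical medians (ms): a regression of about 50 ms lands at run 3 and stays.
medians = [300, 302, 299, 350, 351, 349, 352]
print(flagged_runs(medians, rolling=True))   # [3]
print(flagged_runs(medians, rolling=False))  # [3, 4, 5, 6]
```

With a rolling baseline the regression is flagged exactly once and then absorbed into the baseline, so a missed alert is lost. With a fixed baseline the regression keeps being flagged until the next re-baseline.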

Change 997427 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Disable loading the Firefox extension since it's not used.

https://gerrit.wikimedia.org/r/997427

Change 997427 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Disable loading the Firefox extension since it's not used.

https://gerrit.wikimedia.org/r/997427

I got some feedback from Greg:

  • First let's run the exact same test against the exact same content throughout a day or two and see the variance we have. I need to make some changes in our implementation because today we always re-record against a fresh version of the page. With the fix, we will run against the exact same version (the way we tried with the static Banksy page). Maybe we can use the Banksy page first.
  • When we have a significant change, alert only if the change is larger than 2%. Today we alert on all significant changes.
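
The second suggestion, combining statistical significance with a minimum effect size, could look roughly like this. It is a sketch under assumptions: the change is measured between medians, only slowdowns count, the 0.05 significance level is a convention rather than a value from this task, and all names are hypothetical. The p-value is expected to come from the Mann Whitney U test.

```python
from statistics import median


def percent_change(baseline, current):
    """Relative change of the current median versus the baseline median, in %."""
    base = median(baseline)
    return (median(current) - base) / base * 100


def should_alert_on_change(p_value, baseline, current,
                           alpha=0.05, min_change_percent=2.0):
    """Alert only when the change is both significant and large enough."""
    significant = p_value < alpha
    large_enough = percent_change(baseline, current) > min_change_percent
    return significant and large_enough


# Hypothetical timings (ms): significant but only about 1% slower, so no alert.
baseline = [298, 300, 302]
slightly_slower = [301, 303, 305]
print(should_alert_on_change(0.01, baseline, slightly_slower))
```

With many runs and a stable replay setup, even tiny differences can become statistically significant, so the extra threshold filters out changes that are real but too small to act on.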

Change 997524 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Run WPR tests against Banksy.

https://gerrit.wikimedia.org/r/997524

Change 997524 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Run WPR tests against Banksy.

https://gerrit.wikimedia.org/r/997524

Change 997822 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Cleanup tests to make time for running more Mann Whitney U on group 0 and 1.

https://gerrit.wikimedia.org/r/997822

Change 997822 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Cleanup tests to make time for running more Mann Whitney U on group 0 and 1.

https://gerrit.wikimedia.org/r/997822

I removed some old tests (beta cluster), increased the number of runs for group 0 and 1, and then turned on Mann Whitney tests for them too.

Change 997893 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Remove baselining for group 0 and 1.

https://gerrit.wikimedia.org/r/997893

Change 997893 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Remove baselining for group 0 and 1.

https://gerrit.wikimedia.org/r/997893

Change 998146 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Add group 0 and group 1 tests for Firefox.

https://gerrit.wikimedia.org/r/998146

Change 998146 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Add group 0 and group 1 tests for Firefox.

https://gerrit.wikimedia.org/r/998146

Change 998147 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Remove comments in group 0 and 1 files

https://gerrit.wikimedia.org/r/998147

Change 998147 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Remove comments in group 0 and 1 files

https://gerrit.wikimedia.org/r/998147

I'm updating the group 0 and group 1 alerts today for desktop. What's missing to close this is then to do the blog post and to do emulated mobile tests with significant change.

Change 998402 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Remove beta and querybuilder tests.

https://gerrit.wikimedia.org/r/998402

Change 998402 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Remove beta and querybuilder tests.

https://gerrit.wikimedia.org/r/998402

Change 998486 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Run baseline tests for emulated mobile.

https://gerrit.wikimedia.org/r/998486

Change 998486 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Run baseline tests for emulated mobile.

https://gerrit.wikimedia.org/r/998486

All WebPageReplay tests have been updated to use Mann Whitney. I've updated the documentation too; now I need to finish the blog post to close this task.

Change 1001346 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] All WebPageReplay tests should have a new baseline on Sundays.

https://gerrit.wikimedia.org/r/1001346

Change 1001346 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] All WebPageReplay tests should have a new baseline on Sundays.

https://gerrit.wikimedia.org/r/1001346

Change 1001377 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Update to latest sitespeed.io 33.0.0.

https://gerrit.wikimedia.org/r/1001377

Change 1001377 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Update to latest sitespeed.io 33.0.0.

https://gerrit.wikimedia.org/r/1001377

Change 1003607 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[performance/synthetic-monitoring-tests@master] Run 21 runs for all WebPageReplay tests.

https://gerrit.wikimedia.org/r/1003607

Change 1003607 merged by jenkins-bot:

[performance/synthetic-monitoring-tests@master] Run 21 runs for all WebPageReplay tests.

https://gerrit.wikimedia.org/r/1003607