I've been looking into the instability of Firefox metrics in WebPageTest in T288451, and today I thought I had found the reason. Instead, I think I found a problem that we have across the board for all tests we run on AWS.
We run a CPU benchmark in JavaScript to measure how "fast" the CPU is in our synthetic tests (the same as we do for some of our real users). When I started digging into the Firefox results I could see that the CPU benchmark took around 60 ms for most runs, but now that we do 11 runs, some of the runs take 90 ms. That is quite a big difference. Then I looked at the Chrome tests and could see the same difference there (Chrome metrics are more stable, but Chrome also gets more love in WebPageTest than Firefox). Since both browsers show the same pattern, maybe it has something to do with WebPageTest (the CPU benchmark running at the same time as something else), or maybe the specific AWS instance is broken?
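The idea behind a CPU benchmark like this is simple: run a fixed amount of busy work and time it, so a slower (or throttled) CPU shows up as a longer duration. Here is a minimal sketch of that technique — illustrative only, not the exact script we run, and the amount of work is an arbitrary choice:

```javascript
// Minimal CPU benchmark sketch: time a fixed amount of busy work.
// A slower or contended CPU takes longer to finish the same loop.
function cpuBenchmark() {
  const start = Date.now();
  let sum = 0;
  // Fixed workload (the iteration count here is an arbitrary example).
  for (let i = 0; i < 1e7; i++) {
    sum += Math.sqrt(i);
  }
  const duration = Date.now() - start;
  return { duration, sum };
}

const result = cpuBenchmark();
console.log(`CPU benchmark: ${result.duration} ms`);
```

On stable hardware, repeated runs of a script like this should report durations close to each other; a spread like 60 ms vs 90 ms between runs on the same machine points at the environment rather than the browser.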
Then I looked at our WebPageReplay tests (which use sitespeed.io with WebPageReplay). Here are five different runs:
Aha, the same pattern here! Independently of the tool, we see the same thing. So it is probably AWS?
We don't have a bare metal server where we run our tests today, but I have my own tests that run on a Mac mini M1 (a dedicated device). There the numbers are much closer:
However, at the moment I only run Safari tests on the Mac mini, so that's a factor. Let me turn on Chrome tests to make 100% sure that the metrics are more stable.
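One way to put a number on "more stable" is the coefficient of variation (standard deviation divided by the mean) of the benchmark durations across runs — lower means more stable. A small sketch, using the 60/90 ms spread seen on AWS as the sample (the helper name is mine, not from our tooling):

```javascript
// Coefficient of variation (stddev / mean) of a set of samples.
// Lower values mean the runs are more consistent.
function coefficientOfVariation(samples) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / samples.length;
  return Math.sqrt(variance) / mean;
}

// Example: AWS-style runs alternating between 60 ms and 90 ms.
const awsRuns = [60, 60, 90, 60, 90];
console.log(coefficientOfVariation(awsRuns)); // ≈ 0.20, i.e. a 20% spread
```

Tracking this value over time per test machine would make it easy to spot when an instance starts drifting.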
When Gilles and I started with the WebPageReplay tests many years ago and tried different providers, we didn't think about running CPU benchmark tests over time, so we missed this.
Does the difference matter for our other metrics? I'm not sure, but I think we should dig deeper and try to run tests in a more isolated environment.