We have a problem today where the metrics from our synthetic testing are unstable, and how unstable they are depends on the server/cloud setup we use.
Background
When @Gilles and I were working on implementing a replay proxy in T176361 we tried out many different server solutions: running on AWS, GC (Google Cloud), DO (DigitalOcean), Cloud VPS and bare metal. The solution that gave us the most stable metrics was AWS, so we deployed our Browsertime/WebPageReplay setup there.
However, we have seen that AWS instances differ in stability and that this can change over time (let me find that task later). We usually see an instability of about 33 ms in first visual change (when something is first painted on the screen), but sometimes it increases and we need to test out new AWS servers and redeploy.
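To make it concrete, by instability I mean the spread of first visual change across repeated runs on the same server. A rough illustration (the numbers below are invented, not real measurements):

```
from statistics import median, stdev

# Hypothetical firstVisualChange values (ms) from repeated runs on one server
first_visual_change_ms = [1100, 1133, 1100, 1167, 1133, 1100]

print("median:", median(first_visual_change_ms), "ms")
print("spread (stdev):", round(stdev(first_visual_change_ms), 1), "ms")
```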
What to do
I want us to try running on bare metal again and to get help configuring it (trying out different kernels, setting everything up for the best performance over time). It really annoys me that we didn't get good stability the last time we tested on bare metal.
To run the tests, the server needs Docker and needs to be able to send the metrics to Graphite. We need to run the tests on each setup for at least a couple of days to know whether the metrics are stable.
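For sending metrics to Graphite, something like this is all the connectivity we need: a minimal sketch using Graphite's plaintext protocol on the default port 2003. The host name and metric path below are placeholders, not our real production values.

```
import socket
import time

GRAPHITE_HOST = "graphite.example.org"  # placeholder, not the real host
GRAPHITE_PORT = 2003                    # Graphite plaintext protocol port

def send_metric(path: str, value: float) -> None:
    """Send one data point using Graphite's plaintext protocol."""
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# e.g. send_metric("browsertime.firefox.firstVisualChange", 1234)
```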
When we run the tests we start the browser (Firefox/Chrome) and record a video of the screen using FFMPEG. We then analyze the video to find when elements appear on the screen.
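Roughly what that step does, as a simplified sketch (this is not the actual Browsertime code): record the display with FFMPEG, split the video into frames and report the first frame that differs from the initial one. The display number, paths, resolution and duration are made-up examples.

```
import glob
import os
import subprocess

from PIL import Image, ImageChops  # pip install Pillow

DISPLAY = ":99"          # assumed Xvfb display the browser runs on
VIDEO = "/tmp/run.mp4"   # example output path
FRAMES_DIR = "/tmp/frames"
FPS = 30                 # 30 fps gives ~33 ms granularity per frame

def record(seconds: int) -> None:
    """Record the screen for a fixed number of seconds using x11grab."""
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "x11grab", "-framerate", str(FPS),
        "-video_size", "1280x720", "-i", DISPLAY,
        "-t", str(seconds), VIDEO,
    ], check=True)

def first_visual_change() -> float:
    """Return the timestamp (ms) of the first frame that differs from frame 0."""
    os.makedirs(FRAMES_DIR, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", VIDEO, f"{FRAMES_DIR}/%04d.png"], check=True
    )
    frames = sorted(glob.glob(f"{FRAMES_DIR}/*.png"))
    baseline = Image.open(frames[0]).convert("L")
    for index, path in enumerate(frames[1:], start=1):
        diff = ImageChops.difference(baseline, Image.open(path).convert("L"))
        # getbbox() is None when the frames are identical; a real implementation
        # would also filter out video compression noise with a threshold
        if diff.getbbox() is not None:
            return index * 1000.0 / FPS
    return -1.0

if __name__ == "__main__":
    record(10)
    print("first visual change at", first_visual_change(), "ms")
```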
Why is this important?
The more stable the metrics are, the smaller the regressions we can detect.