I think what we have built over the years to collect performance metrics is great: WebPageTest, WebPageReplay, the performance device lab, and the RUM data. However, the data we collect is used independently: when we have a regression in one of the tools, we go through the rest and try to see the same pattern there. We should try to automate that instead, so that our tooling can tell us whether we have a regression or not.
As a start, I think we should do something like this:
- Make it easier to understand whether we have a performance regression or not. I think we should combine data from all the tools we have, and make the answer easy to read. Best case, we would have a green/yellow/red light that shows us the status (see the first sketch after this list). It should be easy for us, for all Wikimedia developers, and for everyone in the world to understand whether we have a regression or not. We also need to make it easier to dig into the data (see https://wiki.mozilla.org/TestEngineering/Performance/Sheriffing/Workflow), so that a regression is easier to understand and people outside the performance team can fix the issue. We should aim for making everything easy to understand, so that people outside of the foundation can use it too.
- We should collect data about the regressions (and alerts) to make it easy to generate reports like Mozilla does in https://blog.mozilla.org/performance/2020/10/15/performance-sheriff-newsletter-september-2020/. That would make it easier to see how we are doing and how the tools are doing. Keeping statistics on found regressions and false alarms lets us verify that the tools work as they should, and tune them to work better (see the second sketch after this list).
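
To make the first point concrete, here is a minimal sketch (in Python, just for illustration) of how per-tool regression verdicts could be folded into one green/yellow/red light. The tool names, the `combined_status` function, and the idea that each tool produces a simple yes/no verdict are all assumptions on my part, not anything our tools do today:

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"    # no tool reports a regression
    YELLOW = "yellow"  # some tools report a regression, others do not
    RED = "red"        # every tool agrees there is a regression


def combined_status(verdicts: dict[str, bool]) -> Status:
    """Combine per-tool regression verdicts into one traffic light.

    `verdicts` maps a tool name (e.g. "webpagetest", "webpagereplay",
    "device-lab", "rum") to True if that tool flagged a regression.
    """
    flagged = sum(verdicts.values())
    if flagged == 0:
        return Status.GREEN
    if flagged == len(verdicts):
        return Status.RED
    return Status.YELLOW


if __name__ == "__main__":
    verdicts = {
        "webpagetest": True,
        "webpagereplay": True,
        "device-lab": False,
        "rum": False,
    }
    print(combined_status(verdicts).value)  # -> yellow
```

The point of the yellow state is exactly the manual work we do today: some tools agree, some do not, and a human needs to dig into the data.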
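And a second sketch for the record keeping in the second point: if every alert is stored together with its outcome after investigation, a monthly report in the style of Mozilla's sheriff newsletter almost writes itself. Again, the record shape and field names are made up for illustration:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Alert:
    """One alert fired by one of our tools (hypothetical record shape)."""
    tool: str              # e.g. "webpagetest" or "rum"
    fired: date            # when the alert fired
    real_regression: bool  # verdict after investigation
    fixed: bool = False    # whether the underlying issue was fixed


def monthly_summary(alerts: list[Alert], year: int, month: int) -> dict[str, int]:
    """Summarize one month of alerts: total, real, false alarms, fixed."""
    monthly = [a for a in alerts if a.fired.year == year and a.fired.month == month]
    real = [a for a in monthly if a.real_regression]
    return {
        "alerts": len(monthly),
        "real regressions": len(real),
        "false alarms": len(monthly) - len(real),
        "fixed": sum(a.fixed for a in real),
    }


if __name__ == "__main__":
    alerts = [
        Alert("webpagetest", date(2020, 10, 2), real_regression=True, fixed=True),
        Alert("rum", date(2020, 10, 9), real_regression=False),
    ]
    print(monthly_summary(alerts, 2020, 10))
    # -> {'alerts': 2, 'real regressions': 1, 'false alarms': 1, 'fixed': 1}
```

The false-alarm count is what would let us tune the tools over time: if one tool produces most of the false alarms, we know where to adjust thresholds.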
I think this could be a great team goal in the future, and it would also help us all work together on one (big) task.