
Create a better warning system for performance regressions
Closed, ResolvedPublic

Description

I think what we have built through the years to collect performance metrics is great: WebPageTest, WebPageReplay, the performance device lab and the RUM data. However, the data we collect is used independently: when we have a regression in one of them, we go through the rest and try to see the same pattern there. We should try to automate that instead, so that our tooling can tell us whether we have a regression or not.

As a start I think something like this:

  1. Make it easier to understand whether we have a performance regression or not. I think we should combine data from all the tools we have and make it easy to know. Best case, we have a green/yellow/red light that shows us (a rough sketch follows after this list). It should be easy for us, for all Wikimedia developers and for everyone in the world to understand whether we have a regression or not. Then we need to make it easier to dig into the data (see https://wiki.mozilla.org/TestEngineering/Performance/Sheriffing/Workflow) so that it is easier to understand a regression and possible for people outside the performance team to fix the issue. We should aim for making everything easy to understand, so that people outside of the Foundation can use it.
  2. We should collect data about the regressions (and alerts) to make it easy to generate reports like Mozilla does in https://blog.mozilla.org/performance/2020/10/15/performance-sheriff-newsletter-september-2020/ - that would make it easier to see how we are doing and how the tools are doing. Keeping statistics of found regressions and false alerts is great, so we know that the tools work as they should and can tune them to work better.
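
A minimal sketch in Python of what both points could look like, assuming we simply count how many data sources currently flag a regression. The tool names, the combined_status aggregation and the AlertRecord bookkeeping are all made up for illustration; together they show one shared verdict plus a log we could generate Mozilla-style reports from.

```
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    GREEN = "green"    # no data source reports a regression
    YELLOW = "yellow"  # sources disagree: needs a human look
    RED = "red"        # all sources agree that we have a regression


def combined_status(tool_regressions: dict) -> Status:
    """Fold per-tool verdicts (synthetic tests, replay, RUM) into one
    traffic light that anyone can read."""
    flagged = sum(tool_regressions.values())
    if flagged == 0:
        return Status.GREEN
    if flagged == len(tool_regressions):
        return Status.RED
    return Status.YELLOW


@dataclass
class AlertRecord:
    """One row per fired alert, so we can later report how many alerts
    turned out to be real regressions versus false alerts (point 2)."""
    fired_at: datetime
    verdicts: dict
    status: Status
    confirmed_regression: Optional[bool] = None  # filled in after triage


alert_log = []

# Example: only WebPageTest flags a regression, so we show yellow,
# log the alert and dig into the other data sources manually.
verdicts = {"webpagetest": True, "webpagereplay": False, "navtiming": False}
status = combined_status(verdicts)
if status is not Status.GREEN:
    alert_log.append(AlertRecord(datetime.now(timezone.utc), verdicts, status))
print(status)
```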

I think this could be a great team goal in the future, and it would also help us all work together on one (big) task.

Event Timeline

A first start could be to make our alerts more intelligent. Today we have many different tests/alerts that fire independently, and I think we should aim to make them work together.

Example:

  • To know if we have a regression using WebPageTest, we have three different queries that run using Chrome: we check three URLs and compare whether First Visual Change has increased over the last X days. If all three URLs have increased, we fire an alert (see the sketch after this list for how these checks could fit together).
  • We also have another alert (another graph) that checks whether we got metrics from WebPageTest in the last Y hours. That check is there to make sure that WebPageTest works. That alert should have the highest priority, and the performance team needs to investigate why there are no metrics from WebPageTest.
  • We can also check the TTFB for our tests the same way we check First Visual Change. If we also have a high TTFB, we know something happened with the backend. If there's no increase in backend time, we know something has been pushed.
  • We can also check the TTFB standard deviation. I've seen that over time our standard deviation can increase, and if that's the case we need to look into the WebPageTest server to see if we can understand what's going on.
  • We could also get better at knowing what caused a regression: if the regression isn't tied to the main train going out, it should be easier to automatically narrow down what's been pushed using the data in https://wikitech.wikimedia.org/wiki/Server_Admin_Log (or a better format).
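
To make the checks above concrete, here is a rough sketch of how they could be ordered into a single alert. Everything in it is an assumption: the function names, the thresholds standing in for "X days" and "Y hours", the example URLs and numbers, and the return strings are placeholders, not the existing alert definitions.

```
from statistics import mean, stdev

LOOKBACK_DAYS = 7           # "X days" of history used as the baseline
MAX_SILENCE_HOURS = 6       # "Y hours" without metrics before we alarm
TTFB_INCREASE_LIMIT = 1.10  # >10% slower TTFB counts as a backend change
TTFB_STDEV_LIMIT = 200.0    # ms; a noisy TTFB means check the test server


def metric_regressed(series):
    """True if the latest value is above the mean of the LOOKBACK_DAYS
    of runs that came before it."""
    baseline, latest = series[-(LOOKBACK_DAYS + 1):-1], series[-1]
    return latest > mean(baseline)


def webpagetest_chrome_alert(first_visual_by_url, hours_since_last_metric,
                             ttfb_series):
    # 1. Highest priority: no fresh data means WebPageTest itself is broken.
    if hours_since_last_metric > MAX_SILENCE_HOURS:
        return "no-data: investigate why WebPageTest stopped reporting"

    # 2. A growing TTFB standard deviation usually means the test server
    #    is unstable rather than that the site regressed.
    if stdev(ttfb_series) > TTFB_STDEV_LIMIT:
        return "noisy: TTFB standard deviation increased, check the WebPageTest server"

    # 3. Only fire the regression alert when every monitored URL regressed,
    #    which keeps a single flaky page from paging anyone.
    if not all(metric_regressed(s) for s in first_visual_by_url.values()):
        return "ok"

    # 4. If TTFB also went up, the backend is the likely culprit; otherwise
    #    something was probably pushed on the frontend.
    if ttfb_series[-1] > mean(ttfb_series[:-1]) * TTFB_INCREASE_LIMIT:
        return "regression: likely backend (TTFB also increased)"
    return "regression: likely frontend (something has been pushed)"


# Example run with made-up numbers: all three URLs regressed on First
# Visual Change but TTFB stayed flat, so this reports a frontend regression.
print(webpagetest_chrome_alert(
    first_visual_by_url={
        "https://en.wikipedia.org/wiki/Barack_Obama": [1200, 1180, 1210, 1400],
        "https://en.wikipedia.org/wiki/Facebook": [1100, 1120, 1090, 1300],
        "https://en.wikipedia.org/wiki/Sweden": [1000, 1010, 990, 1200],
    },
    hours_since_last_metric=1.0,
    ttfb_series=[180, 190, 185, 200],
))
```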

That is one alert for WebPageTest for Chrome. Ideally we should combine that with the Firefox tests and the tests using WebPageReplay to "know" if it's a frontend regression. And then we could also combine it with the Navigation Timing data we have to narrow it down even better.
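
As an illustration of that cross-check, the sketch below guesses where a regression lives based on which data sources agree. The classification rule (WebPageReplay reproducing the regression means frontend code, a live-only regression means backend or network) is my assumption of how the combination could work, not an existing implementation.

```
def classify_regression(webpagetest_chrome, webpagetest_firefox,
                        webpagereplay, navtiming):
    """Cross-check the data sources to guess where a regression lives.

    WebPageReplay serves recorded responses, so a regression that also
    shows up there is almost certainly frontend code. A regression that
    only the live tests and the RUM data see points at the backend or
    the network instead.
    """
    synthetic_live = webpagetest_chrome or webpagetest_firefox
    if webpagereplay and synthetic_live:
        return "frontend regression (reproduces against replayed responses)"
    if synthetic_live and navtiming:
        return "backend/network regression (only visible against live servers)"
    if synthetic_live or webpagereplay or navtiming:
        return "unconfirmed: only one source shows a change, keep watching"
    return "no regression"


# Example: Chrome and Firefox regress but the replay tests do not,
# and real users see it too: that smells like a backend change.
print(classify_regression(True, True, False, True))
```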

Let me start creating runbooks for the alerts and see what we can automate.

Peter claimed this task.

I think T351929 fixes most of this for us.