I think what we have built over the years to collect performance metrics is great: WebPageTest, WebPageReplay, the performance device lab, and the RUM data. However, the data we collect is used independently: when we have a regression in one of the tools, we go through the rest and try to see the same pattern there. We should try to automate that instead, so that our tooling can tell us whether we have a regression or not.
As a start, I think we should do something like this:
- Make it easier to understand whether we have a performance regression or not. I think we should combine data from all the tools we have, and make the answer easy to read. Best case, we would have a green/yellow/red light that shows us the status (see the first sketch after this list). It should be easy for us, for all Wikimedia developers, and for everyone in the world to understand whether we have a regression or not. We also need to make it easier to dig into the data (see https://wiki.mozilla.org/TestEngineering/Performance/Sheriffing/Workflow), so that a regression is easier to understand and people outside the performance team can fix the issue. We should aim for making everything easy to understand, so that people outside of the foundation can use it too.
- We should collect data about the regressions (and alerts) to make it easy to generate reports like Mozilla does in https://blog.mozilla.org/performance/2020/10/15/performance-sheriff-newsletter-september-2020/. That would make it easier to see how we are doing and how the tools are doing. Keeping statistics on found regressions and false alarms lets us verify that the tools work as they should, and tune them to work better (see the second sketch after this list).
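
To make the first point concrete, here is a minimal sketch (in Python, just for illustration) of how per-tool regression verdicts could be folded into one green/yellow/red light. The tool names, the `combined_status` function, and the idea that each tool produces a simple yes/no verdict are all assumptions on my part, not anything our tools do today:

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"    # no tool reports a regression
    YELLOW = "yellow"  # some tools report a regression, others do not
    RED = "red"        # every tool agrees there is a regression


def combined_status(verdicts: dict[str, bool]) -> Status:
    """Combine per-tool regression verdicts into one traffic light.

    `verdicts` maps a tool name (e.g. "webpagetest", "webpagereplay",
    "device-lab", "rum") to True if that tool flagged a regression.
    """
    flagged = sum(verdicts.values())
    if flagged == 0:
        return Status.GREEN
    if flagged == len(verdicts):
        return Status.RED
    return Status.YELLOW


if __name__ == "__main__":
    verdicts = {
        "webpagetest": True,
        "webpagereplay": True,
        "device-lab": False,
        "rum": False,
    }
    print(combined_status(verdicts).value)  # -> yellow
```

The point of the yellow state is exactly the manual work we do today: some tools agree, some do not, and a human needs to dig into the data.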
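And a second sketch for the record keeping in the second point: if every alert is stored together with its outcome after investigation, a monthly report in the style of Mozilla's sheriff newsletter almost writes itself. Again, the record shape and field names are made up for illustration:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Alert:
    """One alert fired by one of our tools (hypothetical record shape)."""
    tool: str              # e.g. "webpagetest" or "rum"
    fired: date            # when the alert fired
    real_regression: bool  # verdict after investigation
    fixed: bool = False    # whether the underlying issue was fixed


def monthly_summary(alerts: list[Alert], year: int, month: int) -> dict[str, int]:
    """Summarize one month of alerts: total, real, false alarms, fixed."""
    monthly = [a for a in alerts if a.fired.year == year and a.fired.month == month]
    real = [a for a in monthly if a.real_regression]
    return {
        "alerts": len(monthly),
        "real regressions": len(real),
        "false alarms": len(monthly) - len(real),
        "fixed": sum(a.fixed for a in real),
    }


if __name__ == "__main__":
    alerts = [
        Alert("webpagetest", date(2020, 10, 2), real_regression=True, fixed=True),
        Alert("rum", date(2020, 10, 9), real_regression=False),
    ]
    print(monthly_summary(alerts, 2020, 10))
    # -> {'alerts': 2, 'real regressions': 1, 'false alarms': 1, 'fixed': 1}
```

The false-alarm count is what would let us tune the tools over time: if one tool produces most of the false alarms, we know where to adjust thresholds.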
I think this could be a great team goal in the future, and it would also help us all work together on one (big) task.