MediaWiki load time regression should trigger an alarm / page people
Closed, Resolved | Public

Description

We have a lot of metrics on how long MediaWiki takes to load in client browsers. Over the last few months we have had to roll back wmf deployments because of huge regressions in load time.

The latest example is 1.28.0-wmf.18, which caused the load time to surge from a maximum of roughly 800 ms to an average of roughly 1200 ms, as reported in T146099: mw-1.28.0-wmf.18 load-time regression.

Per T143328#2651001, that caused us to roll back wmf.19 again on Sept 19th, even though the regression occurred on Sept 9th, 10 days earlier.

What really matters is that this huge regression went unnoticed for 10 days. We should alarm / notify the right people so that such a regression can be taken care of promptly.

Notably, Release-Engineering-Team definitely needs an alarm if the load time is impacted by a new MediaWiki version, so we can roll back on the spot.
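
To make the request concrete, here is a minimal sketch of the kind of check such an alarm could run. It is not an existing Wikimedia tool: the function name `is_load_time_regression`, the 25% threshold, and the sample values are all hypothetical, chosen only to mirror the ~800 ms to ~1200 ms jump described above.

```python
from statistics import median


def is_load_time_regression(baseline_ms, recent_ms, threshold=0.25):
    """Return True if the median of the recent load-time samples exceeds
    the median of the baseline samples by more than `threshold`
    (a fraction, 0.25 meaning 25%). The threshold is an assumed value."""
    if not baseline_ms or not recent_ms:
        # Not enough data to decide; do not raise an alarm.
        return False
    baseline = median(baseline_ms)
    current = median(recent_ms)
    return current > baseline * (1 + threshold)


# Illustrative samples only, not real measurements.
before_deploy = [780, 800, 760, 790]
after_deploy = [1180, 1220, 1200, 1210]

if is_load_time_regression(before_deploy, after_deploy):
    print("ALERT: load time regression, consider rolling back")
```

Comparing medians rather than single data points is one way to keep a lone slow sample from paging people; the right window size and threshold would need tuning against the real metrics.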

Related Grafana board:
https://grafana-admin.wikimedia.org/dashboard/db/performance-metrics?panelId=7&fullscreen

Event Timeline

Performance-Team is considering making this the focus of our off-site.

Gilles claimed this task.
Gilles added a subscriber: Gilles.

Our new Grafana-based performance alerts have been working for a few months now and have caught a couple of incidents already.