Loosely related to T212482: RFC: Evolve hook system to support "filters" and "actions" only.
'''Goal''': reduce the time it takes to pinpoint the cause of performance regressions by bridging the gap between high-level time-series latency metrics (e.g. per API action) on one side, and detailed profiling (flamegraphs) and explicit profiling (XHGui) on the other.
'''Proposal''': add per-action component breakdown time-series graphs (perhaps stacked graphs).
The main ways extensions plug into MediaWiki are hooks, parser tags, and parser functions. Such callbacks might also create deferred updates. Skins plug in via wfLoadSkin().
In order to diagnose save timing problems much more quickly, it would help to have breakdowns by component/extension (e.g. FlaggedRevs, Echo). This could go beyond save timing, but should be segregated by entry point (save, upload, login, ...). Save timing would be a good start in any case. The idea is to easily spot large increases in runtime attributed to some module, so that an appropriate maintainer can be notified and RelEng has a quick indication of what to revert. Right now this is very difficult.
Perhaps we can have per-wiki breakdowns for the top wikis.
A related concern is having leaderboards for templates and Lua modules, but that is a separate task.
- Add entry point names for metrics in core.
- Enable and send component "hits" per entry point "hit", using wmf-config excimer hooks to find components within stack traces.
- Develop Grafana dashboard.
- Have at least one alert for one component that notifies its steward if its cost during save timing increases significantly. Details TBD. For example, we probably want to instrument this as a counter rather than a per-minute-averaged timer, so that we can aggregate several hours in a statistically accurate way and measure the proportion over that whole window, rather than relying on the oddly unweighted per-minute averages we'd otherwise have.
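
To make the counter idea concrete, here is a rough sketch in Python (not the language this would actually be deployed in) of how sampled stack traces could be turned into per-entry-point component hit counters, and why summing counters over a window gives a different answer than averaging per-minute ratios. Everything in it is an assumption for illustration: the path markers, the file paths, the `_total` key, and the function names are hypothetical and do not reflect any existing wmf-config or excimer interface.

```python
from collections import Counter

# Hypothetical path fragments used to attribute a stack frame to a component.
COMPONENT_MARKERS = {
    "/extensions/FlaggedRevs/": "FlaggedRevs",
    "/extensions/Echo/": "Echo",
}


def components_in_trace(frames):
    """Return the set of components whose files appear in one sampled trace."""
    found = set()
    for frame in frames:
        for marker, name in COMPONENT_MARKERS.items():
            if marker in frame:
                found.add(name)
    return found


def record_sample(counters, entry_point, frames):
    """Count one entry point hit, plus one hit per component seen in the trace."""
    counters[(entry_point, "_total")] += 1
    for name in components_in_trace(frames):
        counters[(entry_point, name)] += 1


def proportion(counters, entry_point, component):
    """Share of an entry point's samples in which the component appeared."""
    total = counters[(entry_point, "_total")]
    return counters[(entry_point, component)] / total if total else 0.0


def weighted_proportion(minutes):
    """Sum the raw counters over the window first, then divide once."""
    component_hits = sum(c for c, _ in minutes)
    total_hits = sum(t for _, t in minutes)
    return component_hits / total_hits if total_hits else 0.0


def unweighted_mean_of_minutes(minutes):
    """Average the per-minute ratios; a quiet minute weighs as much as a busy one."""
    ratios = [c / t for c, t in minutes if t]
    return sum(ratios) / len(ratios) if ratios else 0.0


if __name__ == "__main__":
    counters = Counter()
    # Hypothetical sampled traces for the "save" entry point.
    record_sample(counters, "save", [
        "/srv/mediawiki/extensions/Echo/includes/Example.php",
        "/srv/mediawiki/includes/EditPage.php",
    ])
    record_sample(counters, "save", ["/srv/mediawiki/includes/EditPage.php"])
    print(proportion(counters, "save", "Echo"))

    # Each tuple is (component hits, entry point hits) for one minute.
    minutes = [(1, 2), (10, 1000)]
    print(weighted_proportion(minutes), unweighted_mean_of_minutes(minutes))
```

With the two example minutes, summing counters gives 11 component hits out of 1002 samples (about 1.1%), whereas averaging the per-minute ratios gives about 25.5%, because the quiet minute with two samples counts as much as the busy one. That distortion is what storing raw counters would avoid when aggregating over several hours.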