Page MenuHomePhabricator

Profile and visualise time spent per component/extension in MW entry points
Open, MediumPublic

Description

Loosely related to T212482: RFC: Evolve hook system to support "filters" and "actions" only.

'''Goal''': reduce the time it takes to pin the cause of performance regressions by bridging the gap between high level time-series latency metrics (e.g. per API action) and detailed profiling (flamegraphs) and explicit profiling (XHGUI).

'''Proposal''': add per-action component breakdown time-series graphs (perhaps stacked graphs).

The main ways extensions plug into MediaWiki is via hooks, parser tags, parser functions. Such callbacks might also create deferred updates. Skins plug in via wfLoadSkin().

In order to diagnose save timing problems much more quickly, it would help to have some breakdowns by component/extension (e.g. FlaggedRevs, Echo). This could go beyond save timing, but should be segregated by entry point (save, upload, login,...). Save timing would be a good start in any case. The idea would be to easily spot large increases in runtime attributed to some module so that an appropriate maintainer can be notified and RelEng can have an easy idea sometimes of what to revert. Right now this is very difficult.

Perhaps we can have per-wiki breakdowns for the top wikis.

A related concern is having leaderboards for template and Lua modules...but that's another task itself.

  • Add entry point names for metrics in core.
  • Enable and send component "hits" per entry point "hit" using wmf-config excimer hooks to find components within stack traces
  • Develop Grafana dashboard.
  • Have at least one alert for one component that notifies their steward if their cost during save timing increases significantly. Details TBD, e.g. we probably want to instrument this as a counter rather than a per-minute-averaged timer, so that we can aggregate several hours in a statistically accurate way, and measure the proportion on that, rather than the oddly unweighted per-minute averaging we'd otherwise have.

Event Timeline

This mechanism could also be used to enrich exception traces in Logstash with a string representing the name of an extension, if the exception came from a hook handler. E.g. by using the last string in the array that this feature would use, and pulling it in from Logger/WikiProcessor in some way.

Krinkle triaged this task as High priority.Jul 10 2019, 3:36 PM

We now have dedicated PreSend and PostSend flame graphs published daily, which covers a fair bit of the original use case for this task.

https://performance.wikimedia.org/php-profiling/

The original idea of sending this to Grafana (eg Statsd ro Prometheus) as such is less needed now, but we're recycling this for tracking Save Timing perf budgets, where we do specifically want to plot by higher-level components.

Change 698947 had a related patch set uploaded (by Krinkle; author: Kosta Harlan):

[mediawiki/core@master] DeferredUpdates: Log execution time for updates

https://gerrit.wikimedia.org/r/698947

Change 698947 merged by jenkins-bot:

[mediawiki/core@master] DeferredUpdates: Log execution time for updates

https://gerrit.wikimedia.org/r/698947

Having this would have sped up investigation at T300075 by immediately attributing the added time to the relevant extension (i.e. TheWikipediaLibrary in that example).

It may also have prevented it from reaching group2 through a more localised alert threshold.

aaron lowered the priority of this task from High to Medium.Oct 5 2022, 6:43 PM
aaron updated the task description. (Show Details)

Change 893839 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[operations/mediawiki-config@master] DNM: add per-action component-level profiling in statsd using excimer

https://gerrit.wikimedia.org/r/893839

aaron renamed this task from Per component/extension profiling of hooks and pre-send DeferredUpdates with Grafana dashboards to Per component/skin/extension profiling of entry points with Grafana dashboards.Mar 24 2023, 9:29 PM

Change 893839 merged by jenkins-bot:

[operations/mediawiki-config@master] Add per-action component-level profiling in statsd using excimer

https://gerrit.wikimedia.org/r/893839

Mentioned in SAL (#wikimedia-operations) [2023-03-29T20:17:08Z] <taavi@deploy2002> Started scap: Backport for [[gerrit:893839|Add per-action component-level profiling in statsd using excimer (T225968)]]

Mentioned in SAL (#wikimedia-operations) [2023-03-29T20:18:33Z] <taavi@deploy2002> aaron and taavi: Backport for [[gerrit:893839|Add per-action component-level profiling in statsd using excimer (T225968)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-03-29T20:29:00Z] <taavi@deploy2002> Finished scap: Backport for [[gerrit:893839|Add per-action component-level profiling in statsd using excimer (T225968)]] (duration: 11m 52s)

Krinkle renamed this task from Per component/skin/extension profiling of entry points with Grafana dashboards to Profile and visualise time spent per component/extension in MW entry points.Apr 10 2023, 6:06 PM