Add per-request flamegraph option to WikimediaDebug
Closed, Resolved, Public

Description

Why

Today, to analyze backend performance in production, we provide Arc Lamp, with flamegraphs that accurately visualise the function-call hierarchy of the entire code base.

Screenshot 2023-05-29 at 13.14.44.png (932×2 px, 748 KB)

For the equivalent use case during debugging, we currently have XHGui: https://wikitech.wikimedia.org/wiki/WikimediaDebug#XHGui_profiling (example):

Screenshot 2023-05-29 at 13.16.12.png (1×2 px, 446 KB)

XHGui provides solid information about memory use and function call count, which is useful for specific kinds of debugging. However, it completely lacks information about the function-call hierarchy, and its timing information is not accurate enough to understand speed.

This creates a significant gap in the debugging experience, basically limiting it to performance problems relating to memory usage, and database/storage calls; with no story for analyzing speed and latency.

Worse yet, XHGui deceptively provides hierarchical and timing data that (if interpreted as such) is inaccurate.

Goal

For this task, I'd like to achieve the following:

  1. Able to visualise a debug request as a fairly detailed and standard (bottom-up) flamegraph, and also as a reversed flamegraph.
  2. Able to share links to filtered views of said flamegraph.
  3. Able to combine multiple requests into one flamegraph.

Background

To learn more about why XHGui/Tideways creates high overhead, refer to the Profiling PHP § How profiling can be expensive blog post.

This would be an alternative to XHGui, which has a number of drawbacks:

  • Non-trivial function call overhead. This makes the "time spent" in any given function heavily skewed towards functions that make many function calls. Of course, some amount of function call overhead is "real", but there is non-trivial additional overhead added by XHProf. Thus if on a production request function "foo" and "bar" both take 100ms to complete, but "bar" internally calls many functions, then in an XHProf report it will appear as if "bar" is much slower than it actually is. (Well, it's true that it was that slow, but only when XHProf is active.)
  • No call tree. For something self-describing as a "hierarchical profiler", XHProf has surprisingly little "hierarchical" information. It only measures data for each "parent-child" function name pair.
  • It offers no visualisation. Naturally, per previous point.
  • It is not meant to aggregate multiple requests, and we currently don't. (In theory we could buffer data on mwdebug and merge multiple datasets before forwarding to the XHGui database with faked request metadata).

For example, the following call tree:

* main
  * A1
    * B
       * C
  * A2
    * B
       * C

xhprof-tideways will report the time flattened as:

* main
* main-A1
* main-A2
* A1-B
* A2-B
* B-C

It's not possible (or rather it's hard, and involves guesswork) to figure out how much time of "A1" was spent in C.
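To make the loss of information concrete, here is a minimal sketch (not XHProf's actual code) that flattens two made-up runs into inclusive time per parent-child pair. The sample data, `flatten_to_pairs` and `time_under` are hypothetical names chosen for illustration. Both runs produce identical pair totals, yet the time "A1" spends in "C" differs between them.

```python
from collections import defaultdict

# Hypothetical samples: (call stack, milliseconds spent in the last frame).
run_x = [
    (("main", "A1", "B", "C"), 40),
    (("main", "A1", "B"), 10),
    (("main", "A2", "B", "C"), 10),
    (("main", "A2", "B"), 40),
]
run_y = [
    (("main", "A1", "B", "C"), 10),
    (("main", "A1", "B"), 40),
    (("main", "A2", "B", "C"), 40),
    (("main", "A2", "B"), 10),
]

def flatten_to_pairs(samples):
    """Sum inclusive time per parent-child pair, discarding the full stack."""
    pairs = defaultdict(int)
    for stack, ms in samples:
        pairs[stack[0]] += ms  # the root has no parent
        for parent, child in zip(stack, stack[1:]):
            pairs[f"{parent}-{child}"] += ms
    return dict(pairs)

def time_under(samples, ancestor, func):
    """With full stacks kept, time in `func` below `ancestor` is a plain sum."""
    return sum(
        ms for stack, ms in samples
        if ancestor in stack and func in stack[stack.index(ancestor):]
    )

# The flattened reports are indistinguishable...
same_report = flatten_to_pairs(run_x) == flatten_to_pairs(run_y)
# ...but the full stacks give different answers for "A1 time spent in C".
a1_in_c = (time_under(run_x, "A1", "C"), time_under(run_y, "A1", "C"))
```

A profiler that keeps whole stacks, as a sampling profiler feeding a flamegraph does, can answer the question directly. The pair-based report cannot.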

The interface appears to let you "dig in" to lower calls, but every time you click a function name, you're really navigating sideways, not downward. Which means numbers don't add up, and this can be confusing even to people who use the tool regularly and know the tested code well.

Implementation

Figure out:

  • Decide how to collect traces: Excimer.
  • Decide where to store traces: Misc DB cluster, same as XHGui, via a new "excimer-ui-server" HTTP API.
  • Decide where to generate or store flame graphs: No storage, generate in-browser via self-hosted Speedscope.
  • Evaluate retention and possible abuse: Prune after insert in excimer-ui-server. Use hmac secret to avoid abuse that modifies existing records. No rate limit at this time.
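One way an HMAC secret could protect existing records is sketched below. This is an assumption for illustration, not the scheme excimer-ui-server necessarily uses: the function names, the SHA-256 choice and the idea of deriving a write key from the public profile ID are all hypothetical.

```python
import hashlib
import hmac
import secrets

def new_profile_id() -> str:
    """Random public (read-only) identifier for a profile."""
    return secrets.token_hex(16)

def ingestion_key(profile_id: str, secret: bytes) -> str:
    """Write key derived from the public ID; only holders of `secret` can compute it."""
    return hmac.new(secret, profile_id.encode("ascii"), hashlib.sha256).hexdigest()

def may_write(profile_id: str, presented_key: str, secret: bytes) -> bool:
    """Server-side check before inserting or appending to a record."""
    expected = ingestion_key(profile_id, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, presented_key)
```

Under this sketch, someone who only knows the public link to a profile cannot forge the key needed to write to it, because that requires the shared secret.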

Next steps from T291015#8576699, @Krinkle wrote:

  • client: Rename Client\Profiler to Client\ExcimerClient.
  • client: Add "secret" option to prevent abuse, make ingestion IDs different from real/public/read-only IDs.
  • DBA: Request a misc DB for excimer-ui-server. T331956: Create "excimer" misc database
  • mediawiki: Package php-excimer 1.1.0 or later and deploy it. T332964: Upgrade php-excimer package from 1.0.4 to 1.1.1
  • perf site: Enable PHP on perf.wm.o.
  • perf site: Use Puppet to provision a JSON config file with db credentials, set ENV in apache for excimer-ui-server discovery.
  • perf site: Check in excimer-ui-server with vendor.
  • mediawiki: Add ExcimerClient to wmf-config/lib and modify wmf-config/Profiler to permit its use when using WikimediaDebug. Trigger without need for query param via XWD attribute.
  • WikimediaDebug: Add frontend option to browser extension.

Details

Repo                           Branch      Lines +/-
operations/puppet              production  +9 -3
performance/WikimediaDebug     master      +39 -34
performance/WikimediaDebug     master      +27 -7
operations/mediawiki-config    master      +414 -1
operations/puppet              production  +1 -1
operations/puppet              production  +64 -9
labs/private                   master      +4 -0
performance/docroot            master      +30 K -0
labs/private                   master      +1 -0
performance/excimer-ui-server  master      +50 -0
performance/excimer-ui-client  master      +23 -0
performance/WikimediaDebug     master      +160 -125
mediawiki/php/excimer          master      +258 -23
mediawiki/core                 master      +3 -1
performance/excimer-ui-client  master      +645 -0
performance/excimer-ui-server  master      +30 -1
integration/config             master      +23 -23
integration/config             master      +56 -56
integration/config             master      +238 -0
performance/excimer-ui-server  master      +4 K -0
integration/config             master      +8 -0

Event Timeline


Send or store somewhere

How about posting the whole output of ExcimerLog::formatCollapsed(), unmodified, to an HTTP service?
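For context, the collapsed format is the "folded stacks" text that flamegraph.pl consumes: one line per distinct stack, frames separated by semicolons, followed by a space and a sample count. A rough sketch of how a receiving service might parse it, and sum several requests into one log, could look like this. The function names are hypothetical and this is not code from any of the repositories discussed here.

```python
from collections import Counter

def parse_collapsed(text: str) -> Counter:
    """Map each 'frame;frame;frame' stack to its sample count."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts

def merge_collapsed(*logs: str) -> str:
    """Sum the counts of identical stacks across several collapsed logs."""
    total = Counter()
    for log in logs:
        total.update(parse_collapsed(log))
    return "\n".join(f"{stack} {n}" for stack, n in sorted(total.items()))
```

For example, merging `"main;A1;B;C 3\nmain;A2;B 1\n"` with `"main;A1;B;C 2\n"` yields a single `main;A1;B;C 5` line alongside `main;A2;B 1`.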

Generate flamegraph somewhere

How about generating it on the fly when the request comes in?

We could port flamegraph.pl to PHP, it's only 1200 lines with no dependencies.

Or we could generate the flame graph with client-side JS, but embed the JS in a PHP script which also delivers the log from storage. Storage can be anything that PHP can talk to.

I know Ori liked Python and shell scripts, but I'm not really a Python coder.

More concrete proposal:

  • Create Gerrit project performance/excimer-ui
  • Client library and server in the one repo, like Shellbox?

Dependency-free auto-prepend helper:

use Wikimedia\ExcimerUI\ProfilerSetup;
require __DIR__ . '/vendor/wikimedia/excimer-ui/src/ProfilerSetup.php';
ProfilerSetup::setup( [
    'url' => 'https://performance.wikimedia.org/excimer-ui/',
] );

Activation modes:

  • Auto (unconditional)
  • Query string
  • Query string triggers Set-Cookie
  • X-Wikimedia-Debug

Client state access via class static methods:

use Wikimedia\ExcimerUI\ProfilerState;

class SomeApp {
    function getFooterLink() {
        if ( ProfilerState::isEnabled() ) {
            return ProfilerState::makeLink( [ /* options */ ] );
        } else {
            return '[profiling is disabled]';
        }
    }
}

Server features:

  • Speedscope looks nice, can already understand formatCollapsed() output. Maybe we can just integrate it.
  • Table view: I have a table view in my LocalSettings.php which is sometimes handy.
  • Line view: I have a very basic (not suitable for humans) line view in my LocalSettings.php. Maybe something like this could be implemented, using local checkouts of public source, with a path map.

Server entry point and config similar to Shellbox.

The ProfilerSetup/ProfilerState split is awkward. ProfilerSetup needs state. And maybe it is nice to have a singleton object in there somewhere, as a setup/config bundle, and for testability.

Using Speedscope as-is is simple enough. You can specify a URL to download as a URL fragment parameter. But making any sort of change to the output is more complicated. Even to just make a few tweaks to the HTML page it's embedded in, it looks like I'd have to patch the source and rebuild it.
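As a small illustration of the URL fragment approach, Speedscope's documented `#profileURL=` fragment (with an optional `title=`) can be assembled as below. The helper name and the example.org URLs are placeholders, not the real deployment paths.

```python
from urllib.parse import quote

def speedscope_link(base: str, profile_url: str, title: str = "") -> str:
    """Build a link that makes a hosted Speedscope fetch `profile_url` on load."""
    # safe="" percent-encodes ':' and '/' too, so the inner URL stays one value.
    fragment = "profileURL=" + quote(profile_url, safe="")
    if title:
        fragment += "&title=" + quote(title, safe="")
    return f"{base}#{fragment}"

link = speedscope_link(
    "https://example.org/speedscope/",
    "https://example.org/profile/abc123",
)
```

Because everything after `#` stays in the browser, the page hosting Speedscope never sees the profile URL, and Speedscope downloads the profile itself.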

I added speedscope output to Excimer. Seems to work.

Change 865847 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/excimer@master] Add speedscope output

https://gerrit.wikimedia.org/r/865847

I would like approval or alternative ideas for the Gerrit repo names before I create them, since they can't be changed after creation. Current idea:

  • performance/excimer-ui/client: Provides the class Wikimedia\ExcimerUI\Client\Profiler. Its job is to manage the ExcimerProfiler, to post the results to the server, and to make a link to the server. Array configuration. No dependencies (except PHP extensions). You can bundle the class file if you don't want to use composer.
  • performance/excimer-ui/server: Provides a storage and retrieval interface for profiling results. Hosts speedscope by proxying requests to node_modules/speedscope. Reads a JSON configuration file. Some dependencies.

Both components are basically done and will be ready to commit after a little bit of polishing.

Other updates:

  • I added merging of JSON files. So it is now possible to perform multiple MW requests with the same profile ID. The server PHP code merges the JSON blobs on demand, delivering a speedscope file with multiple elements in the "profiles" array. Speedscope presents a single profile at a time, with a drop-down selector to switch to another profile in the same file. There's no aggregation feature. Maybe one could be added to speedscope if we want that.
  • I added compression. The zlib extension is now required for the client and server.
  • I made a pull request to bundle the font. No response yet.
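The merging step could be sketched roughly as follows. This is not the server's PHP code. It assumes each stored blob is a complete Speedscope document containing only "sampled" profiles, whose samples are lists of indexes into that document's shared frame table, and `merge_speedscope` is a hypothetical name. The indexes have to be remapped because each request builds its own frame table.

```python
import json

def merge_speedscope(files: list[dict]) -> dict:
    """Combine several speedscope documents into one with multiple profiles."""
    frames: list[dict] = []          # merged shared frame table
    index_of: dict[str, int] = {}    # serialized frame -> index in `frames`
    profiles: list[dict] = []
    for doc in files:
        # Translate this document's frame indexes into merged-table indexes.
        remap = []
        for frame in doc["shared"]["frames"]:
            key = json.dumps(frame, sort_keys=True)
            if key not in index_of:
                index_of[key] = len(frames)
                frames.append(frame)
            remap.append(index_of[key])
        for profile in doc["profiles"]:
            merged = dict(profile)  # shallow copy; inputs stay untouched
            merged["samples"] = [[remap[i] for i in s] for s in profile["samples"]]
            profiles.append(merged)
    return {
        "$schema": "https://www.speedscope.app/file-format-schema.json",
        "shared": {"frames": frames},
        "profiles": profiles,
    }
```

Each request remains its own entry in "profiles", which matches the drop-down behaviour described above. Nothing is aggregated across requests.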

@tstarling Two suggestions to consider, but fine as-is as well if you prefer that after considering these (if you hadn't already):

  • Perhaps less nesting, as performance/excimer-ui-client and performance/excimer-ui-server. I find that nesting tends to add complexity and makes more things harder than it makes easier. I tend to lean toward flatter repos for ease of git-clone (naturally the right name), ACL inheritance, and perhaps a more obvious mapping between package names (and their local checkouts) and where the git repo is hosted, just under a namespace for ACL/ownership.
  • The new excimer debug client seems in-scope for https://github.com/wikimedia/arc-lamp where we currently house the excimer sampling client already, and I wouldn't mind hosting the server there as well. It's a matter of taste and branding though. I feel it might be easier to get more attention from open-source and adoption elsewhere if it's more concentrated, but it's certainly not the only factor. No objection either way.

Change 865847 merged by jenkins-bot:

[mediawiki/php/excimer@master] Add speedscope output

https://gerrit.wikimedia.org/r/865847

Change 868219 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[performance/excimer-ui-client@master] Excimer UI Client

https://gerrit.wikimedia.org/r/868219

Change 868222 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[performance/excimer-ui-server@master] Excimer UI Server

https://gerrit.wikimedia.org/r/868222

Change 868477 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [WIP] Excimer profile link footer

https://gerrit.wikimedia.org/r/868477

Change 874959 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[integration/config@master] Add the new profiling libraries

https://gerrit.wikimedia.org/r/874959

Change 874959 merged by jenkins-bot:

[integration/config@master] Add the new profiling libraries

https://gerrit.wikimedia.org/r/874959

Change 868222 merged by jenkins-bot:

[performance/excimer-ui-server@master] Excimer UI Server

https://gerrit.wikimedia.org/r/868222

Change 875443 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[integration/config@master] Add Excimer packages to PHP 7.4+ dockerfiles

https://gerrit.wikimedia.org/r/875443

Change 876186 had a related patch set uploaded (by Hashar; author: Tim Starling):

[integration/config@master] jjb: update jobs for php-excimer

https://gerrit.wikimedia.org/r/876186

Change 875443 merged by jenkins-bot:

[integration/config@master] Add Excimer packages to PHP 7.4+ dockerfiles

https://gerrit.wikimedia.org/r/875443

Change 876186 merged by jenkins-bot:

[integration/config@master] jjb: update jobs for php-excimer

https://gerrit.wikimedia.org/r/876186

Change 876190 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: update quibble jobs for php-excimer

https://gerrit.wikimedia.org/r/876190

Change 876190 merged by jenkins-bot:

[integration/config@master] jjb: update quibble jobs for php-excimer

https://gerrit.wikimedia.org/r/876190

Change 868219 merged by jenkins-bot:

[performance/excimer-ui-client@master] Excimer UI Client

https://gerrit.wikimedia.org/r/868219

Next steps:

  • client: Rename Client\Profiler to Client\ExcimerClient.
  • client: Add "secret" option to prevent abuse, make ingestion IDs different from real/public/read-only IDs.
  • DBA: Request a misc DB for excimer-ui.
  • perf site: Enable PHP on perf.wm.o.
  • perf site: Use Puppet to provision a JSON config file with db credentials, set ENV in apache for excimer-ui-server discovery.
  • perf site: Check in excimer-ui-server with vendor.
  • mediawiki: Add ExcimerClient to wmf-config/lib.
  • mediawiki: Modify wmf-config/Profiler to permit its use when using WikimediaDebug. Trigger without need for query param via XWD attribute.
  • mediawiki: Merge the footer link patch, especially for local dev where you'd trigger by query param without WikimediaDebug.
  • WikimediaDebug: Add frontend option to browser extension.

Change 888124 had a related patch set uploaded (by Krinkle; author: Tim Starling):

[performance/excimer-ui-server@master] Validate the profile ID hash

https://gerrit.wikimedia.org/r/888124

Change 888124 merged by jenkins-bot:

[performance/excimer-ui-server@master] Validate the profile ID hash

https://gerrit.wikimedia.org/r/888124

Change 868477 abandoned by Tim Starling:

[mediawiki/core@master] [WIP] Excimer profile link footer

Reason:

https://gerrit.wikimedia.org/r/868477

Change 899035 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/WikimediaDebug@master] popup: Create new "output" area with last five xhgui/logstash links

https://gerrit.wikimedia.org/r/899035

Change 902526 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/excimer-ui-client@master] Add NOTICE file to repository

https://gerrit.wikimedia.org/r/902526

Change 902529 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Profiler: Implement "Excimer UI" option for WikimediaDebug

https://gerrit.wikimedia.org/r/902529

Added to the task description:

  • mediawiki: Package php-excimer 1.1.0 or later and deploy it.

We're currently on 1.0.4 which lacks https://gerrit.wikimedia.org/r/c/mediawiki/php/excimer/+/865847 for the Speedscope-compatible output format.

Change 902566 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/WikimediaDebug@master] Add new "Excimer UI" option

https://gerrit.wikimedia.org/r/902566

Change 899035 merged by jenkins-bot:

[performance/WikimediaDebug@master] popup: Create new "output" area with last five xhgui/logstash links

https://gerrit.wikimedia.org/r/899035

@jijiki (Capturing here from last month's Perf:SvcOps meeting)

As part of this goal, we've developed a few small PHP files to add to the performance.wikimedia.org micro site that read or write a single row from a misc mysql database. This is quite similar to what we do with XHGui today at https://performance.wikimedia.org/xghui/, which serves and displays debug profiles from the misc database. XHGui was actually hosted on a separate server with a reverse proxy from perf.wm.o/xhgui/*, hence perf.wm.o is still technically a static HTML site; it has no PHP installed.

I'm asking for help to provision this, or to at least indicate the preferred way we provision it. I.e. Apache with mod_php (simple, matches profile::webperf::xhgui in Puppet), or php-fpm (more complex, but seems to be more common nowadays?).

Change 902526 merged by jenkins-bot:

[performance/excimer-ui-client@master] Add NOTICE file to repository

https://gerrit.wikimedia.org/r/902526

Change 910840 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/excimer-ui-server@master] public_html: Handle deploynent within vendor/

https://gerrit.wikimedia.org/r/910840

Change 910842 had a related patch set uploaded (by Krinkle; author: Krinkle):

[labs/private@master] Define dummy pass for passwords::excimer_ui_server

https://gerrit.wikimedia.org/r/910842

Change 910856 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] webperf: enable libapache2-mod-php7.3 on profile::webperf::site

https://gerrit.wikimedia.org/r/910856

Change 910858 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/excimer-ui-server@master] docs: Document installation and configuration options

https://gerrit.wikimedia.org/r/910858

Change 910863 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/docroot@master] excimer-ui-server: Check in source and dependencies

https://gerrit.wikimedia.org/r/910863

Mentioned in SAL (#wikimedia-cloud) [2023-04-23T16:55:04Z] <Krinkle> Fix profile::tlsproxy::envoy::global_cert_name in Horizon for webperf host to use '%{facts.fqdn}' instead of performance.discovery.wmnet as the latter doesn't resolve / would be an invalid cert for https://deployment-webperf21, ref T291015

Change 910858 merged by jenkins-bot:

[performance/excimer-ui-server@master] docs: Document installation and configuration options

https://gerrit.wikimedia.org/r/910858

Change 912315 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[labs/private@master] hieradata: add secrets for excimer perf site

https://gerrit.wikimedia.org/r/912315

Change 912315 merged by Effie Mouzeli:

[labs/private@master] hieradata: add secrets for excimer perf site

https://gerrit.wikimedia.org/r/912315

Change 910863 merged by jenkins-bot:

[performance/docroot@master] excimer-ui-server: Check in source and dependencies

https://gerrit.wikimedia.org/r/910863

Krinkle changed the task status from Open to Stalled. (May 8 2023, 4:25 PM)

Change 910842 merged by Marostegui:

[labs/private@master] Define dummy pass for passwords::excimer_ui_server

https://gerrit.wikimedia.org/r/910842

Change 910856 merged by Filippo Giunchedi:

[operations/puppet@production] webperf: enable libapache2-mod-php7.4 on profile::webperf::site

https://gerrit.wikimedia.org/r/910856

Change 919419 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] webperf: Expose /excimer/ingest/ to internal requests only

https://gerrit.wikimedia.org/r/919419

Change 919422 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] webperf: Fix excimer_mysql_user typo

https://gerrit.wikimedia.org/r/919422

Change 919422 merged by Filippo Giunchedi:

[operations/puppet@production] webperf: Fix excimer_mysql_user typo

https://gerrit.wikimedia.org/r/919422

Change 919870 had a related patch set uploaded (by Krinkle; author: Krinkle):

[performance/WikimediaDebug@master] popup: Expand limit from 5 to 100, include non-main, add scrollTo

https://gerrit.wikimedia.org/r/919870

capture.png (778×620 px, 139 KB)

Updated design:

Screenshot 2023-05-15 at 17.47.42.png (1×582 px, 176 KB)

Krinkle changed the task status from Stalled to Open. (May 15 2023, 5:01 PM)

Change 902529 merged by jenkins-bot:

[operations/mediawiki-config@master] Profiler: Implement "Excimer UI" option for WikimediaDebug

https://gerrit.wikimedia.org/r/902529

Change 902566 merged by jenkins-bot:

[performance/WikimediaDebug@master] Add new "Excimer UI" option

https://gerrit.wikimedia.org/r/902566

@Krinkle Can you please deploy this if possible? It is blocking further config deployments.

It was already deployed last night shortly after the above merge (5 hours before the above comment). It is also applied to the deployment host already, however I had not run the (no-op) git-fetch command to update the git remote origin pointer. It turns out that scap backport is sensitive to this, and I hadn't realised this. I'll keep that in mind in the future. Sorry about that.

Change 919870 merged by jenkins-bot:

[performance/WikimediaDebug@master] popup: Expand limit from 5 to 100, include non-main, add scrollTo

https://gerrit.wikimedia.org/r/919870

The above is now released as WikimediaDebug 2.7.0 and was published to the Firefox and Chrome stores last Friday.

It has in the meantime been approved by both stores with browsers auto-updating existing installs over the next day or two.

Also done are the docs on Wikitech wiki: https://wikitech.wikimedia.org/wiki/WikimediaDebug#Request_profiling

Remaining:

  • Announce to Wikitech-l.

Change 919419 merged by Filippo Giunchedi:

[operations/puppet@production] webperf: Fix /excimer/ POST restriction

https://gerrit.wikimedia.org/r/919419