Page MenuHomePhabricator

Percentage of time in MW inline profiler is broken
Closed, ResolvedPublic

Description

This affects the WikimediaDebug "Inline profile" feature, which is backed by MediaWiki's ProfilerOutputText and ProfilerXhprof classes.

Example from https://en.wikipedia.org/w/load.php

1152600.00% 11.526     10 - Wikimedia\Services\ServiceContainer::getService
1151300.00% 11.513      4 - Wikimedia\Services\ServiceContainer::createService
1146100.00% 11.461      4 - Wikimedia\Services\ServiceContainer::{closure}
1046500.00% 10.465      1 - MediaWiki\MediaWikiServices::getResourceLoader
688600.00% 6.886    131 - ResourceLoader::register
594700.00% 5.947      1 - ResourceLoader::respond

Note how the percentage of main() is not 100% as it should be. They all seem to just be inflated copies of the millisecond measure.

Upstream: https://github.com/tideways/php-xhprof-extension/issues/103

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle triaged this task as Medium priority.Apr 6 2020, 8:19 PM

To reproduce this in production on mwdebug from the command-line (see T253547 for details, some of its workarounds are no longer needed, but the command is still fairly complex).

mwdebug1001$ echo 'wfTimestampNow(); Profiler::instance()->logDataPageOutputOnly();' | sudo -u www-data php -dauto_prepend_file=/srv/mediawiki/wmf-config/PhpAutoPrepend.php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki testwiki --profiler text
9839800.00% 98.398    164 - AutoLoader::autoload
5165400.00% 51.654      1 - ExtensionRegistry::loadFromQueue
3899100.00% 38.991     22 - MediaWiki\MediaWikiServices::getInstance
(…)
107000.00% 1.070      1 - Wikibase\Client\Hooks\EchoSetupHookHandlers::onBeforeCreateEchoEvent
104200.00% 1.042      1 - Wikimedia\Rdbms\LBFactory::getChronologyProtector
103400.00% 1.034      7 - Composer\Semver\VersionParser::parseConstraints

It seems that %real is somehow assigned realMs * 100 instead of realMs / totalMs * 100. So maybe there's a null ~0 or true ~1 somewhere in the mix causing trouble.

Going top down, starting at ProfilerOutputText.php to understand the following statement:

ProfilerOutputText.php
   $out .= sprintf( "%6.2f%% %3.3f %6d - %s\n",
       $item['%real'], $item['real'], $item['calls'], $item['name'] );
+  $out .= json_encode( $item ) . "\n";
10417000.00% 104.170    164 - AutoLoader::autoload
{"name":"AutoLoader::autoload","calls":164,"real":104.17000000000003,"%real":10417000.000000004,"cpu":103.94999999999987,"%cpu":5197499.9999999935,"memory":11659431.999999993,"%memory":9108931.249999994,"min_real":0.004,"max_real":12.0235}

5780300.00% 57.803      1 - ExtensionRegistry::loadFromQueue
{"name":"ExtensionRegistry::loadFromQueue","calls":1,"real":57.803,"%real":5780300,"cpu":57.753,"%cpu":2887650,"memory":6601120,"%memory":5157125,"min_real":57.803,"max_real":57.803}

(…)

This rules out any obvious errors with the variables here or the sprintf formatting. Next level down is ProfilerXhprof::getFunctionStats, as called in Profiler.php to feed ProfilerOutputText.

ProfilerXhprof.php
  public function getFunctionStats() {
      // …
      foreach ( $metrics as $fname => $stats ) {
+         echo __METHOD__ . sprintf(' # %s : wt_total = %s, wt_percent = %s', $fname, $stats['wt']['total'], $stats['wt']['percent'], $main['real'] ) . "\n";
          // …
              'real' => $stats['wt']['total'] / 1000,
              '%real' => $stats['wt']['percent'],
ProfilerXhprof::getFunctionStats # AutoLoader::autoload : wt_total = 95093, wt_percent = 9509300
ProfilerXhprof::getFunctionStats # ExtensionRegistry::readFromQueue : wt_total = 29573, wt_percent = 2957300
ProfilerXhprof::getFunctionStats # main() : wt_total = 1, wt_percent = 100
ProfilerXhprof::getFunctionStats # BounceHandlerHooks::onBeforeCreateEchoEvent : wt_total = 1, wt_percent = 100
ProfilerXhprof::getFunctionStats # FauxRequest::getRawIP : wt_total = 1, wt_percent = 100
(…)
ProfilerXhprof::getFunctionStats # $main = array(
    [name] => main()
    [calls] => 1
    [real] => 0.002
    [%real] => 100
    [cpu] => 0.001
    [%cpu] => 100
    [memory] => 128
    [%memory] => 100
    [min_real] => 0.002
    [max_real] => 0.002
)

Next level down is XhprofData::getCompleteMetrics:

XhprofData.php
+     echo __METHOD__ . " # MAIN = " . json_encode($main) . "\n";
      foreach ( $this->hieraData as $key => $stats ) {
+         echo __METHOD__ . " # $key " . json_encode($stats) . "\n";
          list( $parent, $child ) = self::splitKey( $key );
XhprofData::getInclusiveMetrics # MAIN = {"ct":1,"wt":1,"cpu":1,"mu":128,"pmu":0}

XhprofData::getInclusiveMetrics # ExtensionRegistry::loadFromQueue {"ct":1,"wt":57785,"cpu":57786,"mu":6601120,"pmu":7783528}

XhprofData::getInclusiveMetrics # main() {"ct":1,"wt":1,"cpu":1,"mu":128,"pmu":0}

XhprofData::getInclusiveMetrics # MediaWiki\MediaWikiServices::getInstance {"ct":3,"wt":43401,"cpu":43401,"mu":3836808,"pmu":3315488}
XhprofData::getInclusiveMetrics # ConfigFactory::getDefaultInstance==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":2,"cpu":2,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # DeferredUpdates::doUpdates==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":1,"cpu":1,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # Hooks::run@1==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":1,"cpu":1,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # Hooks::run==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":2,"cpu":3,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # Hooks::runner==>MediaWiki\MediaWikiServices::getInstance {"ct":2,"wt":5,"cpu":5,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # Maintenance::setAgentAndTriggers==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":2,"cpu":1,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # Maintenance::setLBFactoryTriggers==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":0,"cpu":1,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # Maintenance::shutdown==>MediaWiki\MediaWikiServices::getInstance {"ct":2,"wt":2,"cpu":3,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # MediaWiki\Languages\LanguageFactory::getLanguage==>MediaWiki\MediaWikiServices::getInstance {"ct":2,"wt":3,"cpu":3,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # MediaWiki\Session\SessionManager::__construct==>MediaWiki\MediaWikiServices::getInstance {"ct":2,"wt":2,"cpu":3,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # ObjectCache::newFromParams==>MediaWiki\MediaWikiServices::getInstance {"ct":2,"wt":2,"cpu":3,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # OutputPage::getHookContainer==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":2,"cpu":3,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # RequestContext::getConfig==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":2,"cpu":2,"mu":128,"pmu":0}
XhprofData::getInclusiveMetrics # wfLogProfilingData==>MediaWiki\MediaWikiServices::getInstance {"ct":1,"wt":1,"cpu":2,"mu":128,"pmu":0}
5778500.00% 57.785      1 - ExtensionRegistry::loadFromQueue
4342800.00% 43.428     22 - MediaWiki\MediaWikiServices::getInstance
(…)

So it looks like actually the percentage calculation might not be at fault, and indeed possibly none of our profiling code. Rather it seems the main() entry provided by tideways-xhprof is faulty, as it just reports wall-time and cpu time as a constant 1 millisecond, whereas it is meant to be the inclusive total.

main() {"ct":1,"wt":1,"cpu":1,"mu":128,"pmu":0}

Okay, a couple more goose chases later:

Is everything broken? Post-send only? No.

The strange thing is, that XHGui consumes the exact same data with (as far as I can see) no processing step of any kind. It directly reads out $data['main()']['wt'] and finds a reasonable total value – not a constant 1. The only difference between MediaWiki Profiler reading out the data, and WMF's XHGui client in wmf-config reading out the data is timing. MediaWiki naturally reads it out shortly before closing the web response or CLI standard output. Whereas for XHGui we wait until post-send request shutdown. Could it be that Tideways (unlike php5-xhprof/hhvm-xhprof) somehow doesn't populate main() request shutdown? Thus making it not possible to freely enable() and disable() at will? That would explain why it is broken in this way for web responses and CLI processes that seek to provide the profile to the end-user, in constrast to WikimediaDebug/XHGui which only need it post-send to send to XHGui database.

But.. this isn't supported by evidence. Opening php -a and just enabling and disabling tideways_xhprof with the same flags as production, yields a perfectly fine profile with a sensible main() entry. E.g.:

php -a
function foo() { sleep(1); }
function bar() { foo(); foo(); }

$xhprofFlags = TIDEWAYS_XHPROF_FLAGS_CPU | TIDEWAYS_XHPROF_FLAGS_MEMORY | TIDEWAYS_XHPROF_FLAGS_NO_BUILTINS;
tideways_xhprof_enable($xhprofFlags);
sleep(1);
tideways_xhprof_enable($xhprofFlags); bar(); var_dump(tideways_xhprof_disable());

//>
array(3) {
  ["main()==>bar"]=> array(5) {
    ["ct"]=> int(1)
    ["wt"]=> int(2000258)
    ["cpu"]=> int(96)
    ["mu"]=> int(352)
    ["pmu"]=> int(0)
  }
  ["main()"]=> array(5) {
    ["ct"]=> int(1)
    ["wt"]=> int(2971475)
    ["cpu"]=> int(1597)
    ["mu"]=> int(656)
    ["pmu"]=> int(0)
  }
  ["bar==>foo"]=> array(5) {
    ["ct"]=> int(2)
    ["wt"]=> int(2000227)
    ["cpu"]=> int(77)
    ["mu"]=> int(128)
    ["pmu"]=> int(0)
Enabling twice? No.

We do something a little funny in wmf-config, which is that we call tideways_xhprof_enable but then shortly later in the early setup phase of MediaWiki core, ProfilerXhprof also calls this once more. As far as I know, this is innocent and gets us around the issue of separating concerns between core and WMF, whilst also getting to capture the early multiversion setup phases prior to the profiler being loaded etc.

I ruled this out by changing this, but getting the same result:

- tideways_xhprof_enable($xhprofFlags); bar(); var_dump(tideways_xhprof_disable());
+ tideways_xhprof_enable($xhprofFlags); sleep(1); tideways_xhprof_enable($xhprofFlags); bar(); var_dump(tideways_xhprof_disable());

So, enabling twice seems fine with Tideways, just as it was with php5-xhprof and hhvm-xhprof.

Then what?

I decided to do away with the Profiler::instance()->logDataPageOutputOnly(); call. While the previous investigations have rule out virtually all that it does as a cause, maybe just being there at all is somehow messing with Tideways. E.g. the fact that it is calling _disable() while there is a call stack, whereas it started in main() in profiler.php.

mwdebug1001$ echo 'wfTimestampNow(); json_encode(tideways_xhprof_disable(), JSON_PRETTY_PRINT);' | sudo -u www-data php -dauto_prepend_file=/srv/mediawiki/wmf-config/PhpAutoPrepend.php /srv/mediawiki/multiversion/MWScript.php eval.php --wiki testwiki --profiler text
()
    "main()": {
        "ct": 1,
        "wt": 1,
        "cpu": 1,
        "mu": 128,
        "pmu": 0
    },
()

But alas, that didn't seem to matter. What about the enabling part? The "Enabling twice" thing was still nagging at me, even though it seemed to be a no-op, I only tested that from the top-level. Our redundant/second enablement is nested, just like the disablement we just ruled out.

Come on!

OK. Let's brinb back the "Enable twice" scenario, but this time put the second enablement in another function:

function foo() { sleep(0.1); }
function bar() { foo(); foo(); }

function reallyDoEnable() { $xhprofFlags = TIDEWAYS_XHPROF_FLAGS_CPU | TIDEWAYS_XHPROF_FLAGS_MEMORY | TIDEWAYS_XHPROF_FLAGS_NO_BUILTINS; tideways_xhprof_enable($xhprofFlags); }
function doEnable() { sleep(0.3); reallyDoEnable(); }
$xhprofFlags = TIDEWAYS_XHPROF_FLAGS_CPU | TIDEWAYS_XHPROF_FLAGS_MEMORY | TIDEWAYS_XHPROF_FLAGS_NO_BUILTINS;
Second top-level enable
tideways_xhprof_enable($xhprofFlags); sleep(0.3); bar(); echo json_encode(tideways_xhprof_disable(),JSON_PRETTY_PRINT);
{
    "main()==>bar": {
        "ct": 1,
        "wt": 154, # GOOD, working fine
        "cpu": 36, # GOOD
        "mu": 352,
        "pmu": 0
    },
    "main()": {
        "ct": 1,
        "wt": 251,
        "cpu": 64,
        "mu": 480,
        "pmu": 0
    },
    "bar==>foo": {
        "ct": 2,
        "wt": 146,
        "cpu": 29,
        "mu": 128,
        "pmu": 0
    }
}
Second nested enable
tideways_xhprof_enable($xhprofFlags); doEnable(); bar(); echo json_encode(tideways_xhprof_disable(),JSON_PRETTY_PRINT);
{
    "main()": {
        "ct": 1,
        "wt": 3,  # BUG! There you are!
        "cpu": 4, # BUG
        "mu": 128,
        "pmu": 0
    },
    "bar": {
        "ct": 1,
        "wt": 163,
        "cpu": 38,
        "mu": 352,
        "pmu": 0
    },
    "bar==>foo": {
        "ct": 2,
        "wt": 157,
        "cpu": 32,
        "mu": 128,
        "pmu": 0
    }
}

And yes, I've also confirmed that if there is only one enablement and that that one is nested, it all works fine. It's specific to there being a second enablement that is nested.

  1. OK: Top-level enable with top-level disable
  2. OK: Top-level enable with nested disable
  3. OK: Nested enable with nested disable
  4. OK: Multiple top-level enable with top-level disable
  5. OK: Multiple top-level enable with nested disable
  6. CORRUPT: Top-level and nested enable with top-level disable
  7. CORRUPT: Top-level and nested enable with nested disable

🎉

Krinkle edited projects, added Upstream; removed patch-welcome, Regression.
Krinkle moved this task from Backlog to Reported Upstream on the Upstream board.

Change 719160 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] profiler: Document ProfilerXhprof class options

https://gerrit.wikimedia.org/r/719160

Change 719161 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] profiler: Add 'running' option to ProfilerXhprof

https://gerrit.wikimedia.org/r/719161

Change 719160 merged by jenkins-bot:

[mediawiki/core@master] profiler: Document ProfilerXhprof class options

https://gerrit.wikimedia.org/r/719160

Change 719161 merged by jenkins-bot:

[mediawiki/core@master] profiler: Add 'running' option to ProfilerXhprof

https://gerrit.wikimedia.org/r/719161

Change 721417 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] profiler: Fixup broken logic for the new 'running' option

https://gerrit.wikimedia.org/r/721417

Change 721417 merged by jenkins-bot:

[mediawiki/core@master] profiler: Fixup broken logic for the new 'running' option

https://gerrit.wikimedia.org/r/721417

Change 723338 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] profiler: Use the ProfilerXhprof 'running' option when profiling

https://gerrit.wikimedia.org/r/723338

Change 723338 merged by jenkins-bot:

[operations/mediawiki-config@master] profiler: Use the ProfilerXhprof 'running' option when profiling

https://gerrit.wikimedia.org/r/723338

Change 723348 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@REL1_37] profiler: Fixup broken logic for the new 'running' option

https://gerrit.wikimedia.org/r/723348

Change 723348 merged by jenkins-bot:

[mediawiki/core@REL1_37] profiler: Fixup broken logic for the new 'running' option

https://gerrit.wikimedia.org/r/723348