
Debug memory leak in maintenance script
Closed, Resolved · Public

Authored by matmarex · Nov 21 2023, 3:38 PM

Description

Following up from T315510#9314495:

Event Timeline

Change 993020 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] DeprecationHelper: Don't copy the info into every instance

https://gerrit.wikimedia.org/r/993020

I set up php-memory-profiler and instrumented the code as follows:

diff --git a/maintenance/persistRevisionThreadItems.php b/maintenance/persistRevisionThreadItems.php
index 61873715..0a59ca03 100644
--- a/maintenance/persistRevisionThreadItems.php
+++ b/maintenance/persistRevisionThreadItems.php
@@ -190,6 +190,10 @@ class PersistRevisionThreadItems extends Maintenance {
 		$duration = microtime( true ) - $start;
 		$durationFormatted = $this->lang->formatTimePeriod( $duration );
 		$this->output( "Finished in $durationFormatted\n" );
+
+		gc_collect_cycles();
+		memprof_dump_callgrind( fopen( "a.callgrind", "w" ) );
+		file_put_contents( "a.json", json_encode( memprof_dump_array() ) );
 	}
 
 	/**
@@ -208,6 +212,10 @@ class PersistRevisionThreadItems extends Maintenance {
 
 				// Store permalink data (even when store is disabled - T334258)
 				$changed = $this->itemStore->insertThreadItems( $rev, $threadItemSet );
+
+				if ( !memprof_enabled() ) {
+					memprof_enable();
+				}
 			}
 		} catch ( Throwable $e ) {
 			$this->output( "Error while processing revid=$row->rev_id, pageid=$row->rev_page\n" );

This should output information about which functions allocated the memory that is still live when the script finishes, counting only allocations made after the 2nd iteration of the script (which lets us skip most of the data used by various initialization code).

I got this:

Then I tried to make sense of the results in QCacheGrind.

a.png (923×2 px, 438 KB)

I also wrote myself a script to summarize the JSON output.

{
  "AutoLoader::autoload": 2803248,
  "MediaWiki\\Title\\Title::deprecatePublicProperty": 2353008,
  "MapCacheLRU::set": 2174224,
  "Composer\\Autoload\\ClassLoader::loadClass": 1544156,
  "Wikimedia\\Parsoid\\Utils\\PHPUtils::jsonDecode": 1405208,
  "LinkCache::addBadLinkObj": 956920,
  "Exception::getTraceAsString": 950272,
  "MediaWiki\\Extension\\QuickInstantCommons\\Repo::fetchImageQuery": 509896,
  "MediaWiki\\Title\\Title::newFromTextThrow": 443520,
  "MediaWiki\\Title\\Title::deprecatePublicPropertyFallback": 392168,
  "mysqli_result::fetch_object": 366264,
  "Preprocessor_Hash::buildDomTreeArrayFromText": 213560,
  "LocalisationCache::readPHPFile": 182288,
  "MediaWiki\\Extension\\Scribunto\\Engines\\LuaStandalone\\LuaStandaloneInterpreterFunction::incrementRefCount": 161688,
  "SqlBagOStuff::unserialize": 160080,
  "Preprocessor_Hash::preprocessToObj": 160064,
  "LinkCache::addGoodLinkObjFromRow": 147768,
  "array_flip": 132848,
  "MediaWiki\\Title\\MediaWikiTitleCodec::getPrefixedDBkey": 125584,
  "Wikimedia\\Parsoid\\Wt2Html\\XMLSerializer::serialize": 116328,
  "MediaWiki\\Extension\\CLDR\\LanguageNames::loadLanguage": 110496,
  "MediaWiki\\Title\\Title::secureAndSplit": 85768,
  "file_get_contents": 83680,
  "MediaWiki\\Extension\\ParserFunctions\\ParserFunctions::time": 78096,
  "Wikimedia\\CSS\\Grammar\\FunctionMatcher::__construct": 71688,
  "MediaWiki\\SyntaxHighlight\\Pygmentize::getLexers": 70936,
  "require /var/www/html/w/extensions/TimedMediaHandler/includes/TimedTextPage.php": 69704,
  "MapCacheLRU::setField": 67536,
  "MediaWiki\\Title\\MediaWikiTitleCodec::formatTitle": 66824,
  "DOMElement::getAttribute": 62688
}
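
For reference, a minimal sketch of what such a summarizer might look like (my reconstruction, not the actual script; it assumes memprof_dump_array() produces a call tree whose nodes have 'memory_size' and 'called_functions' keys, which is an assumption on my part about the dump format):

<?php
// Hypothetical summarizer: sum the bytes still allocated per function from the
// a.json dump produced by memprof_dump_array(), and print the top entries.
$tree = json_decode( file_get_contents( $argv[1] ?? 'a.json' ), true );
$totals = [];

function walk( array $node, string $name, array &$totals ): void {
	// 'memory_size' is the memory attributed directly to this function;
	// 'called_functions' holds the child nodes, keyed by function name.
	$totals[$name] = ( $totals[$name] ?? 0 ) + ( $node['memory_size'] ?? 0 );
	foreach ( $node['called_functions'] ?? [] as $childName => $childNode ) {
		walk( $childNode, $childName, $totals );
	}
}

walk( $tree, 'root', $totals );
arsort( $totals );
echo json_encode( array_slice( $totals, 0, 30, true ), JSON_PRETTY_PRINT ), "\n";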

I am not entirely sure what I'm looking at, to be honest. I think I understood some things, though:

  • There are lots of little bits and pieces loaded by AutoLoader and Composer, presumably for lazy-loaded classes that were needed only to parse some pages.
  • LinkCache and MapCacheLRU are expected, they're used for various caches of large but limited size.
  • LocalisationCache and LanguageNames are also expected, but I wonder if these could be a part of the problem. Language data can be big, and I don't think it's ever dropped…
  • It's surprising to see MediaWiki\Title\Title::deprecatePublicProperty: apparently the info about deprecations is copied into each Title object. I don't think it leaks, though, it's just associated with live objects, so fixing it won't fix this problem. Still worth patching (a sketch of the idea follows this list): https://gerrit.wikimedia.org/r/c/mediawiki/core/+/993020
  • It's also surprising to see various Parsoid classes, and also Exception::getTraceAsString near the top. Also various extensions called by the Parser. I haven't figured out yet why this data is hanging around.
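
The DeprecationHelper patch mentioned above is in Gerrit; the following is only an illustrative sketch of the idea (the class, property name, and version here are made up), showing how the deprecation metadata can be stored once per class instead of being copied into every instance:

<?php
// Illustrative sketch only, not the actual patch: keep deprecated-property
// info in a static (per-class) array rather than in every object.
trait DeprecationHelperSketch {
	/** @var array[] Deprecation info, shared per class rather than per object */
	private static $deprecatedPublicProperties = [];

	public function deprecatePublicProperty( string $property, string $version ): void {
		// Before: $this->deprecatedPublicProperties[$property] = [ $version ];
		// which copied these strings into every single object.
		self::$deprecatedPublicProperties[$property] = [ $version ];
	}
}

class TitleLike {
	use DeprecationHelperSketch;

	public function __construct() {
		// Called for every instance, but the data is now stored only once.
		$this->deprecatePublicProperty( 'mTextform', '1.37' );
	}
}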

I repeated the test after that patch, with expected results – MediaWiki\Title\Title::deprecatePublicProperty no longer features prominently:

b.png (923×2 px, 460 KB)

To be continued, I guess…

So basically there are three problems with these visualizations:

  • The callgrind format doesn't track full stack traces, only caller–callee pairs. Besides the obvious problem (when both C and D call B, and B in turn calls A, you can't tell how much of A's memory came through C and how much through D), it also means it has all kinds of problems visualizing recursive functions (e.g. sometimes the numbers come out above 100%). QCacheGrind has an option to "detect cycles" to improve this, but it doesn't work at all here (I disabled it in the visualizations above), because…
  • Autoloader calls are also tracked, which would otherwise be nice (it's actually interesting to learn how much memory is taken up by static data in various classes loaded on demand, but for this investigation I don't care), but it interacts particularly poorly with cycle detection – it usually finds that spl_autoload_call() calls into spl_autoload_call(), and collapses huge chunks of the visualization into spurious cycles.
  • There is a lot of space taken up by large but well-behaved caches, which limit their size and thus can't leak infinite memory. One of these caches is ParserOutputAccess::localCache (I happen to remember it from another memory leak investigation, where I added the limit), which caches ParserOutput objects, which are composed from lots of values generated all over the place – that's a big reason why various bits of Parser and Parsoid show up in the visualization.

I didn't quite solve them, but I think I solved 95% of them, and I feel like now we're getting somewhere:

  • php-memory-profiler can also output in the pprof format, which tracks full stack traces and can in turn be converted to the format used by (everyone's favorite) FlameGraph.
  • We can just load all of the classes ahead of time, so that they won't be loaded while we're profiling the part we actually want to look at. (Well, mostly. I had to hack around several classes that are registered in the autoloader by some package, but aren't used by MediaWiki, and crash when loaded. There are also a few small things remaining which are somehow still loaded via the autoloader, but I couldn't figure out how to load them manually.)
  • Almost all of the well-behaved caches use MapCacheLRU, so we can add a debug method in just one place that lets us clear all of their contents before capturing the data (a sketch of the idea follows this list).
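
A minimal sketch of what such a debug hook could look like (my illustration of the idea, not the actual hack; it assumes PHP 7.4+ for WeakReference, and the clearAllForDebugging() name is made up):

<?php
// Hypothetical debug hook: remember every MapCacheLRU instance so that all of
// the well-behaved LRU caches can be emptied right before the memprof dump.
class MapCacheLRU {
	/** @var array Cached entries (stand-in for the real implementation) */
	private $cache = [];
	/** @var WeakReference[] Every instance ever constructed (debug only) */
	private static $instances = [];

	public function __construct( int $maxKeys ) {
		// ... the real constructor also records $maxKeys etc. ...
		self::$instances[] = WeakReference::create( $this );
	}

	public function set( $key, $value ): void {
		$this->cache[$key] = $value; // the real code also enforces the size limit
	}

	public function clear(): void {
		$this->cache = [];
	}

	/** Debug helper: clear the contents of every live MapCacheLRU. */
	public static function clearAllForDebugging(): void {
		foreach ( self::$instances as $ref ) {
			$cache = $ref->get();
			if ( $cache !== null ) {
				$cache->clear();
			}
		}
	}
}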

I now have these instrumentation hacks:
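
(The actual hacks aren't attached here. The dump step presumably ends up looking something like this reconstruction, switching from the callgrind dump in the earlier diff to php-memory-profiler's pprof output and clearing the caches first; the clearAllForDebugging() call refers to the hypothetical sketch above:)

// Reconstruction, not the actual patch: the dump step at the end of execute().
MapCacheLRU::clearAllForDebugging(); // hypothetical debug hook, see above
gc_collect_cycles();
memprof_dump_pprof( fopen( "a.pprof", "w" ) );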

And these results (a flamegraph and pprof's own visualization):


pprof-flamegraph.png (2×1 px, 454 KB) pprof.png (4×1 px, 428 KB)

(note that this is a flamegraph of memory size – not of time, as is usual)

Observations (left-to-right on the flamegraph):

  • MediaWiki\Parser\ParserObserver::notifyParse is leaking data generated through Exception::getTraceAsString. It tries to detect duplicate parses, but it keeps metadata about every parse it sees forever. That cache should be limited. In this test, it grew to about 920 KB when processing ~1600 pages in ~1 minute, so it could plausibly grow to gigabytes when processing all of Wikipedia! (this comes out to a rate of about 1.3 GB per day, which is in the right order of magnitude, compared to what we've seen when running the script in production: T315510#9314964)
    • [Side note: the other issue with running the script in production: T315510#9314495, which was a much faster leak, is not explained here, but I think that was T353874 or another similar bug in Parsoid, affecting specific pages]
  • There is a similar but very small leak in MediumSpecificBagOStuff::trackDuplicateKeys; it should be bounded by the number of possible keys, so it doesn't seem worth fixing.
  • Parsoid permanently keeps various config data, e.g. in MediaWiki\Parser\Parsoid\Config\PageConfigFactory and MediaWiki\Parser\Parsoid\Config\SiteConfig. This is okay (just noting it since it's big on the visualization).
  • Big suspicious things are going on in Wikimedia\Parsoid\Wt2Html\TT\ExtensionHandler::onTag and Wikimedia\Parsoid\Wt2Html\TT\TemplateHandler::onTag, with very deep call stacks ending in Scribunto, TemplateStyles, SyntaxHighlight, and RemexHtml. I don't know if they're leaks, well-behaved caches, one-time initialization, or some measurement artifacts.

I think I'm going to stop here, and pin the blame on MediaWiki\Parser\ParserObserver::notifyParse, until something proves otherwise.

> Big suspicious things are going on (…) I don't know if they're leaks, well-behaved caches, one-time initialization, or some measurement artifacts.

I just realized I could easily check by making the script run in a loop over the same pages and comparing the shape of the flamegraphs. That won't catch all leaks (e.g. if something leaks per page title or per revision), but it will catch many (if something leaks per parse or per call). Here's the result when running it 5 times:


pprof-flamegraph.png (2×1 px, 463 KB)

Bad news is that MediaWiki\Extension\Scribunto\Engines\LuaStandalone\LuaStandaloneInterpreterFunction::incrementRefCount seems to actually be leaking :( Good news is that we don't use it in production! (we use LuaSandbox, which I can't be bothered to try to set up).

The rest of the suspicious things don't seem to grow, so that's nice.

Change 993206 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] ParserObserver: Limit the size of cache of previous parse traces

https://gerrit.wikimedia.org/r/993206
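
The gist of that change, as a sketch (the real field names and method signature in ParserObserver differ; this just illustrates bounding the trace cache with MapCacheLRU instead of a plain array that grows forever):

<?php
// Illustrative sketch only, not the actual patch: bound the duplicate-parse tracking.
class ParserObserver {
	/** Illustrative limit on how many previous parse traces to remember */
	private const MAX_TRACES = 100;

	/** @var MapCacheLRU Traces of recent parses, keyed by title/revision/options */
	private $previousParseStackTraces;

	public function __construct() {
		// Before: $this->previousParseStackTraces = [];
		$this->previousParseStackTraces = new MapCacheLRU( self::MAX_TRACES );
	}

	public function notifyParse( string $cacheKey, string $trace ): void {
		if ( $this->previousParseStackTraces->has( $cacheKey ) ) {
			// Duplicate parse detected: log it along with the previous trace.
			// ... logging elided ...
		}
		// Before: $this->previousParseStackTraces[$cacheKey] = $trace;
		// which kept a trace for every parse ever seen, forever.
		$this->previousParseStackTraces->set( $cacheKey, $trace );
	}
}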

After this patch:


pprof-flamegraph.png (2×1 px, 496 KB)

Change 993020 merged by jenkins-bot:

[mediawiki/core@master] DeprecationHelper: Don't copy the info into every instance

https://gerrit.wikimedia.org/r/993020

Change 993206 merged by jenkins-bot:

[mediawiki/core@master] ParserObserver: Limit the size of cache of previous parse traces

https://gerrit.wikimedia.org/r/993206

Change 998384 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@wmf/1.42.0-wmf.16] ParserObserver: Limit the size of cache of previous parse traces

https://gerrit.wikimedia.org/r/998384

Change 998385 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@wmf/1.42.0-wmf.17] ParserObserver: Limit the size of cache of previous parse traces

https://gerrit.wikimedia.org/r/998385

Change 998384 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.16] ParserObserver: Limit the size of cache of previous parse traces

https://gerrit.wikimedia.org/r/998384

Change 998385 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.17] ParserObserver: Limit the size of cache of previous parse traces

https://gerrit.wikimedia.org/r/998385

Mentioned in SAL (#wikimedia-operations) [2024-02-07T14:50:45Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-07T14:52:14Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and matmarex: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-07T14:58:54Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:998384|ParserObserver: Limit the size of cache of previous parse traces (T351732)]], [[gerrit:998385|ParserObserver: Limit the size of cache of previous parse traces (T351732)]] (duration: 08m 08s)

FWIW, I think the scripts are still leaking some memory, though only slowly. persistRevisionThreadItems for enwiki was somewhere around 2G of memory yesterday evening, as far as I remember; now it’s just over 4G. (Or rather, between 4005M and 4007M as reported by htop; that might be slightly below “4G” depending on which units you assume, power-of-two or power-of-ten. Doesn’t matter.) The ukwiki run will probably finish long before it runs into problems (it’s only at 993M so far), but enwiki might need a handful of restarts over the next month.