Page MenuHomePhabricator

/usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam
Closed, ResolvedPublic

Description

This has started happening today at Date: Thu, 29 Jun 2017 01:15:36 +0000. Looks like malformed input, possibly pulled from redis? cc @Krinkle

Ignored 1 lines with invalid format
Use of uninitialized value $stack in pattern match (m//) at /usr/local/bin/flamegraph.pl line 499, <> line 8281.
Use of uninitialized value $stack in split at /usr/local/bin/flamegraph.pl line 504, <> line 8281.
Use of uninitialized value $samples in concatenation (.) or string at /usr/local/bin/flamegraph.pl line 504, <> line 8281.
Ignored 1 lines with invalid format
Ignored 1 lines with invalid format
Use of uninitialized value $stack in pattern match (m//) at /usr/local/bin/flamegraph.pl line 499, <> line 1539.
Use of uninitialized value $stack in split at /usr/local/bin/flamegraph.pl line 504, <> line 1539.
Use of uninitialized value $samples in concatenation (.) or string at /usr/local/bin/flamegraph.pl line 504, <> line 1539.
Ignored 1 lines with invalid format

Event Timeline

This must've been triggered by https://gerrit.wikimedia.org/r/#/c/362114/ which is the only change Xenon has seen since several months.
rOPUP6dfacc0111c9: xenon: Increase hourly retention from 1 to 14 days

However, this was just a simple config change.

Indeed seems unlikely that's the root cause, the spam stopped at Date: Fri, 30 Jun 2017 00:01:31 +0000 btw so I suspect one/multiple bad records yesterday but not today. While looking into it I noticed our version of flamegraph.pl is behind on https://github.com/brendangregg/FlameGraph/blob/master/flamegraph.pl but the diff didn't show anything significant for this bug (P5655)

Krinkle lowered the priority of this task from Medium to Low.Aug 8 2017, 2:56 AM

@Gilles yeah looks like the spam does recur every now and then

2017-09-12-130901_906x414_scrot.png (414×906 px, 80 KB)

Change 377451 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] [WIP] Avoid warnings for invalid lines in reverse-stack mode

https://gerrit.wikimedia.org/r/377451

aaron removed aaron as the assignee of this task.Feb 14 2018, 9:48 PM
aaron subscribed.

Change 377451 abandoned by Aaron Schulz:
Avoid perl warnings for invalid lines in reverse-stack mode

https://gerrit.wikimedia.org/r/377451

Let's try updating to the latest upstream version, and see if we still get these errors, and if so, I'll file an upstream report.

I don't think it's flamegraph.pl's fault, the issue is with the last line of the log file, which seems to be truncated in the repro example I found:

index.php;{GET};MediaWiki::run;MediaWiki::main;MediaWiki::performRequest;SpecialPageFactory::executePath;SpecialPage::run;SpecialCollection::execute;SpecialCollection::renderCollection;CollectionRenderingAPI::render;CollectionRenderingAPI::doRender;MWServeRenderingAPI::

Looking at /usr/local/bin/xenon-log, we don't flush when we write to the file (nor do we ever close it...):

with open(file_path, 'a') as f:
    print(message, file=f)

Forcibly flushing should probably help avoid occasionally getting a truncated last line.

Change 426224 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Make xenon-log line-buffered

https://gerrit.wikimedia.org/r/426224

I don't think it's flamegraph.pl's fault, the issue is with the last line of the log file, which seems to be truncated in the repro example I found:

index.php;{GET};MediaWiki::run;MediaWiki::main;MediaWiki::performRequest;SpecialPageFactory::executePath;SpecialPage::run;SpecialCollection::execute;SpecialCollection::renderCollection;CollectionRenderingAPI::render;CollectionRenderingAPI::doRender;MWServeRenderingAPI::

Looking at /usr/local/bin/xenon-log, we don't flush when we write to the file (nor do we ever close it...):

Good catch! Looks like indeed the line is truncated. The file will be closed as soon as the with block is done btw.

Change 426224 merged by Filippo Giunchedi:
[operations/puppet@production] Make xenon-log line-buffered

https://gerrit.wikimedia.org/r/426224

Mentioned in SAL (#wikimedia-operations) [2018-04-17T09:17:06Z] <godog> restart xenon-log on mwlog* - T169249

Merged the change, let's wait and see if truncation reoccurs

How frequent were they as of late, before the change?

Somewhat frequent but not a lot, I don't have exact numbers tho

2018-04-19-142527_863x525_scrot.png (525×863 px, 103 KB)

Is the Apr 18 occurence on that screenshot after the change was deployed?

Looking at the content of the emails @fgiunchedi forwarded to me, the last offense right when the next cron kicks in after the restart and keeps going for the rest of the hour on the hourly "all" log, and for the rest of the day on the daily "all" log.

The proof being that the last cron run of the day (for which the log file should be "complete") hits the error on the very last line of the file as it is now:

From: Cron Daemon <root@mwlog1001.eqiad.wmnet>
Date: Wed, Apr 18, 2018 at 2:02 AM
To: xenon@mwlog1001.eqiad.wmnet


Ignored 1 lines with invalid format
Use of uninitialized value $stack in pattern match (m//) at /usr/local/bin/flamegraph.pl line 499, <> line 183917.
Use of uninitialized value $stack in split at /usr/local/bin/flamegraph.pl line 504, <> line 183917.
Use of uninitialized value $samples in concatenation (.) or string at /usr/local/bin/flamegraph.pl line 504, <> line 183917.
gilles@mwlog1001:/srv/xenon/logs/daily$ cat 2018-04-17.all.log | wc -l
183917

It looks possible that the restart itself caused the old version of xenon-log to cut off a line halfway, and the affected logs kept carrying the issue for as long as they kept going. Which in itself isn't proof one way or another that the attempted fix didn't work.

However, when I pull that file in its current state locally, the flamegraph SVGs generate without error.

Two things don't make sense:

  • if the file is being appended to, why is the failure always the last line? that could suggest a line that hasn't been fully written yet, which was possible with the old log generator
  • if the new version fixes the problem, why did the restart seem to trigger the issue, and why did it "fix itself" eventually, after the last cron run of the day happened

Either this last occurence was some freak event due to the update/restart, or line-buffering isn't enough to ensure no incomplete last line when the file is consumed by the SVG generator.

We can wait to see if it occurs again. However, given that the issue is consistant with the SVG generator attempting to generate SVGs for the ongoing day/hour, maybe we should avoid generating those SVGs until the day is over, or for as long as they're being written to? I'm not sure that they're super useful when they're incomplete anyway.

Change 427665 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Xenon: don’t generate SVGs for recently modified logs

https://gerrit.wikimedia.org/r/427665

Change 427665 abandoned by Gilles:
Xenon: don’t generate SVGs for recently modified logs

https://gerrit.wikimedia.org/r/427665

Hah, turns out I actually didn't ever truly run xenon-generate-svgs locally on that file, when I did it failed just like in production. Which means the files doesn't eventually fix itself, making the above change useless. Now, it's not immediately clear to me what makes that last line broken:

api.php;{POST};ApiMain::execute;ApiMain::executeActionWithErrorHandling;ApiMain::executeAction;ApiParsoidBatch::execute;ApiParsoidBatch::preprocess;StubObject::__call;StubObject::_call;Parser::preprocess;Parser::replaceVariables;PPFrame_Hash::expand;Parser::braceSubstitution;PPFrame_Hash::expand;Parser::braceSubstitution;Parser::callParserFunction;ScribuntoHooks::invokeHook;Scribunto_LuaSandboxEngine::getResourceUsage;Scribunto_LuaEngine::load;Scribunto_LuaEngine::instantiatePHPLibrary;Wikibase\Client\DataAccess\Scribunto\Scribunto_LuaWikibaseLibrary::register;Scribunto_LuaEngine::registerInterface;Scribunto_LuaSandboxInterpreter::registerLibrary;LuaSandbox::registerLibrary 1
This comment was removed by Gilles.

Finally found the offending line, which is just "1" and happened in the middle of the file. It's possible that it was written by the old version of xenon-log The rest of the cron spam makes sense, since the file is permanently tainted.

Since this broken line happens in the middle of the file and doesn't look like a truncated line (the integer is normally the last part of a line), it might be a different bug than the one I saw before with the last line of the log file being truncated.

Looking at xenon-log, it seems to be writing the data as it gets it from redis on the "xenon" pubsub channel as-is. The issue is probably coming from xenon itself. That being said, we could simply filter out such invalid messages in xenon-log.

Change 427816 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Filter out invalid records in xenon-log

https://gerrit.wikimedia.org/r/427816

I don't know if the cited portion was changed unintentionally, but that sample does not show a 1 on its own line. It shows a 1 separated by a single space from a very long line (it just visually renders with wrapping to the next line).

Anyway, assuming a single line with just a number did exist in the output, the xenon-log might not be the best place to filter that because xenon-log is not the process that consumes xenon data. That consumer is in wmf-config/profiler.php. The xenon-log process should ideally be as simple as possible and only write from Redis to a file.

If we do want validation afterwards we could consider working with upstream flamegraphs.pl to reject lines that are not in format "<non-space>+<space><int>+". But, looking at the following line of the xenon consumer:

$redis->publish( 'xenon', "$stack $count" );

I guess $stack can be an empty string, which is invalid and so we shouldn't produce it there.

@Gilles Can you confirm that there are lines in the log consisting of "<space><int>" without the preceding string?

There's no leading space:

gilles@mwlog1001:~$ sed -n '58775,58777p' /srv/xenon/logs/daily/2018-04-17.all.log 
api.php;{GET};ApiMain::execute;ApiMain::executeActionWithErrorHandling;ApiMain::executeAction;ApiMobileView::execute;ApiMobileView::getData;ApiMobileView::getParserOutput;WikiPage::getParserOutput;PoolCounterWork::execute;PoolWorkArticleView::doWork;AbstractContent::getParserOutput;WikitextContent::fillParserOutput;StubObject::__call;StubObject::_call;Parser::parse;Parser::internalParse;Parser::replaceVariables;PPFrame_Hash::expand;Parser::braceSubstitution;PPFrame_Hash::expand;Parser::braceSubstitution;PPFrame_Hash::expand;Parser::braceSubstitution;Parser::callParserFunction;ScribuntoHooks::invokeHook;Scribunto_LuaModule::invoke;Scribunto_LuaEngine::executeFunctionChunk;Scribunto_LuaSandboxInterpreter::callFunction;LuaSandboxFunction::call;Scribunto_LuaSandboxCallback::__call;Scribunto_LuaEngine::expandTemplate;Scribunto_LuaEngine::doCachedExpansion;PPFrame_Hash::expand;Parser::braceSubstitution;Parser::callParserFunction;ExtParserFunctions::ifObj;PPFrame_Hash::expand;Parser::braceSubstitution;Parser::callParserFunction;ExtParserFunctions::ifObj;PPFrame_Hash::expand;Parser::braceSubstitution;PPFrame_Hash::expand;Parser::braceSubstitution;Parser::callParserFunction;ExtParserFunctions::ifObj;PPFrame_Hash::expand;Parser::braceSubstitution;PPFrame_Hash::expand;Parser::braceSubstitution;Parser::callParserFunction;ScribuntoHooks::invokeHook;Scribunto_LuaModule::invoke;Scribunto_LuaEngine::executeFunctionChunk;Scribunto_LuaSandboxInterpreter::callFunction;LuaSandboxFunction::call;Scribunto_LuaSandboxCallback::__call;Wikibase\Client\DataAccess\Scribunto\Scribunto_LuaWikibaseLibrary::getEntity;Wikibase\Client\DataAccess\Scribunto\EntityAccessor::getEntity;Wikibase\Client\Serializer\ClientEntitySerializer::serialize;Serializers\DispatchingSerializer::serialize;Wikibase\DataModel\Serializers\ItemSerializer::serialize;Wikibase\DataModel\Serializers\ItemSerializer::getSerialized;Wikibase\DataModel\Serializers\ItemSerializer::addStatementListToSerialization;Wikibase\DataModel\Serializers\StatementListSerializer::serialize;Wikibase\DataModel\Serializers\StatementListSerializer::getSerialized;Wikibase\DataModel\Serializers\StatementSerializer::serialize;Wikibase\DataModel\Serializers\StatementSerializer::getSerialized;Wikibase\DataModel\Serializers\StatementSerializer::addReferencesToSerialization;Wikibase\DataModel\Serializers\ReferenceListSerializer::serialize;Wikibase\DataModel\Serializers\ReferenceListSerializer::getSerialized;Wikibase\DataModel\Serializers\ReferenceSerializer::serialize;Wikibase\DataModel\Serializers\ReferenceSerializer::getSerialized;Wikibase\DataModel\Serializers\ReferenceSerializer::buildSnakListOrderList 1
1
index.php;{GET};MediaWiki::run;MediaWiki::main;Closure$MediaWiki::main;OutputPage::output;OutputPage::sendCacheControl;OutputPage::getVaryHeader;OutputPage::getCacheVaryCookies;MediaWiki\Session\SessionManager::getVaryCookies 1

With no explanation as to how this can get published to Redis, xenon-log is the right place to filter it, before the data gets passed to flamegraph.pl.

I've confirmed that the PHP Redis client we use will incorrectly remove the leading space:

gilles@terbium:~$ mwscript eval.php --wiki=enwiki
> $redis = new Redis();

> $redis->connect( 'mwlog1001.eqiad.wmnet', 6379, 0.1 );

> $redis->publish( 'test', ' 1');
gilles@mwlog1001:~$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> conn = redis.Redis(host='localhost', port=6379)
>>> pubsub = conn.pubsub()
>>> pubsub.subscribe('test')
>>> for message in pubsub.listen():
...   print message['data']
... 
1

I'm guessing this comes from the fact that in the Redis protocol command operands are space-separated.

So we do have an explanation about where those are coming from and we can filter in profiler.php instead.

Change 427816 abandoned by Gilles:
Filter out invalid records in xenon-log

https://gerrit.wikimedia.org/r/427816

Found the explanation... the "1" I was getting from the python client wasn't the message I sent (in my test above I had sent it before starting the python client, and mistook that first "1" message for the one I had sent), it's something that happens every time the python client connects to the pubsub channel.

So what we saw on the 17th was when the script restarted, it connected and it got an initial message, which isn't a real one and has "1" as the data portion of the message. It's probably simply a confirmation that we've connected correctly to the channel.

As a result, every time xenon-log is restarted, it appends "1" on its own line to the current file, creating borked output.

Here's an example of the python client's behavior on a channel I've never written to:

gilles@mwlog1001:~$ python
Python 2.7.9 (default, Jun 29 2016, 13:08:31) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import redis
>>> conn = redis.Redis(host='localhost', port=6379)
>>> pubsub = conn.pubsub()
>>> pubsub.subscribe('test2')
>>> for message in pubsub.listen():
...   print message['data']
... 
1

The full message:

{'pattern': None, 'type': 'subscribe', 'channel': 'test2', 'data': 1L}

An actual message pushed afterwards:

{'pattern': None, 'type': 'message', 'channel': 'test2', 'data': ' 1'}

Change 427870 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Ignore Redis subscription message in xenon-log

https://gerrit.wikimedia.org/r/427870

JDWHILL set Security to Software security bug.Apr 20 2018, 8:26 AM
JDWHILL added a project: acl*security.
JDWHILL changed the visibility from "Public (No Login Required)" to "Custom Policy".
JDWHILL subscribed.
Gilles changed the visibility from "Custom Policy" to "Public (No Login Required)".Apr 20 2018, 8:29 AM
Gilles removed a project: acl*security.

Change 427870 merged by Filippo Giunchedi:
[operations/puppet@production] Ignore Redis subscription message in xenon-log

https://gerrit.wikimedia.org/r/427870

So far so good, the restart didn't cause any cronspam this time around

Since I'm very confident about the fix, I'm going to close this task, feel free to reopen if the issue reoccurs.

Change 447351 had a related patch set uploaded (by Krinkle; owner: Gilles):
[performance/arc-lamp@master] xenon-log: Ignore the Redis subscription message

https://gerrit.wikimedia.org/r/447351

Change 447351 merged by Krinkle:
[performance/arc-lamp@master] xenon-log: Ignore the Redis subscription message

https://gerrit.wikimedia.org/r/447351