
Wikimedia\MWConfig\Profiler::excimerFlushToArclamp(): PHP Warning: RedisException: Connection timed out
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error
labels.normalized_message
[61938db2-c451-4f37-ba03-9c622af2b2b5] /w/index.php?go=Go&ns0=1&ns100=1&ns106=1&ns12=1&ns14=1&ns6=1&search=muslim%20women&title=Special:Search   PHP Warning: RedisException: Connection timed out
error.stack_trace
from /srv/mediawiki/src/Profiler.php(342)
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/src/Profiler.php(342): trigger_error(string, integer)
#2 /srv/mediawiki/src/Profiler.php(247): Wikimedia\MWConfig\Profiler::excimerFlushToArclamp(array, array, string)
#3 /srv/mediawiki/php-1.41.0-wmf.30/includes/language/Message.php(1059): Wikimedia\MWConfig\Profiler::Wikimedia\MWConfig\{closure}(ExcimerLog)
#4 /srv/mediawiki/php-1.41.0-wmf.30/includes/linker/Linker.php(2207): Message->text()
#5 /srv/mediawiki/php-1.41.0-wmf.30/includes/linker/Linker.php(2388): MediaWiki\Linker\Linker::titleAttrib(string, array, array, DerivativeContext)
#6 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/components/SkinComponentLink.php(239): MediaWiki\Linker\Linker::tooltipAndAccesskeyAttribs(string, array, array)
#7 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/components/SkinComponentLink.php(183): MediaWiki\Skin\SkinComponentLink->applyLinkTitleAttribs(array, boolean, array)
#8 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/components/SkinComponentLink.php(254): MediaWiki\Skin\SkinComponentLink->makeLink(string, array, array)
#9 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/components/SkinComponentListItem.php(170): MediaWiki\Skin\SkinComponentLink->getTemplateData()
#10 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/components/SkinComponentListItem.php(221): MediaWiki\Skin\SkinComponentListItem->makeListItem(string, array, array, array)
#11 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/components/SkinComponentMenu.php(166): MediaWiki\Skin\SkinComponentListItem->getTemplateData()
#12 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/Skin.php(2470): MediaWiki\Skin\SkinComponentMenu->getTemplateData()
#13 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/SkinTemplate.php(630): Skin->getPortletData(string, array)
#14 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/SkinTemplate.php(594): SkinTemplate->getPortletsTemplateData()
#15 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/Skin.php(740): SkinTemplate->getCategoryLinks()
#16 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/SkinMustache.php(134): Skin->getCategories()
#17 /srv/mediawiki/php-1.41.0-wmf.30/skins/Vector/includes/SkinVectorLegacy.php(161): SkinMustache->getTemplateData()
#18 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/SkinMustache.php(92): MediaWiki\Skins\Vector\SkinVectorLegacy->getTemplateData()
#19 /srv/mediawiki/php-1.41.0-wmf.30/includes/skins/SkinTemplate.php(174): SkinMustache->generateHTML()
#20 /srv/mediawiki/php-1.41.0-wmf.30/includes/Output/OutputPage.php(2947): SkinTemplate->outputPage()
#21 /srv/mediawiki/php-1.41.0-wmf.30/includes/MediaWiki.php(978): MediaWiki\Output\OutputPage->output(boolean)
#22 /srv/mediawiki/php-1.41.0-wmf.30/includes/MediaWiki.php(613): MediaWiki->main()
#23 /srv/mediawiki/php-1.41.0-wmf.30/index.php(50): MediaWiki->run()
#24 /srv/mediawiki/php-1.41.0-wmf.30/index.php(46): wfIndexMain()
#25 /srv/mediawiki/w/index.php(3): require(string)
#26 {main}
Impact
Notes

It looks like our profiler pushes data to a Redis server, and the connection times out. This started after promoting group 2 wikis to 1.41.0-wmf.30.

excimer_redis_timeout.png (245×627 px, 14 KB)

If I understand it properly, the profiler is set up via wmf-config/PhpAutoPrepend.php, which has:

Wikimedia\MWConfig\Profiler::setup( [
    'redis-host' => $wmgServiceConfig->getLocalService( 'xenon' ),
    'redis-port' => 6379,
    // Connection timeout, in seconds.
    'redis-timeout' => $wmgServiceConfig->getRealm() === 'labs' ? 1 : 0.1,
    // ...
] );

So that is a 100 ms timeout for production, and the Redis host is 10.64.16.88, with a comment stating it is arclamp1001.eqiad.wmnet.
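As a rough illustration of what a 100 ms connection timeout means in practice, here is a hedged Python sketch (not the production PHP code; the helper name is hypothetical):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 0.1) -> bool:
    """Attempt a TCP connect with a short timeout, mirroring the 100 ms
    'redis-timeout' used in production. Returns False on timeout or
    refusal instead of raising, unlike the profiler's warning path."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers socket.timeout, ConnectionRefusedError, etc.
        return False
```

When the Redis host is saturated, completing the TCP handshake (or getting a reply) can exceed 100 ms, and the client raises a timeout, which is what surfaces in the logs as `RedisException: Connection timed out`.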

Looks like @herron might know about that Redis server.

Event Timeline

We have a Redis dashboard, but arclamp1001.eqiad.wmnet is not collected there.

On arclamp1001 host overview:

There are some TCP retransmits showing since 1.41.0-wmf.30 was promoted to group2 wikis (i.e. more traffic):

arclamp_tcp_errors.png (322×899 px, 32 KB)

And some elevated tcp/attemptfails:

arclamp_tcp_attemptfails.png (329×907 px, 26 KB)

Change 965744 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] arclamp: add redis exporter and prom scrape config

https://gerrit.wikimedia.org/r/965744

Change 965744 merged by Herron:

[operations/puppet@production] arclamp: add redis exporter and prom scrape config

https://gerrit.wikimedia.org/r/965744

Change 966209 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: add redis_arclamp to redis jobs

https://gerrit.wikimedia.org/r/966209

Change 966209 merged by Herron:

[operations/puppet@production] prometheus: add redis_arclamp to redis jobs

https://gerrit.wikimedia.org/r/966209

We have a Redis dashboard but arclamp1001.eqiad.wmnet is not collected there.

With the above patches arclamp is now included in this dashboard as redis_arclamp

Since this involved deploying the redis prometheus exporters we also just got a related alert:

(RedisMemoryFull) firing: Redis memory full on arclamp1001:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_arclamp - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_arclamp&var-instance=arclamp1001:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull

Currently maxmemory is set to 1Mb

maxmemory                   => '1Mb',

Which seems quite small, but there is probably a story behind that. I'll upload a patch to double this and go from there.
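For a sense of scale, a back-of-envelope capacity estimate (the 2 KB average message size is an assumption for illustration, not a measured figure):

```python
# Hypothetical sizing: how many pubsub messages fit in the Redis buffer.
avg_msg = 2 * 1024              # bytes per flushed sample batch (assumed)
maxmemory_old = 1 * 1024 ** 2   # 1 MB, the original setting
maxmemory_new = 2 * 1024 ** 2   # 2 MB after doubling

print(maxmemory_old // avg_msg)  # 512 messages of headroom
print(maxmemory_new // avg_msg)  # 1024 messages after the doubling
```

On those assumed numbers, the buffer holds only a few hundred messages, so a brief burst of traffic or a slow subscriber could plausibly fill it.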

Change 966216 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] arclamp::redis: double maxmemory

https://gerrit.wikimedia.org/r/966216

Change 966216 merged by Herron:

[operations/puppet@production] arclamp::redis: double maxmemory

https://gerrit.wikimedia.org/r/966216

10:52 AM <+jinxer-wm> (RedisMemoryFull) resolved: Redis memory full on arclamp1001:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_arclamp - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_arclamp&var-instance=arclamp1001:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull

I wasn't sure whether Redis monitoring was needed; I only mentioned it because it might have been useful. What is great to know is that enabling the exporter immediately raised an alarm pointing to the root cause (not enough memory). Great!

Searching for RedisException on https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT still shows timeout issues, so I guess there is another problem besides the memory limit.

Currently maxmemory is set to 1Mb

maxmemory                   => '1Mb',

Which seems quite small, but there is probably a story behind that. I'll upload a patch to double this and go from there.

The setting traces all the way back through arclamp::redis -> xenon::redis -> udp2log -> role::xenon -> redis::legacy to 2014.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/179791

No reason is given. My guess is that this was done to minimize risk, given that the Redis pubsub was, at the time, hosted on the mwlog server. Combined with it being used solely for pubsub and nothing else, it needs relatively little buffer.

Several categorical changes have taken place since then:

  • We increased the sampling rate (HHVM -> PHP 7).
  • We increased the maximum stack depth (the size of one event).
  • We doubled the events by adding a variant (excimer-cpu + excimer-wall).
  • We doubled the channels (excimer, excimer-wall, plus a k8s variant for each).
  • Backend traffic, and thus samples, increased through reduced CDN max-age and reduced CDN layering.
  • Backend traffic increased through organic growth.
  • Backend traffic increased due to attacks.

While I don't have an estimate of how many messages it takes to fill 1 Mb, it is not too surprising that the buffer fills up.
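Multiplying out the factors above gives a feel for how much more the buffer must absorb than in 2014. The two doubling factors come from the list; the remaining multipliers are made-up placeholders for illustration:

```python
# Rough compound growth of profiler traffic since the 1 MB limit was set.
sampling_rate_factor = 2     # assumed: HHVM -> PHP 7 sampling increase
stack_depth_factor = 1.5     # assumed: deeper stacks mean bigger events
variant_factor = 2           # excimer-cpu + excimer-wall (from the task)
channel_factor = 2           # k8s variant per channel (from the task)
traffic_factor = 2           # assumed: CDN changes + growth + attacks

total = (sampling_rate_factor * stack_depth_factor * variant_factor
         * channel_factor * traffic_factor)
print(total)  # 24.0: over an order of magnitude more data than in 2014
```

Even on these made-up numbers the compounding is the point: a 10 Mb limit roughly restores the relative headroom the 1 Mb limit had when it was chosen.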

This Redis instance is now on dedicated arclamp hosts instead of co-located on mwlog/udp2log. So the risk is quite significantly reduced by essentially being out of the Tier-1 path.

I would suggest increasing it to at least 10 Mb. That way the 0.98 alert threshold is more likely to trigger days or weeks ahead of the service going down, instead of essentially at the same time.

Change 1003083 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Profiler: Silence "RedisException: Connection timed out"

https://gerrit.wikimedia.org/r/1003083

Change 1003083 merged by jenkins-bot:

[operations/mediawiki-config@master] Profiler: Silence "RedisException: Connection timed out"

https://gerrit.wikimedia.org/r/1003083
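The shape of that silencing change is presumably a guard around the flush. Below is a hedged Python sketch of the idea (the actual patch is PHP in operations/mediawiki-config; the function name, channel name, and logger here are hypothetical):

```python
import logging

log = logging.getLogger("profiler")

def flush_to_arclamp(client, lines) -> bool:
    """Publish profiler samples, swallowing connection errors.

    Profiling output is best-effort: a Redis timeout should not emit
    a warning into production error logs on every affected page view.
    """
    try:
        for line in lines:
            client.publish("excimer", line)  # hypothetical channel name
        return True
    except ConnectionError as exc:  # stand-in for RedisException
        log.debug("arclamp flush failed: %s", exc)
        return False
```

The design choice is to degrade silently: losing a batch of profiling samples is acceptable, while flooding logstash with warnings (as in this task) is not.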

Change 1008752 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Profiler: Silence "RedisException: Connection timed out" (take 2)

https://gerrit.wikimedia.org/r/1008752

Change 1008752 merged by jenkins-bot:

[operations/mediawiki-config@master] Profiler: Silence "RedisException: Connection timed out" (take 2)

https://gerrit.wikimedia.org/r/1008752

Krinkle claimed this task.
Krinkle triaged this task as High priority.