Page MenuHomePhabricator

Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error

Request URL: Various URLs
Request ID: 1: XS9NpApAAEAAAJrfN7AAAAAA, or 2: XS9NpApAAEAAAJrfN7AAAAAA

message
Redis exception connecting to "/var/run/nutcracker/redis_eqiad.sock": read error on connection
trace1
#0 [internal function]: Redis->checkConnection()
#1 [internal function]: Redis->sockReadLine()
#2 [internal function]: Redis->sockReadData(NULL)
#3 [internal function]: Redis->processBooleanResponse()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/redis/RedisConnectionPool.php(249): Redis->auth(string)
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php(393): RedisConnectionPool->getConnection(string, Monolog\Logger)
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php(92): RedisBagOStuff->getConnection(string)
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/BagOStuff.php(206): RedisBagOStuff->doGet(string, integer)
#8 /srv/mediawiki/php-1.34.0-wmf.14/extensions/AbuseFilter/includes/AbuseFilter.php(761): BagOStuff->get(string)
#9 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/MWCallableUpdate.php(38): Closure$AbuseFilter::recordProfilingResult()
#10 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(372): MWCallableUpdate->doUpdate()
#11 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(279): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#12 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(224): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#13 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(148): DeferredUpdates::handleUpdateQueue(array, string, integer)
#14 /srv/mediawiki/php-1.34.0-wmf.14/extensions/EventBus/includes/JobExecutor.php(89): DeferredUpdates::doUpdates()
#15 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute(array)
#16 {main}
trace2
#0 [internal function]: Redis->checkConnection()
#1 [internal function]: Redis->sockReadLine()
#2 [internal function]: Redis->sockReadData(NULL)
#3 [internal function]: Redis->processBooleanResponse()
#4 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/redis/RedisConnectionPool.php(249): Redis->auth(string)
#5 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php(393): RedisConnectionPool->getConnection(string, Monolog\Logger)
#6 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/RedisBagOStuff.php(92): RedisBagOStuff->getConnection(string)
#7 /srv/mediawiki/php-1.34.0-wmf.14/includes/libs/objectcache/BagOStuff.php(206): RedisBagOStuff->doGet(string, integer)
#8 /srv/mediawiki/php-1.34.0-wmf.14/extensions/AbuseFilter/includes/AbuseFilter.php(761): BagOStuff->get(string)
#9 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/MWCallableUpdate.php(38): Closure$AbuseFilter::recordProfilingResult()
#10 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(372): MWCallableUpdate->doUpdate()
#11 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(279): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#12 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(224): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#13 /srv/mediawiki/php-1.34.0-wmf.14/includes/deferred/DeferredUpdates.php(148): DeferredUpdates::handleUpdateQueue(array, string, integer)
#14 /srv/mediawiki/php-1.34.0-wmf.14/extensions/EventBus/includes/JobExecutor.php(89): DeferredUpdates::doUpdates()
#15 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute(array)
#16 {main}

Impact

For the AF one, no data profiling is saved, which is not that bad. For others, I don't know. This seems to be a pretty consistent error, unrelated to the code path triggering it.

Notes

Seen 613 times, since the wmf.14 rollout to group1 I believe.

Event Timeline

However, it did indeed start 16-07-2019 shortly after wmf.14 went to group0 for the first time. But since then, affecting both wmf.13 and wmf.14 https://logstash.wikimedia.org/goto/e88b2f9f69359bc2d6b51799d5a6d9d5

Screenshot 2019-07-18 at 16.46.30.png (762×2 px, 63 KB)

Screenshot 2019-07-18 at 16.48.37.png (562×626 px, 17 KB)

Impact: Some web requests to unable to connect Redis, which is used for SessionManager and MainStash. As far as I know, there is no retry, which means these requests would be unable to read the user session object, or stash other information that subsequent requests might be relying on.

Tagging Operations, as this might be indicative of general infra issue with nutcracker or HHVM.

From IRC:

<elukey> [2019-07-17 15:02:…] nc_redis.c parsed unsupported command 'WATCH'
<elukey> in the nutcracker logs

This is likely caused by 7e647d2f0f5ba9d4. That is a command MediaWiki can send as of last week, after a connection is successfully established. The error in this task is due to Redis/Nutcracker being unavailable on some app servers to establish a connection to.

If the above commands are the root cause of that, that would still be worthy of additional investigation as it should not make the service unavailable for everyone if one request is making some invalid commands.

In any event, I'll revert the use of watch() to see if it helps.

Change 524269 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] Revert "objectcache: fix race conditions in RedisBagOStuff::incr()"

https://gerrit.wikimedia.org/r/524269

Re-blocking the train. Turns out that while wmf.13 is also affected, it is affected because there are wikis with wmf.14 breaking Redis for everyone else.

greg triaged this task as Unbreak Now! priority.Jul 18 2019, 5:02 PM

Change 524278 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] Revert "objectcache: fix race conditions in RedisBagOStuff::incr()"

https://gerrit.wikimedia.org/r/524278

Change 524279 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.34.0-wmf.14] Revert "objectcache: fix race conditions in RedisBagOStuff::incr()"

https://gerrit.wikimedia.org/r/524279

Checked logs on mw1345:

elukey@mw1345:~$ grep -rni "WATCH" /var/log/nutcracker/ | egrep -o '2019-07-[0-9]*\ [0-9]{2}:' | sort | uniq  -c
     75 2019-07-16 14:
     58 2019-07-16 17:
     75 2019-07-16 19:
    150 2019-07-16 20:
    151 2019-07-16 21:
     17 2019-07-17 03:
     58 2019-07-17 05:
     32 2019-07-17 06:
     75 2019-07-17 10:
     75 2019-07-17 12:
  74167 2019-07-17 14:
  84233 2019-07-17 15:
  73242 2019-07-17 16:
   8594 2019-07-17 17:
    266 2019-07-17 18:
      1 2019-07-17 19:
    179 2019-07-17 21:
      6 2019-07-18 01:
     77 2019-07-18 04:
      1 2019-07-18 07:
    303 2019-07-18 09:
    128 2019-07-18 11:
    159 2019-07-18 12:
  32271 2019-07-18 13:
  97031 2019-07-18 14:
 156448 2019-07-18 15:
  89981 2019-07-18 16:
  25954 2019-07-18 17:

Seems to be started at around the same time of the logstash errors, plus there are hours in which more errors are registered.

Change 524278 abandoned by Krinkle:
Revert "objectcache: fix race conditions in RedisBagOStuff::incr()"

Reason:
For master, superseded by Ic37efc2963b14.

https://gerrit.wikimedia.org/r/524278

Change 524279 merged by jenkins-bot:
[mediawiki/core@wmf/1.34.0-wmf.14] Revert "objectcache: fix race conditions in RedisBagOStuff::incr()"

https://gerrit.wikimedia.org/r/524279

Krinkle lowered the priority of this task from Unbreak Now! to High.Jul 18 2019, 5:48 PM

Prod recovered.

Change 524269 merged by jenkins-bot:
[mediawiki/core@master] Revert "objectcache: fix race conditions in RedisBagOStuff::incr()"

https://gerrit.wikimedia.org/r/524269

aaron claimed this task.
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:06 PM