Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out
Open, HighPublic

Description

Paired errors connecting to the job queue redis instances:

Feb  3 07:51:10 mw1226:  #012Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out
Feb  3 07:51:10 mw1226:  #012Warning: Failed connecting to redis server at rdb1001.eqiad.wmnet: Connection timed out

Can be found by search redis on https://logstash.wikimedia.org/

Seems to be load related and/or mostly from job runners. Example:

Unable to connect to redis server rdb1003.eqiad.wmnet:6380.
#0 /srv/mediawiki/php-1.28.0-wmf.18/includes/jobqueue/JobQueueRedis.php(310): JobQueueRedis->getConnection()
#1 /srv/mediawiki/php-1.28.0-wmf.18/includes/jobqueue/JobQueue.php(372): JobQueueRedis->doPop()
#2 /srv/mediawiki/php-1.28.0-wmf.18/includes/jobqueue/JobQueueFederated.php(290): JobQueue->pop()
#3 /srv/mediawiki/php-1.28.0-wmf.18/includes/jobqueue/JobQueue.php(372): JobQueueFederated->doPop()
#4 /srv/mediawiki/php-1.28.0-wmf.18/includes/jobqueue/JobQueueGroup.php(204): JobQueue->pop()
#5 /srv/mediawiki/php-1.28.0-wmf.18/includes/jobqueue/JobRunner.php(160): JobQueueGroup->pop()
#6 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run()
#7 {main}

4000 messages over an hour :( Seems redis servers are saturated.

From dupe task T130078

Logstash shows up a good spam of messages "Unable to connect to redis server" on https://logstash.wikimedia.org/app/kibana#/dashboard/Redis

The top offenders:

server:port24 hours hits
rdb1001.eqiad.wmnet:6380166685
rdb1001.eqiad.wmnet:6379165655
rdb1001.eqiad.wmnet:6381159330
rdb1005.eqiad.wmnet:638073329
rdb1005.eqiad.wmnet:638168856
rdb1005.eqiad.wmnet:637965288
rdb1003.eqiad.wmnet:637913038
rdb1007.eqiad.wmnet:637913015
rdb1007.eqiad.wmnet:638012778
rdb1007.eqiad.wmnet:638112351

Maybe it is due to the currently abnormal of refreshLink jobs being processed. Maybe the Nutcracker proxy is not used or redis instance is being overloaded / has not enough connections.

Also it seems load-related as there is a daily pattern for "unable to connect to redis server"

Debugging

One can run a jobrunner process in foreground mode with debug log send to stdout (example spam P5240 ):

sudo -H -u  www-data /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/tmp/jobrunner-debug.conf  --verbose
There are a very large number of changes, so older changes are hidden. Show Older Changes
hashar added a subscriber: hashar.Feb 3 2016, 11:14 PM

I have seen them in logstash for a while. Assumed it was some ongoing issue.

hashar renamed this task from Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wm net [110]: Connection timed out to Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out.Sep 15 2016, 12:35 PM
hashar added a project: Operations.
hashar updated the task description. (Show Details)
hashar set Security to None.
Dzahn triaged this task as Normal priority.Sep 22 2016, 2:33 AM
elukey added a subscriber: elukey.Sep 23 2016, 10:28 AM
hashar updated the task description. (Show Details)Sep 23 2016, 10:43 AM
thcipriani raised the priority of this task from Normal to High.Feb 6 2017, 10:43 PM
thcipriani added a subscriber: thcipriani.

ping! Has anyone been able to look into this? This is currently one of the many things we've had to sweep under the fatalmonitor rug (thanks logstash filters) as it's become a huge source of noise.

I can't overstate how much clearing this task would help logspam, deployers, and the ability to safely and effectively move out code.

My understanding is could be "fixed" on the php side of things by someone (probably @aaron [/me throws someone under the bus]) modifying logging somehow or this could be mitigated by changing the actual redis timeout. I don't have any opinions about this other than the behavior now seems incorrect and is frustrating for anyone digging into logs.

elukey added a comment.EditedFeb 7 2017, 8:06 AM

@thcipriani Hi! I thought that the error was changed but it is not, not enough coffee before writing :)

Nutcracker is not used for JobQueues so there must be something weird happening when executing JobQueueRedis->getConnection() that causes the timeout.

As you mentioned before, maybe 200ms of timeout for a Jobrunner is too tight? I understand a limit like that for an appserver but not for a batch job scheduler :)

hashar added a comment.Feb 7 2017, 8:48 AM

We talked about it during releng meeting yesterday. To capture some of the random ideas I had:

a) the jobrunner shards are defined in mediawiki-config, maybe we could migrate to use nutcracker and have the shards defined in puppet. This way the jobrunner process will connect to localhost nutcracker which would keep a permanent connection with the jobrunner redis. At least that is how I understand nutcracker.

b) the jobrunner are heavily active, maybe redis has a limit to the number of connections it can accept or the Linux TCP stacks reach the max connections. Thus the socket is never opened/accepted on the redis side and the client ends up with a timeout.

c) have jobrunner client to throttle the connection on failure/timeout and retry gracefully after X time.

But really that is random thoughts. @elukey since we are in the same timezone, I am willing to help as much as I can :-]

hashar added a comment.Feb 7 2017, 9:20 AM

Some mediawiki-config history:

March 17 2016, timeout bump from 250ms to 300ms 4c7a6ec3de00256878df9f12999fefff32652f84

 $wmgRedisQueueBaseConfig = array(
        'class' => 'JobQueueRedis',
        'redisConfig' => array(
-               'connectTimeout' => .250,
+               'connectTimeout' => .300,

March 11th 2016, timeout bump from 200ms to 250ms and the persistent connection is dropped a4bd3dc844256efa2762608d2e3126ce9e3a1b0b

 $wmgRedisQueueBaseConfig = array(
        'class' => 'JobQueueRedis',
        'redisConfig' => array(
-               'connectTimeout' => .200,
+               'connectTimeout' => .250,
                'password' => $wmgRedisPassword,
                'compression' => 'gzip',
-               'persistent' => defined( 'MEDIAWIKI_JOB_RUNNER' )
+               #'persistent' => defined( 'MEDIAWIKI_JOB_RUNNER' )

And at some point it was 2000 ms 12d6cea38f797974700835c2b70dfc947f209800

 $wmgRedisQueueBaseConfig = array(
        'class' => 'JobQueueRedis',
        'redisConfig' => array(
-               'connectTimeout' => 2,
+               'connectTimeout' => .250,

@aaron I guess we can just raise the timeout again? Maybe look at enabling persistent connection again?

As you mentioned before, maybe 200ms of timeout for a Jobrunner is too tight? I understand a limit like that for an appserver but not for a batch job scheduler :)

Raising the limits here might mitigate some of the impact of this message in logstash. I don't know if that's the Right™ solution here or not. Seems like the commits lowering the timeout and removing persistent connections have some context that I'm missing.

@aaron could you weigh-in on this?

Joe added a comment.EditedFeb 17 2017, 7:26 AM

Hi! I'm the one who suggested most of those timeout changes. Some have different historical reasons, but I think we can safely raise the connect timeout for the jobrunners (NOT for the common appservers).

BUT!

In the end, this won't really help us: our issue is the way we use redis, with big bowls of lua that will be executed on a single thread, and that *is* a bottleneck.

What we can (and should) do is the following:

  • Stop replicating locally from e.g. rdb1001 to rdb1002; assume the redundancy we have cross-dc is enough. Do the same for rdb1004, rdb1006 and rdb1008
  • Add a new set of shards on these 4 machines, so that the load on each redis instance is even lower
  • Do the same in codfw

*This* will reduce the number of those messages, nothing else. Or, we could move off of redis and live an happier life, but that's not only ops work.

Interesting thing that I found today while debugging with tcpdump. There are constant TCP RST packets that are returned from the Redis instances on rdb100[1357] to the mw hosts, that seems happening right at the end of a tcp stream (after FIN,ACK).

Example of wireshark output of a tcpdump captured on rdb1005 restricting the flow to only mw1161:

The "Follow TCP Stream" option shows (added some [REDACTED] tags to be on the safe side):

*2
$4
AUTH
$20
[REDACTED]
+OK
*10
$7
EVALSHA
$40
[REDACTED]
$1
6
$57
ruwikinews:jobqueue:wikibase-addUsagesForPage:l-unclaimed
$56
ruwikinews:jobqueue:wikibase-addUsagesForPage:h-sha1ById
$56
ruwikinews:jobqueue:wikibase-addUsagesForPage:h-idBySha1
$55
ruwikinews:jobqueue:wikibase-addUsagesForPage:z-claimed
$56
ruwikinews:jobqueue:wikibase-addUsagesForPage:h-attempts
$52
ruwikinews:jobqueue:wikibase-addUsagesForPage:h-data
$10
[REDACTED]
$-1
*1
$4
QUIT
+OK

This https://github.com/phpredis/phpredis/issues/562 might be relevant since I can see the QUIT command in the end (assuming that we use phpredis that might be wrong! :).

This issue might or might not be related to what we are trying to solve but it is definitely (in my opinion) something to fix.

elukey added a comment.EditedMar 17 2017, 11:11 AM

Another thing worth to discuss is the following snippet:

https://github.com/wikimedia/mediawiki/blob/wmf/1.29.0-wmf.12/includes/libs/redis/RedisConnectionPool.php#L230-L246

// Mark server down for some time to avoid further timeouts
$this->downServers[$server] = time() + self::SERVER_DOWN_TTL;

To rule it out we might check for the debug log in https://github.com/wikimedia/mediawiki/blob/wmf/1.29.0-wmf.12/includes/libs/redis/RedisConnectionPool.php#L185 but I am not sure how to instruct mw to log it!

Mentioned in SAL (#wikimedia-releng) [2017-04-03T09:17:33Z] <hashar> deployment-jobrunner02 : cherry picked a monkey patch for Redis::close() to prevent it from sending QUIT command ( https://gerrit.wikimedia.org/r/#/c/346117/ ) - T125735

elukey added a comment.Apr 3 2017, 4:28 PM

The issue with QUIT seems more subtle, namely only sometimes the RST happens after a QUIT command.

For example, the following TCP stream does not end up in a RST:

*2
$4
AUTH
$20
REDACTED
+OK
*10
$7
EVALSHA
[..]
$-1
*1
$4
QUIT

This time mw1161 and rdb1005 (the two IPs listed) end up in [FIN,ACK] and the "OK" that Redis should return after QUIT is not there. It seems like Redis sometimes sends the OK and some other times does not, creating the RST issue.

Me and @hashar tried to monkey patch deploymentprep-jobrunner02 to avoid the QUIT but we didn't succeed (at least, I still can see the QUITs sent by HHVM to Redis).

Not sure if this issue is directly connected to the timeouts though (probably not), might be better to open a subtask.

elukey added a comment.Apr 3 2017, 4:43 PM

Returning to the main timeout issue, it seems to me that the next step is trying to find somebody to increase only the connectTimeout to Redis for the jobrunners to something less strict that 0.2s, but I have no idea who could help.

@aaron - would you have time to give us some direction about how to (at least try) a bigger connect timeout only for the jobrunners?

aaron added a comment.Apr 3 2017, 11:25 PM

In $wmgRedisQueueBaseConfig in wmf-config/jobqueue.php I see the timeout is currently 0.3. It could be changed with a patch there. This would not effect redis timeouts for session storage.

In $wmgRedisQueueBaseConfig in wmf-config/jobqueue.php I see the timeout is currently 0.3. It could be changed with a patch there. This would not effect redis timeouts for session storage.

But would it change the timeouts for the app/api servers trying to enqueue jobs to Redis? The current issue seems related only to jobrunners so it would be safer to bump only their timeouts rather than all of them (but maybe I am missing some knowledge and this is already the case :).

aaron added a comment.Apr 4 2017, 11:35 PM

The timeout could be conditioned on

defined( 'MEDIAWIKI_JOB_RUNNER' )

...via a ternary.

Change 346508 had a related patch set uploaded (by Elukey):
[operations/mediawiki-config@master] Increase Redis connection timeout for MediaWiki Jobrunners

https://gerrit.wikimedia.org/r/346508

@aaron thanks a lot for the feedback, I created a code change that should do what you mentioned. I am wondering if we could re-enable connection persistence for the jobrunners, not sure if your change was meant to be permanent or only to fix a queue overloading temp issue.

Change 346705 had a related patch set uploaded (by Hashar):
[operations/puppet@production] (DO NOT SUBMIT) prevent scap on jobrunner02

https://gerrit.wikimedia.org/r/346705

Change 346705 abandoned by Hashar:
(DO NOT SUBMIT) prevent scap on jobrunner02

Reason:
Was for a redis debug session on beta cluster that proven successful. Was done by directly editing the dsh file anyway.

https://gerrit.wikimedia.org/r/346705

Discovered one thing today (that might be probably be trivial but I didn't realize it up to now). In logstash if we isolate one jobrunner in the search bar, like mw1306, the following messages are logged (various rdb hosts and ports):

Could not connect to server "rdb1001.eqiad.wmnet:6380"`
Unable to connect to redis server rdb1001.eqiad.wmnet:6380.
#0 /srv/mediawiki/php-1.29.0-wmf.19/includes/jobqueue/JobQueueRedis.php(315): JobQueueRedis->getConnection()
#1 /srv/mediawiki/php-1.29.0-wmf.19/includes/jobqueue/JobQueue.php(373): JobQueueRedis->doPop()
#2 /srv/mediawiki/php-1.29.0-wmf.19/includes/jobqueue/JobQueueFederated.php(290): JobQueue->pop()
#3 /srv/mediawiki/php-1.29.0-wmf.19/includes/jobqueue/JobQueue.php(373): JobQueueFederated->doPop()
#4 /srv/mediawiki/php-1.29.0-wmf.19/includes/jobqueue/JobQueueGroup.php(220): JobQueue->pop()
#5 /srv/mediawiki/php-1.29.0-wmf.19/includes/jobqueue/JobRunner.php(165): JobQueueGroup->pop()
#6 /srv/mediawiki/rpc/RunJobs.php(47): JobRunner->run()
#7 {main}

The first one is logged in RedisConnectionPool->getConnection, meanwhile the second one JobQueueRedis->getConnection() (that uses RedisConnectionPool).

As noted before when a connection failure happens in RedisConnectionPool (triggering the first message), the correspondent Redis server is marked as dead for 30 seconds:

// Mark server down for some time to avoid further timeouts
$this->downServers[$server] = time() + self::SERVER_DOWN_TTL;

So if I got it correctly, a simple connection failure ("Could not connect to") triggers some "Unable to connect to" messages. So if I focus my attention on the former message, I don't where the "Warning: timed out after 0.2" message of the task's description is logged.

@bd808 can you give me some info about --^ ?

Thanks!

Mentioned in SAL (#wikimedia-operations) [2017-04-07T10:53:51Z] <elukey> increase Redis connection timeout manually (.3s -> .5s) on mw1306 as performance test - T125735

I tried to manually modify /srv/mediawiki/wmf-config/jobqueue.php on mw1306 to test .500 and .800 timeout values (assumption: didn't restart hhvm since afaik the change should have been taken into account right after the file change, as normal deployments do). Observed logstash for mw1306 but nothing changed.

So I took a look to the /var/log/apache2/jobqueue-access.log response time (second field, https://wikitech.wikimedia.org/wiki/Apache_log_format). Every time that a Could not connect to server error is logged there seems to be a "slow" response time (that of course does not take into account only Redis but it is a good indicator). For example:

Logstash:

2017-04-07T11:27:53	message:Could not connect to server "rdb1007.eqiad.wmnet:6379" url:/rpc/RunJobs.php?wiki=simplewiki&type=RecordLintJob&maxtime=60&maxmem=300M

Apache:

2017-04-07T11:27:53   46999   http://127.0.0.1:9005/rpc/RunJobs.php?wiki=dawiki&type=cirrusSearchCheckerJob&maxtime=60&maxmem=300M
2017-04-07T11:27:53   46432   http://127.0.0.1:9005/rpc/RunJobs.php?wiki=ckbwiki&type=cirrusSearchLinksUpdatePrioritized&maxtime=60&maxmem=300M

2017-04-07T11:27:53   335974 http://127.0.0.1:9005/rpc/RunJobs.php?wiki=simplewiki&type=RecordLintJob&maxtime=60&maxmem=300M
                                          ----^
2017-04-07T11:27:53   35807   http://127.0.0.1:9005/rpc/RunJobs.php?wiki=kuwiktionary&type=cirrusSearchCheckerJob&maxtime=60&maxmem=300M
2017-04-07T11:27:53   41798   http://127.0.0.1:9005/rpc/RunJobs.php?wiki=hrwiki&type=cdnPurge&maxtime=60&maxmem=300M

So the response took more than .3s for this particular use case, but there are worst performers of several seconds (sampled mw1306's access log and ~2.5% of response times were above 1s).

This issue might be a lot more complicated that what we thought, especially following what @Joe wrote above about Redis being sigle threaded and executing big pile of blocking Lua code.

Change 346508 abandoned by Elukey:
Increase Redis connection timeout for MediaWiki Jobrunners

https://gerrit.wikimedia.org/r/346508

bd808 added a comment.Apr 7 2017, 3:29 PM

So if I got it correctly, a simple connection failure ("Could not connect to") triggers some "Unable to connect to" messages. So if I focus my attention on the former message, I don't where the "Warning: timed out after 0.2" message of the task's description is logged.

The Feb 3 07:51:10 mw1226: #012Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out and Feb 3 07:51:10 mw1226: #012Warning: Failed connecting to redis server at rdb1001.eqiad.wmnet: Connection timed out messages are logged in the hhvm channel in logstash. These are PHP warning events that are caught on the stderr stream of the PHP runtime. Check https://logstash.wikimedia.org/app/kibana#/dashboard/hhvm

Thanks a lot! The errors logged at the moment are clouded by https://phabricator.wikimedia.org/T158770, that hopefully will be resolved soon.

elukey added a comment.EditedApr 10 2017, 5:31 PM

Summary of the various issues as I understand them:

  1. the description reports a connection timeout happening between appservers and Redis job queues, easily checkable via https://logstash.wikimedia.org/app/kibana#/dashboard/elasticsearch/hhvm. A lot of the errors are probably also due to https://phabricator.wikimedia.org/T158770, but the main underlying issue seems the same one that the jobrunners are experiencing (following item).
  1. the majority of the "Redis" warnings and errors in logstash (https://logstash.wikimedia.org/ -> discovery -> search for redis) seems to be related to what I wrote in T125735#3163049 and T125735#3163543.
  1. The HHVM implementation of phpredis causes frequent TCP RST returned by HHVM to Redis due to T162354. This is probably the issue with lowest priority but it would be nice to get fixed nonetheless.
hashar updated the task description. (Show Details)Apr 11 2017, 2:20 PM

Mentioned in SAL (#wikimedia-operations) [2017-04-12T13:41:57Z] <elukey> apply SLOWLOG RESET and CONFIG SET slowlog-max-len 100000 (prev value 10000, 10ms) to rdb1005:6380 to track down slow reqs - T125735

elukey added a comment.EditedApr 12 2017, 2:30 PM

Didn't find much from the SLOWLOG, I expected more commands to be logged (like EVALSHA). Tried LATENCY DOCTOR and this is interesting:

Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. command: 17 latency spikes (average 117ms, mean deviation 14ms, period 90638.76 sec). Worst all time event 170ms.
2. aof-write-alone: 160 latency spikes (average 223ms, mean deviation 106ms, period 2273.16 sec). Worst all time event 1248ms.
3. aof-write-active-child: 160 latency spikes (average 313ms, mean deviation 85ms, period 4697.35 sec). Worst all time event 1483ms.
4. aof-write: 160 latency spikes (average 235ms, mean deviation 79ms, period 1629.81 sec). Worst all time event 1483ms.

no-appendfsync-on-rewrite yes is set.

We only use AOF persistence (no regular RDB snapshot), and I don't find super bad slow logs for EVALSHA (and frequent enough to match the timeouts).

When running a LUA script does it blocks new connections as well?

When running a LUA script does it blocks new connections as well?

From https://redis.io/commands/eval (we use EVALSHA but it should be the same):

Atomicity of scripts

Redis uses the same Lua interpreter to run all the commands. Also Redis guarantees that a script is executed in an atomic way: no other script or Redis command will be executed while a script is being executed. This semantic is similar to the one of MULTI / EXEC. From the point of view of all the other clients the effects of a script are either still not visible or already completed.
However this also means that executing slow scripts is not a good idea. It is not hard to create fast scripts, as the script overhead is very low, but if you are going to use slow scripts you should be aware that while the script is running no other client can execute commands.

There is a global lock to execute commands, but would it also prevents establishing a new connection entirely? While a script runs, I would expect new connection to be honored, the command parsed but the actual processing and result being delayed till the lock is released.

But maybe the script also locks the network connection code, and that would indeed explains the issue. Then beside raising timeout there is not much more we can do.

Mentioned in SAL (#wikimedia-operations) [2017-04-13T10:22:57Z] <elukey> executed CONFIG SET appendfsync no (prev value: "everysec") to Redis instance 6380 on rdb2005 - T125735

There is a global lock to execute commands, but would it also prevents establishing a new connection entirely? While a script runs, I would expect new connection to be honored, the command parsed but the actual processing and result being delayed till the lock is released.

But maybe the script also locks the network connection code, and that would indeed explains the issue. Then beside raising timeout there is not much more we can do.

Tried to install redis-server on a debian jessie VM, created a little lua script doing GET foo in a while true and tried to use both redis-cli and echo | nc localhost 6739. The result is the following:

BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.

So I'd say that connections are accepted, but no command is processed.

@aaron any suggestion about the direction to take from your previous work with job queues and Redis?

Nice test @elukey, looks like there is at least a TCP connection established. The log errors Warning: Failed connecting to redis server at rdb1001.eqiad.wmnet: Connection timed out comes from HHVM Redis.php:

function doConnect(...) {
    $conn = fsockopen(...);
    if( ! $conn ) {
        trigger_error(
            "Failed connecting to redis server at {$host}: {$errstr}",
            E_WARNING);
        return false;
    }

So looks like the Redis server does not even manage to open the TCP connection. On the other hang your busy loop test shows the server does open the connection and replies with a BUSY message (there is no timeout).

@hashar nice finding!

So let me recap the errors that we are seeing so we can brainstorm:

Warning: Failed connecting to redis server at rdb1001.eqiad.wmnet: Connection timed out seems generated by HHVM's Redis.php client while trying to create a socket to a Redis server. As far as I can see from Logstash, this error message is logged by app/api-servers only and it is sporadic.

Could not connect to server "rdb1001.eqiad.wmnet:6381" (and other rdbs:ports) seems to be generated in RedisConnectionPool.php's getConnection (that uses the HHVM's Redis.php client) and it is related to jobrunners only. I checked and they don't generate the above Warning in logstash and in the hhvm log.

Unable to connect to redis server rdb1001.eqiad.wmnet:6379 (and other rdbs:ports) seems to be a side effect of the above one (server put temporarily offline from the client's perspective for 30s by RedisConnectionPool).

Failed connecting to redis server at rdb1003.eqiad.wmnet: Bad file descriptor seems common to both logstash dashboards but IIRC it is due to T158770

While reviewing the above data and graphite metrics I realized that I might have missed something about the jobrunners, so I re-checked netstat metrics:

elukey@mw1306:~$ sudo netstat -tuap | grep TIME_WAIT | grep rdb | wc -l
70083

elukey@mw1306:~$ sudo sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768	60999

We removed the persistent connections in T129517#2113526

@aaron - I know that I am annoying, please be patient :) - Do you remember if that change was really needed or if it was only avoiding more problems during the outage? I am seeing that we use tons of TCP connections across the appservers, and the jobrunners are the most affected ones.

We could try to increase net.ipv4.ip_local_port_range as we do in other places, but the problem would remain there.

So on every mw host we set net.ipv4.tcp_tw_reuse=1, this is probably why we can handle that many sockets in TIME_WAIT without incurring in bigger issues.

I tried today to set net.ipv4.ip_local_port_range="15000 64000" on mw1306 (jobrunner) as an attempt to increase the available local ports for sockets and see the effect to the Redis connections. I have observed a higher amount of sockets in TIME_WAIT, and of course a overall increase in sockets. No big difference in the number of Redis connection timeout errors though afaics from logstash.

In any case, I'd say that even if we rely on net.ipv4.tcp_tw_reuse=1 we shouldn't have that many sockets opened to Redis (and others) but maybe a better connection pooling with (possibly) persistent connections.

Any other though beside mine and Hashar's would be great to figure out what is the best way to proceed :)

After some (Easter) thinking I am convinced that the most probable root cause of this issue is not using persistent connections between HHVM and Redis.

There are a couple of things that we can try to do on both sides to test:

  1. Pick a jobrunner and enable persistent connections to Redis in mediawiki config. As mentioned before we disabled it in T129517#2113526, even if re-reading the task it is not super clear what problem was causing.
  1. Pick a Redis Job queue and increase its tcp-backlog size (default 511). I don't love this option since 1) should be tried first in my opinion, increasing the tcp-backlog feels more like putting a band-aid rather than fixing the underlying root cause.

There are also other things that we can explore, but only if we resolve the above points first:

  1. Increase the number of Redis hosts and Redis shards. For example, rather than replicating each shard to eqiad and codfw (for example rdb1003 replicates to rdb2003 and rdb1004) we could simply replicate to codfw and use the eqiad host to run more independent Redis instances. This would ease a bit pressure on rdb hosts [[citation_needed]].
  1. Investigate latencies of lua EVALSHA commands, that from Redis slowlog seems to reach even 100ms (blocking all the other commands execution). This is currently a big bottleneck of our Job Queues, but afaik we have not yet figure out how bad it impacts scalability (I mean, with numbers and graphs :).
  1. Investigate how Redis latencies are affected due to AOF fsync performed every 1s (default). I don't see a ton of IO wait on rdb1* instances but Redis authors reported multiple times that fsync can cause Redis stalls when there is too much disk contention and/or the disks are slow.
elukey added a subscriber: Krinkle.
aaron added a comment.Apr 20 2017, 8:41 PM

I think it's worth trying persistent connections again.

If there is no intra-DC replication. How would server failures be handled? If server 1XXX fails, would you clone from the other DC 2XXX counterpart to a spare 1YYY and use that one?

It also might be useful to put limits on the Lua script in PeriodicScriptParamsIterator via 'LIMIT 0 100'. I've lowered MediaWiki-side push limits in https://gerrit.wikimedia.org/r/#/c/349113/ . This might lower latency.

I think it's worth trying persistent connections again.

+1, I'd really like to test it. Today I live hacked /srv/mediawiki/wmf-config/jobqueue.php (and also jobqueue-labs.php) on deployment-jobrunner02.deployment-prep.eqiad.wmflabs but I wasn't probably able to set the persistent connection setting, since I haven't observed any change in the following metric:

https://graphite-labs.wikimedia.org/render/?width=800&height=600&target=deployment-prep.deployment-jobrunner02.network.connections.TIME_WAIT

We could start from labs and test, or just use a jobrunner in production as "canary", let me know which option you prefer :)

If there is no intra-DC replication. How would server failures be handled? If server 1XXX fails, would you clone from the other DC 2XXX counterpart to a spare 1YYY and use that one?

Nothing has been planned yet, we are exploring options to the current replication scheme (1 "local" replica in the active DC and one "remote" replica). We could probably come up with a smarter and more efficient scheme (namely trying to reduce the hw need if possible). Failing over to DC 2XXX directly (if DC 1XXXX fails) might not be a viable option due to the need of IPSEC between all the mw hosts and Redis in codfw, but we think about a different strategy like using only one stand-by host as backup if one of the 1XXX fails. Everything is still fuzzy at the moment, but it would be really great to discuss it with you and Joe during the next weeks to come up with a good and shared way forward.

It also might be useful to put limits on the Lua script in PeriodicScriptParamsIterator via 'LIMIT 0 100'. I've lowered MediaWiki-side push limits in https://gerrit.wikimedia.org/r/#/c/349113/ . This might lower latency.

Thanks :) Another good thing to do would be to start a Wikitech page (if there is none) with Jobqueue/jobrunner tunables to be aware of..

elukey moved this task from Backlog to In Progress on the User-Elukey board.Tue, May 2, 4:28 PM

I think it's worth trying persistent connections again.

In the meantime https://gerrit.wikimedia.org/r/#/c/349722/ was merged. I'd like to test persistent connections again right after the switchover, maybe only for jobrunners in the beginning. How does it sound?

hashar added a comment.Thu, May 4, 2:31 PM

I am all for enabling persistent connections again. Ideally we would deploy it on a single jobrunner for a start and progressively expand since I am not sure what kind of havoc it is going to cause.

Change 351854 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Re-enable persistent connection to Redis for jobrunners

https://gerrit.wikimedia.org/r/351854

Change 353247 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/mediawiki-config@master] Re-enable persistent connection to Redis for jobrunners in labs

https://gerrit.wikimedia.org/r/353247

Change 353247 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable persistent connection to Redis for jobrunners in labs

https://gerrit.wikimedia.org/r/353247

elukey added a comment.EditedFri, May 12, 11:57 AM

After deploying https://gerrit.wikimedia.org/r/353247 to labs I have observed no relevant changes to TCP metrics of any of the following:

  • deployment-jobrunner02
  • deployment-mediawiki05
  • deployment-redis01

From prometheus-beta is easy to select a graph and plot it. As example, here's the list of ESTABLISHED TCP connections for the hosts mentioned above:

After a chat with @hashar it seems that the RedisConnectionPool.php class (that offers persistent connections) is used only by mediawiki, not by the jobrunner service (that instanciate Redis() without any trace of pconnect).

I tried to live hack on deployment-prep.mediawiki05 jobqueue-beta.php changing 'persistent' => defined( 'MEDIAWIKI_JOB_RUNNER' ) to 1 but nothing has really changed.

My understanding of the job queues is still very very bad, so I have these (possibly dumb) questions for @Krinkle or @aaron:

  • Do you have any idea about what I am doing wrong with https://gerrit.wikimedia.org/r/#/c/353247 ? Is my understanding correct about the fact that this setting is only for mediawiki app/api servers and not jobrunners?
  • Should we patch the jobrunner service to offer persistent connections to Redis too?

Sorry for pushing this but in my opinion we'd need to come up with a good plan forward, let me know your thoughts.

Joe added a comment.Fri, May 12, 1:08 PM

After deploying https://gerrit.wikimedia.org/r/353247 to labs I have observed no relevant changes to TCP metrics of any of the following:

  • deployment-jobrunner02
  • deployment-mediawiki05
  • deployment-redis01

    From prometheus-beta is easy to select a graph and plot it. As example, here's the list of ESTABLISHED TCP connections for the hosts mentioned above:
  • deployment-jobrunner02
  • mediawiki05
  • deployment-redis01

    After a chat with @hashar it seems that the RedisConnectionPool.php class (that offers persistent connections) is used only by mediawiki, not by the jobrunner service (that instanciate Redis() without any trace of pconnect).

    I tried to live hack on deployment-prep.mediawiki05 jobqueue-beta.php changing 'persistent' => defined( 'MEDIAWIKI_JOB_RUNNER' ) to 1 but nothing has really changed.

    My understanding of the job queues is still very very bad, so I have these (possibly dumb) questions for @Krinkle or @aaron:
  • Do you have any idea about what I am doing wrong with https://gerrit.wikimedia.org/r/#/c/353247 ? Is my understanding correct about the fact that this setting is only for mediawiki app/api servers and not jobrunners?
  • Should we patch the jobrunner service to offer persistent connections to Redis too?

    Sorry for pushing this but in my opinion we'd need to come up with a good plan forward, let me know your thoughts.

Jobrunner has 2 components:

  1. the jobrunner script, that is ran from the command line and is NOT the source of the disconnect messages
  2. the hhvm server actually executing the jobs, that is running mediawiki and is the origin of the errors you talk about.

So your change should be correct as long as MEDIAWIKI_JOB_RUNNER is declared in mediawiki-config somehow.

With the help of @Addshore I verified that the setting should be picked up correctly, but I probably I am not going to see any improvements in deployment-prep due to the low job traffic volume. Aaron confirmed to me that RunJobs.php runs multiple jobs at the time ('maxJobs' => isset( $_GET['maxjobs'] ) ? $_GET['maxjobs'] : 5000 in RunJobs.php) but probably labs is not the best place to see improvements (small batches of a few jobs at the time).

Next step is to select on jobrunner in Prod and test the change, will try next week.

I tried to manually hack mw1161 in prod but nothing changed from the perspective of the number of TCP sockets in established/time-wait.

After digging a bit in the HHVM Redis() implementation I opened https://github.com/facebook/hhvm/issues/7854 (tightly coupled with https://gerrit.wikimedia.org/r/#/c/349722).

Moritz build hhvm_3.18.2+dfsg-1+wmf4+exp1_amd64.deb with a patch for https://github.com/facebook/hhvm/issues/7854 (basically forcing STREAM_CLIENT_PERSISTENT in stream_client_connect when persistent is true), I tried to test it in deployment-prep on jobrunner02.

I decided to use tcpdump to monitor what was happening on the TCP connections.

Who open TCP connections to Redis?

  1. The jobchron daemon (running via multiple /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobChronService --config-file=/etc/jobrunner/jobrunner.conf)
  2. The jobrunner daemon (running via multiple /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrunner.conf)
  3. The HHVM daemon (via multiple POST requests for rpc/RunJobs.php)

First misunderstanding from my side: all the above don't open a single TCP connection for each Redis command, but they do reuse the same TCP connection. This is easily visible via tcpdump, and really straightforward: until close() is called on the Redis connection, it gets reused by multiple commands.

So what is the issue? I opened a tmux session split in two: first panel tailing the Apache jobqueue-access.log, second panel with the following tcpdump query:

sudo tcpdump -n host '10.68.16.177 and tcp port 6379 and (tcp[tcpflags] & (tcp-syn) != 0 or tcp[tcpflags] & (tcp-fin) != 0)'

10.68.16.177 is the most used Redis instance in deployment-prep. The jobrunner periodically sends a batch of POST requests for rpc/RunJobs.php to HHVM and each of those shows in tcpdump a TCP SYN, some redis commands and a subsequent TCP FIN. All these connections will stay in TIME-WAIT for two minutes on the jobrunner, and of course will hit in bursts Redis periodically.

Same thing happens with hhvm_3.18.2+dfsg-1+wmf4+exp1_amd64.deb. I suspect that HHVM is not able to re-use a TCP connection between multiple RunJobs.php calls, hence the issue that we are seeing..

Hope that what I wrote makes sense, let me know if you have any suggestion :)

Next steps:

  1. Follow up with upstream to confirm that this is the expected behavior
  2. Think about our current RunJobs.php implementation and figure out if anything could be improved to reuse conns (not sure if even possible).