
[toolforge] Redis refusing connections
Open, In Progress, High | Public | BUG REPORT

Description

On two occasions, Toolforge Redis stopped accepting new connections with the following error:

Apr 28 13:32:11 tools-redis-7 prometheus-redis-exporter[1095897]: time="2024-04-28T13:32:11Z" level=error msg="Couldn't set client name, err: ERR max number of clients reached"

@taavi suspects this is caused by clients not closing their connections, as Redis by default does not have a connection timeout.

11:32 <taavi> the redis docs say that 'By default recent versions of Redis don't close the connection with the client if the client is idle for many seconds: the connection will remain open forever.', which seems like exactly the sort of thing that would cause this kind of issue
11:33 <dhinus> yep, did we update the Redis version recently?
11:33 <taavi> not as far as I'm aware
11:33 <dhinus> or maybe some new tool is using the connection without closing it?
11:33 <dhinus> is there a Redis setting to force-close the connections after some time?
11:34 <taavi> yeah, or some network instability causing more connections to drop, or something like that
11:34  * arturo mumbles a joke about redis wanting to be replaced by valkey
11:34 <dhinus> I will open a task to track this issue
11:34 <taavi> apparently the 'timeout' setting can be used for that
11:34 <taavi> https://redis.io/docs/latest/develop/reference/clients/#client-timeouts
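To check whether idle connections are piling up before a timeout is set, the text output of the Redis CLIENT LIST command can be inspected. This is a minimal sketch, not something from the task: the function names and the sample data (addresses, ages, idle times) are made up for illustration, and only the `key=value` line format and the `idle` field come from Redis itself.

```python
from __future__ import annotations


def parse_client_list(raw: str) -> list[dict[str, str]]:
    """Parse CLIENT LIST text: one client per line, space-separated key=value pairs."""
    clients = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = {}
        for token in line.split(" "):
            key, sep, value = token.partition("=")
            if sep:  # skip anything that is not a key=value pair
                fields[key] = value
        clients.append(fields)
    return clients


def idle_clients(raw: str, min_idle_seconds: int) -> list[dict[str, str]]:
    """Return the clients whose idle time is at least min_idle_seconds."""
    return [
        client
        for client in parse_client_list(raw)
        if int(client.get("idle", "0")) >= min_idle_seconds
    ]


# Hypothetical sample output, shortened to a few fields per client.
SAMPLE = (
    "id=3 addr=10.0.0.5:52555 fd=8 name= age=90000 idle=86000 flags=N db=0 cmd=get\n"
    "id=4 addr=10.0.0.6:40112 fd=9 name=worker age=120 idle=2 flags=N db=0 cmd=ping\n"
)

stale = idle_clients(SAMPLE, min_idle_seconds=600)
print([client["addr"] for client in stale])  # ['10.0.0.5:52555']
```

Grouping the result by `addr` or `name` would be one way to spot a single tool that is leaking connections.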

Event Timeline

fnegri triaged this task as High priority. Apr 29 2024, 2:28 PM

Redis 7, which is available in bookworm, includes the ability to configure client eviction when the server hits a memory pool limit. https://redis.io/docs/latest/develop/reference/clients/#client-eviction

dcaro changed the task status from Open to In Progress. Apr 30 2024, 1:29 PM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 09) board.

Change #1029158 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] P:toolforge:redis_sentinel: set redis timeout

https://gerrit.wikimedia.org/r/1029158

fnegri changed the task status from In Progress to Stalled. Fri, May 10, 1:33 PM
fnegri moved this task from In Progress to In Review on the Toolforge (Toolforge iteration 09) board.

Today I came across T318479: Intermittent redis connection timeouts in Toolforge, which makes me slightly worried that setting a timeout could cause issues for some clients. On the other hand, since the Redis server is shared by many tools, it seems risky not to have a timeout at all, because a connection leak from a single tool can affect every other tool using Redis.
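A client that keeps a long-lived connection may see it closed by the server once a timeout is in place. One way to cope is to reconnect and retry once, sketched below. This is only an illustration under stated assumptions: `call_with_reconnect` and `FakeConnection` are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception a real Redis client library raises (pass that type through `retry_on`).

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_reconnect(
    operation: Callable[[], T],
    reconnect: Callable[[], None],
    retry_on: tuple[type[BaseException], ...] = (ConnectionError,),
    attempts: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Run operation; on a connection failure, reconnect and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts, surface the original error
            reconnect()
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


class FakeConnection:
    """Hypothetical stand-in for a client whose idle socket was closed by the server."""

    def __init__(self) -> None:
        self.alive = False
        self.reconnects = 0

    def get(self, key: str) -> str:
        if not self.alive:
            raise ConnectionError("connection closed by server")
        return f"value-of-{key}"

    def reconnect(self) -> None:
        self.alive = True
        self.reconnects += 1


conn = FakeConnection()
result = call_with_reconnect(lambda: conn.get("counter"), conn.reconnect)
print(result, conn.reconnects)  # value-of-counter 1
```

Retrying blindly is only safe for idempotent commands; a tool issuing non-idempotent writes would need more care than this sketch shows.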

fnegri changed the task status from Stalled to In Progress. Mon, May 27, 10:25 AM

Change #1029158 merged by FNegri:

[operations/puppet@production] P:toolforge:redis_sentinel: set redis timeout

https://gerrit.wikimedia.org/r/1029158

I merged https://gerrit.wikimedia.org/r/1029158 today, but the change has not rolled out to the Redis servers yet, for two reasons:

  1. (now resolved) the Redis servers were not fetching the latest Puppet code (T364492)
  2. (still unresolved) I expected the patch to add a line timeout 600 to /etc/redis/tcp_6379.conf, but for some reason it did not

> (still unresolved) I expected the patch to add a line timeout 600 to /etc/redis/tcp_6379.conf, but for some reason it did not

The config file did not change, because that file has replace => false in Puppet, which means Puppet will create it if it does not exist, but will never overwrite it with new content. This is because of T309014.

The fix is to delete the file, then run Puppet to recreate it with the new values. I will do this on Monday to avoid breaking anything over the weekend.
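The behaviour described above (create the file if it is missing, never overwrite it) and the delete-then-recreate fix can be modelled in a few lines. This is a toy simulation of the semantics as described in this comment, not Puppet's actual implementation: `ensure_file` is a hypothetical helper, and the `port 6379` content is an assumed placeholder. Only `timeout 600` and the file name come from the task.

```python
from pathlib import Path
import tempfile


def ensure_file(path: Path, content: str, replace: bool) -> bool:
    """Write content to path; with replace=False an existing file is left alone.

    Returns True if the file was written.
    """
    if path.exists() and not replace:
        return False  # file already present: keep the old content
    path.write_text(content)
    return True


OLD_CONFIG = "port 6379\n"               # assumed placeholder content
NEW_CONFIG = "port 6379\ntimeout 600\n"  # desired content after the patch

with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "tcp_6379.conf"
    conf.write_text(OLD_CONFIG)

    # First run: the file exists, so the new setting never lands.
    wrote_first = ensure_file(conf, NEW_CONFIG, replace=False)
    after_first = conf.read_text()

    # The fix: delete the file, then run again so it is recreated.
    conf.unlink()
    wrote_second = ensure_file(conf, NEW_CONFIG, replace=False)
    after_second = conf.read_text()

print(wrote_first, "timeout 600" in after_first)    # False False
print(wrote_second, "timeout 600" in after_second)  # True True
```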

I added a note to the Admin/Redis wiki, and created T366365: [toolforge] [redis] Improve Puppet config to track a possible improvement.