
Intermittent redis connection timeouts in Toolforge
Closed, Resolved · Public

Description

My spi-tools and spi-tools-dev instances both get intermittent connection timeouts trying to connect to 'redis://tools-redis.svc.eqiad.wmflabs:6379/0'

This one's from spi-tools:

2022-09-22 07:05:41,054 [85ed87d139188b1e89459d6256c854f8] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')

These are from spi-tools-dev:

2022-09-24 00:30:28,839 [a02bcb67be496785d354f27de37359a1] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')
2022-09-15 13:23:26,327 [d1b54ab741b3306abd6265a5adfe711d] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')
2022-09-15 13:12:51,447 [d46cdc3ce7e666e86a697d6319fc4753] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')
2022-09-19 02:03:30,663 [c3fc2d85621909d117c9d4fd638d1f9e] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')
2022-09-22 11:03:18,759 [2adbd2f8af1c3f5910fd4594df6ce2f3] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')
2022-09-24 00:43:35,290 [539c3b47867d712a148543369c263d4f] ERROR tools_app.redis: Redis ConnectionError: Error while reading from socket: (110, 'Connection timed out')

Event Timeline


I'm having (or had) similar issues with Redis on qpqtool. It seems like provisioning a fresh connection per request helps, but that does not seem to be the recommended way of using redis-py.

For context, https://github.com/sohomdatta1/qpqtool/commit/dfd65f21b618e3ffdb73d091549524bf748493ce is the change that fixes connection issues for me.
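
For anyone who wants to try the same workaround, here is a minimal sketch of the per-request connection pattern with redis-py (the function and key names are illustrative, not taken from qpqtool):

# Hypothetical sketch: open a short-lived Redis connection per request
# instead of reusing a single long-lived module-level client.
import redis

REDIS_URL = 'redis://tools-redis.svc.eqiad.wmflabs:6379/0'

def get_cached_value(key):
    # A fresh client per call; its connection pool (and sockets) is thrown
    # away afterwards, so a half-dead connection can never be reused.
    client = redis.Redis.from_url(REDIS_URL, socket_timeout=5, socket_connect_timeout=5)
    try:
        return client.get(key)
    finally:
        client.connection_pool.disconnect()

As noted above, this trades connection reuse for robustness: every request pays the cost of a new TCP connection, which is not how redis-py's connection pooling is meant to be used.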

I can also observe this in the editgroups tool, which uses Celery:

[2024-05-22 13:39:42,965: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/data/project/editgroups/www/python/venv/lib/python3.11/site-packages/redis/connection.py", line 706, in send_packed_command
    sendall(self._sock, item)
  File "/data/project/editgroups/www/python/venv/lib/python3.11/site-packages/redis/_compat.py", line 9, in sendall
    return sock.sendall(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Connection timed out

My solution with Redis+Celery is to add a health check script that pings my worker(s); if a worker does not respond, Kubernetes kills the trainwreck and restarts the container (a rough sketch of such a probe is below).

https://gitlab.wikimedia.org/toolforge-repos/matchandsplit uses this configuration and appears to be working mostly fine.
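
For reference, a minimal sketch of such a liveness check, assuming the Celery app is importable as tasks.app (the module name, timeout, and probe wiring are assumptions, not taken from matchandsplit). Kubernetes runs it as an exec liveness probe and restarts the container when it exits non-zero:

#!/usr/bin/env python3
# Hypothetical liveness probe: ping the Celery worker(s) through the broker.
# A non-zero exit tells Kubernetes to restart the container.
import sys

from tasks import app  # assumption: the Celery app lives in tasks.py

def main():
    # inspect().ping() broadcasts a ping and collects replies within `timeout`.
    replies = app.control.inspect(timeout=10).ping()
    if not replies:
        print('no worker replied to ping', file=sys.stderr)
        sys.exit(1)
    print('workers alive: ' + ', '.join(replies))

if __name__ == '__main__':
    main()

The same check is available from the command line as celery -A tasks inspect ping.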

[2024-05-22 13:39:42,965: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...

Where can I find this log line? I'm trying to understand how frequently this happens, and whether I can correlate the times when it happens with other events.

Trying to re-establish the connection... seems to imply that it loses the connection AND it fails to create a new one.

fnegri changed the task status from Open to In Progress. May 23 2024, 10:29 AM
fnegri claimed this task.
fnegri triaged this task as Medium priority.
fnegri moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 10) board.

@fnegri I access the logs like this:

become editgroups
kubectl get pods # find the pod id that starts with `editgroups.celery.sh`
kubectl logs editgroups.celery.sh-58496b6654-gp5pj

In the current logs there seems to be a single occurrence of this error (the others, such as "Worker exited prematurely", are likely unrelated, but I'm not sure).

the others such as "Worker exited prematurely"

I think these errors are caused by Kubernetes killing the process because it's reaching the memory limit.

Each pod gets by default a hard limit of 512Mi:

Limits:
  cpu:     500m
  memory:  512Mi
Requests:
  cpu:     250m
  memory:  256Mi

I would try modifying deployment/celery.yaml, setting requests.memory: 512Mi and limits.memory: 1Gi. You can also try playing with the Celery settings --concurrency and --max-memory-per-child (see the sketch below for the equivalent settings in Python).

You can check the current memory usage of your pods with kubectl top pod: when it reaches the "limit", Kubernetes will kill one or more processes in the container.

I'm not sure if this could explain also the Redis connection/timeout errors, but it's possible.
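
For the Celery side, the same knobs can also be set in the app configuration rather than on the command line; a minimal sketch with illustrative values (the app name is an assumption):

# Hypothetical Celery configuration equivalent to the CLI flags above;
# the values are examples, not recommendations.
from celery import Celery

app = Celery('editgroups', broker='redis://tools-redis.svc.eqiad.wmflabs:6379/0')

app.conf.worker_concurrency = 2                 # same effect as --concurrency=2
app.conf.worker_max_memory_per_child = 200_000  # in KiB, same as --max-memory-per-child=200000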

I looked again at the errors mentioned in the description of this task, related to spi-tools. They are still happening a few times per day (7 times yesterday, 4 the day before).

What I noticed is that every time that error is logged, the following lines are also logged:

2024-06-03 14:07:33,694 [b216204b6f3202b408525e98829a9d3d] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379: (110, 'Connection timed out')
2024-06-03 14:07:34,072 [b216204b6f3202b408525e98829a9d3d] WARNING spi.icache: get_or_set(IndexView.case_names) took 939.046 sec
2024-06-03 14:07:34,078 [b216204b6f3202b408525e98829a9d3d] INFO tools_app.middleware: request took 0:15:39.367713

This shows the request remains stuck for 15 minutes before failing with the Redis ConnectionError. I haven't found an explanation yet for this very long wait.

This shows the request remains stuck for 15 minutes before failing with the Redis ConnectionError. I haven't found an explanation yet for this very long wait.

I just noticed something interesting. If you look at the distribution of cache access times, there's an abrupt change from just kind of slow to the pathologically slow ones:

1.083 sec 2024-05-09 11:33:48,729 [568be678ba04366e9265a6880437636f] WARNING spi.icache: get_or_set(IndexView.case_names) took 1.083 sec
1.195 sec 2024-06-01 23:10:55,876 [cdd558a225cc3ce810d81b7fcf606bbf] WARNING spi.icache: get_or_set(IndexView.case_names) took 1.195 sec
1.230 sec 2024-05-19 17:28:11,581 [9eb0069df8f8ac8f5f6860ddd7c997d4] WARNING spi.icache: get_or_set(IndexView.case_names) took 1.230 sec
1.238 sec 2024-05-31 02:04:19,605 [51c4aaff49706da5860d57563c6eca6e] WARNING spi.icache: get_or_set(IndexView.case_names) took 1.238 sec
1.707 sec 2024-05-29 11:30:41,862 [1e5392a85e419292bdce7c858505b752] WARNING spi.icache: get_or_set(IndexView.case_names) took 1.707 sec
1.730 sec 2024-05-29 11:17:15,641 [2e56edc78ecdf75b28ab5db7907b1180] WARNING spi.icache: get_or_set(IndexView.case_names) took 1.730 sec
2.457 sec 2024-05-09 00:46:37,650 [3ac8a0bb18d85994b1e6e7161749bbcf] WARNING spi.icache: get(spi.CacheableUserContribs.Cirt) took 2.457 sec
2.670 sec 2024-05-09 06:14:40,838 [16d2a11cf7cc34c977b144e710a61a4b] WARNING spi.icache: get_or_set(IndexView.case_names) took 2.670 sec
2.718 sec 2024-05-29 10:54:02,030 [3467fa1a2d25f17007fc2af69c49ee7e] WARNING spi.icache: get_or_set(IndexView.case_names) took 2.718 sec
23.535 sec 2024-06-03 16:44:35,076 [ba324874d182b82467e4a9110303c1f9] WARNING spi.icache: get_or_set(IndexView.case_names) took 23.535 sec
937.936 sec 2024-03-30 22:36:38,688 [c4a737383e95a66990e6f32cec820dd3] WARNING spi.icache: get(spi.CacheableSpiCase.KhndzorUtogh) took 937.936 sec
937.990 sec 2024-05-14 17:32:29,902 [4cdde4faa94b3d00f3fddf4ffc32477f] WARNING spi.icache: get(spi.CacheableUserContribs.Soheelmoon) took 937.990 sec
937.992 sec 2024-04-09 06:28:47,659 [376d06be152d480f8ea87fc227b80ecb] WARNING spi.icache: get(views.get_sock_names.TheDelhiBoy2) took 937.992 sec
938.057 sec 2024-05-07 00:34:52,149 [58b8a524205814bd9bbbde42aed3f531] WARNING spi.icache: get_or_set(IndexView.case_names) took 938.057 sec
938.083 sec 2024-04-06 17:15:03,737 [fcaeca546abd5a66562f4be48e08bb7e] WARNING spi.icache: get_or_set(IndexView.case_names) took 938.083 sec
938.114 sec 2024-04-09 09:17:45,525 [642f7b08d6a9b9fecbef51b252a8a72e] WARNING spi.icache: get_or_set(IndexView.case_names) took 938.114 sec
938.151 sec 2024-04-13 13:54:24,605 [971928f1375663ae02ad12c6e1bb1b0d] WARNING spi.icache: get_or_set(IndexView.case_names) took 938.151 sec
938.152 sec 2024-05-13 20:54:26,977 [142afdc6bf8d7b7a9c0018d3a46de7b0] WARNING spi.icache: get(spi.CacheableUserContribs.Hogo-2020) took 938.152 sec
938.160 sec 2024-05-18 13:38:14,915 [eb80e932ffc26a659c77b4d8afb4ef22] WARNING spi.icache: get_or_set(IndexView.case_names) took 938.160 sec
938.162 sec 2024-03-31 17:08:03,937 [616245e7a510a744c37d1e8bf2da2d0d] WARNING spi.icache: get_or_set(IndexView.case_names) took 938.162 sec

But what's fascinating is they start at 937 sec. And the longest is 944 seconds:

940.227 sec 2024-03-26 13:13:17,911 [793f3620cadc9275b2b2928cb84b27c7] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.227 sec
940.237 sec 2024-04-22 17:53:54,505 [7a513fa021063a30b1f2f6426016458a] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.237 sec
940.243 sec 2024-05-02 02:43:32,293 [d23a35931f9381a814ccf74e3ee9b8ee] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.243 sec
940.286 sec 2024-03-25 13:34:19,656 [2acf5153891f56179fb6e6669fa913f3] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.286 sec
940.319 sec 2024-05-26 16:37:33,259 [a64ab0e549938a38fd07ae05552a8c16] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.319 sec
940.387 sec 2024-05-21 13:28:11,974 [65241051041c444c11b646bf4400264c] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.387 sec
940.391 sec 2024-05-10 20:20:18,658 [8f032b3b9d9db86fef4a8bd4d866848d] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.391 sec
940.411 sec 2024-05-10 20:20:18,574 [29715b322e7daab455856afee8580457] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.411 sec
940.547 sec 2024-05-10 20:20:18,776 [7881f8b65b651fe4ab5cbf539b8b2068] WARNING spi.icache: get_or_set(IndexView.case_names) took 940.547 sec
944.644 sec 2024-03-19 14:03:00,566 [1d1fec1c8845ab21f79f213b5a5459b5] WARNING spi.icache: get_or_set(IndexView.case_names) took 944.644 sec

That smells like multiple timeouts cascading; a 15 minute (900 second) timeout that happens after a 30 second timeout. So I'm guessing there's something that's retrying connections with increasingly longer timeouts each time. No clue if that helps figure out what's going on, but it might be useful to see if the other apps that are having this problem exhibit this same behavior.

Also, are the various apps that have this problem all using the same stack, i.e. Python with Django?
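
(Aside: for anyone who wants to reproduce the distribution above, a rough sketch of pulling the timings out of the Django logs, assuming the spi.icache log format shown above and a hypothetical log path:)

# Rough sketch: extract the 'took N sec' values from the spi-tools Django logs
# and print them sorted by duration. The glob pattern is a guess based on the
# file names that appear later in this task.
import glob
import re

PATTERN = re.compile(r'WARNING spi\.icache: \S+ took ([0-9.]+) sec')

timings = []
for path in glob.glob('logs/django/django.log.*'):
    with open(path, errors='replace') as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                timings.append((float(m.group(1)), line.strip()))

for seconds, line in sorted(timings):
    print('%9.3f sec  %s' % (seconds, line))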

What makes this even weirder is I've got my cache configured with a 300 second timeout:

REDIS_CACHE = {
    'BACKEND': 'django_redis.cache.RedisCache',
    'LOCATION': 'redis://tools-redis.svc.eqiad.wmflabs:6379/0',
    'TIMEOUT': 300,
    'KEY_PREFIX': str(uuid4()),
    'OPTIONS': {
        'CLIENT_CLASS': 'django_redis.client.DefaultClient',
        'IGNORE_EXCEPTIONS': True,
    }
}

so, even less clue what's going on.
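
One thing worth noting about the snippet above: in Django's cache settings, TIMEOUT is the default key expiry, not a network timeout, so it doesn't bound how long a single Redis call can block. A sketch of capping the socket-level waits instead, assuming the installed django-redis version supports these OPTIONS keys (the 5-second values are illustrative):

# Hypothetical variant of the cache config above with explicit socket timeouts.
from uuid import uuid4

REDIS_CACHE = {
    'BACKEND': 'django_redis.cache.RedisCache',
    'LOCATION': 'redis://tools-redis.svc.eqiad.wmflabs:6379/0',
    'TIMEOUT': 300,  # key expiry (seconds), not a network timeout
    'KEY_PREFIX': str(uuid4()),
    'OPTIONS': {
        'CLIENT_CLASS': 'django_redis.client.DefaultClient',
        'IGNORE_EXCEPTIONS': True,
        'SOCKET_CONNECT_TIMEOUT': 5,  # seconds to wait for the TCP connect
        'SOCKET_TIMEOUT': 5,          # seconds to wait on a blocked read/write
    }
}

Whether this would actually avoid the 15-minute hangs here is untested, but it should at least surface the failure much sooner.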

I wonder if it could be something that's not related to Redis at all, but instead something else that blocks the application thread for a long time. Just a guess, I might be completely wrong.

I'm away until next Monday but I can do some further investigation next week.

The last error in spi-tools was on 2024-06-18, just a coincidence or have things improved for some reason? Is anyone else still having Redis connection issues?

I have been having this issue in the sdzerobot tool. After many months of trouble with the node-redis client, I moved to ioredis, thinking it was a client issue. But the issue persists. It appears the client sometimes spends 15+ minutes trying to reconnect after a connection loss.

I'm still seeing it too. They seem to come in clusters. There were 14 on 2024-06-11 and 5 on 2024-06-17.

Considering how long this has been going on and the number of distinct apps and client libraries that are experiencing issues, may I suggest this be escalated to a higher priority?

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Redis_for_Toolforge#Redis_containers may help some folks who are mostly using the shared Redis as queue or cache if they can also tolerate state loss on service restart.

fnegri raised the priority of this task from Medium to High. Jul 1 2024, 4:24 PM

@RoySmith escalating to High; I'll spend some more time trying to understand what is causing this issue.

In the meantime, let us know if @bd808's solution could fit your needs.

Thanks,

In theory, I think BD's solution would work. I am only using Redis as a cache, and data loss, while annoying, wouldn't be fatal. On the other hand, I'm resisting going in that direction because it's another moving part I'd have to manage myself, and I'm trying to avoid that if I can.

Apologies for the lack of updates on this task, it remained on my radar but I didn't find the time for a more in-depth investigation.

I checked the spi-tools logs today and the last timeout error I see is from 2024-08-07.

Are people still having issues with Redis timeouts? If so, can you please point me to some recent logs?

Looks like timeouts are visible in sample-complex-app, which is using Python and Celery:

tools.sample-complex-app@tools-bastion-13:~$ toolforge jobs logs backend-worker

[...]

2024-09-17T07:13:57+00:00 [backend-worker-56b488598c-8dngn]   File "/layers/heroku_python/dependencies/lib/python3.12/site-packages/redis/_parsers/socket.py", line 78, in _read_from_socket
2024-09-17T07:13:57+00:00 [backend-worker-56b488598c-8dngn]     raise TimeoutError("Timeout reading from socket")
2024-09-17T07:13:57+00:00 [backend-worker-56b488598c-8dngn] redis.exceptions.TimeoutError: Timeout reading from socket
fnegri changed the task status from In Progress to Stalled. Jan 13 2025, 4:31 PM

@fnegri hi! What is this currently blocked on?

This one is just "stalled" but not blocked. I'm a bit out of ideas on how to debug further.

I've just rechecked the sample-complex-app and there is one timeout in the last few days:

tools.sample-complex-app@tools-bastion-13:~$ toolforge jobs logs backend-worker |grep redis.exceptions.TimeoutError
2025-03-19T07:02:46+00:00 [backend-worker-56b488598c-7tqnn] redis.exceptions.TimeoutError: Timeout reading from socket

spi-tools has more timeouts, 25 since January:

spi-tools [~] grep -c "Connection timed out" logs/django/django.log.2025-01-08T17\:22\:23.035950
25

spi-tools [~] grep "Connection timed out" logs/django/django.log.2025-01-08T17\:22\:23.035950
2025-01-17 14:04:10,993 [455cf68a601e12caed9a194ffe3845e1] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')
2025-01-21 09:50:15,925 [d82432e1d25a8d71e6e64b9e0a8b78c3] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')

[...]

2025-03-21 07:10:30,278 [d22b6688cb9f0012c1602fa492644de5] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')
2025-03-21 07:15:29,265 [60758a5a3bc096e03ecbf1bba2043b8e] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')
2025-03-21 07:19:12,497 [931f96051fe55b09a8544a4fa5fc5d94] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')
2025-03-21 07:22:57,791 [f5d302da4651ea85f83694b9e214e2e6] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')
2025-03-24 08:06:11,588 [56a9da4aa0bcd3e2bffadf109c76d1a1] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')

Mentioned in SAL (#wikimedia-cloud) [2025-04-28T13:05:08Z] <dhinus> tools-db-5: systemctl stop mariadb && systemctl start mariadb T318479

Mentioned in SAL (#wikimedia-cloud) [2025-04-28T13:05:08Z] <dhinus> tools-db-5: systemctl stop mariadb && systemctl start mariadb T318479

I used the wrong task id, that comment belonged to T392596

This one is just "stalled" but not blocked. I'm a bit out of ideas on how to debug further.

Maybe we should try upgrading Toolforge Redis to Valkey 8.0? The new version has a lot of performance improvements (see https://github.com/valkey-io/valkey/issues/22 and https://github.com/valkey-io/valkey/pull/861), which might provide better resilience in holding a large number of connections.

I know Redis 8.0 is also open source, but considering the history, it might be more natural to migrate to Valkey, which is "more open" and is maintained by many of the former Redis maintainers.

fnegri changed the task status from Stalled to Open. Jul 15 2025, 1:24 PM
fnegri edited projects, added Toolforge; removed Toolforge (Toolforge iteration 22).

No one is currently working on this task, so I'll move it back to the backlog but keep it as "High" priority.

@SD0001 moving to Valkey is an interesting idea, I would create a dedicated task about the migration.

Is this still happening? I suspect the problem was related to k8s workers periodically renewing their DHCP lease and briefly losing the network. See also T400223: Investigate daily disconnections of IRC bots hosted in Toolforge

fnegri claimed this task.

Interesting theory. I checked the error logs for the same two tools I had checked back in March (https://phabricator.wikimedia.org/T318479#10670161), sample-complex-app is not showing any logs for unrelated reasons, while spi-tools shows only one error in the last 6 months.

tools.sample-complex-app@tools-bastion-15:~$ toolforge jobs logs backend-worker
ERROR: Job 'backend-worker' does not have any logs available

spi-tools:~$ grep -c "Connection timed out" logs/django/django.log.2025*
logs/django/django.log.2025-01-08T17:22:23.035950:29
logs/django/django.log.2025-04-02T12:37:44.128016:0
logs/django/django.log.2025-04-03T20:44:29.803471:0
logs/django/django.log.2025-04-04T00:18:03.798197:0
logs/django/django.log.2025-04-07T12:43:07.972109:0
logs/django/django.log.2025-04-07T13:04:54.122624:0
logs/django/django.log.2025-04-10T15:35:22.979422:3
logs/django/django.log.2025-04-17T02:14:55.379432:23
logs/django/django.log.2025-06-06T15:14:44.035477:0
logs/django/django.log.2025-06-19T01:42:21.388734:0
logs/django/django.log.2025-06-19T14:54:54.059516:0
logs/django/django.log.2025-07-29T15:33:01.126584:0
logs/django/django.log.2025-08-21T07:39:01.808821:0
logs/django/django.log.2025-08-21T07:56:00.824283:0
logs/django/django.log.2025-09-08T10:30:23.275158:0
logs/django/django.log.2025-09-13T03:51:39.058440:1
logs/django/django.log.2025-10-13T08:09:07.406502:0
logs/django/django.log.2025-10-13T08:36:49.922413:0

spi-tools:~$ grep "Connection timed out" logs/django/django.log.2025-09-13T03\:51\:39.058440
2025-09-30 22:36:27,682 [1cc7aac21ed63003c31711813ea744c7] ERROR tools_app.redis: Redis ConnectionError: Error while reading from tools-redis.svc.eqiad.wmflabs:6379 : (110, 'Connection timed out')

I will optimistically mark this as Resolved.