
Celery manager implodes horribly if Redis goes down
Closed, Resolved, Public

Description

Looks like celery does something that causes it to OOM if Redis goes down. This needs to be handled gracefully: log correctly and keep retrying.

Also, we need to document that currently all celery workers must be *manually* restarted (scap --service-restart doesn't work) if the Redis server is rebooted.
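
As a starting point for the "keep retrying" part, the relevant knobs on the Celery side are probably the broker connection retry settings. A minimal sketch, assuming a plain Redis broker on localhost; the app name and values are illustrative, not ORES's actual configuration:

from celery import Celery

app = Celery('ores_celery', broker='redis://localhost:6379/0')

app.conf.update(
    # Keep retrying the broker connection instead of crashing the worker.
    broker_connection_retry=True,
    # None (or 0) means retry forever rather than giving up after N attempts.
    broker_connection_max_retries=None,
)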

Event Timeline

Ottomata triaged this task as Medium priority. Jan 16 2018, 8:07 PM
Harej raised the priority of this task from Medium to High. Aug 3 2018, 7:05 PM

@Harej Thanks for the bump, this scares me too.

After celery4 this doesn't happen anymore. It recovers properly. Tested on my localhost. The log output is also proper:
2018-11-02 20:43:00,027 WARNING:celery.worker.consumer.consumer -- consumer: Connection to broker lost. Trying to re-establish the connection...

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 177, in _read_from_socket
    raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2408, in _execute
    return command(*args)
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 284, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 216, in readline
    self._read_from_socket()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 191, in _read_from_socket
    (e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 484, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 541, in _connect
    raise err
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 529, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/celery/worker/consumer/consumer.py", line 316, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.6/dist-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.6/dist-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.6/dist-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/usr/local/lib/python3.6/dist-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python3.6/dist-packages/kombu/transport/redis.py", line 1040, in on_readable
    self.cycle.on_readable(fileno)
  File "/usr/local/lib/python3.6/dist-packages/kombu/transport/redis.py", line 337, in on_readable
    chan.handlers[type]()
  File "/usr/local/lib/python3.6/dist-packages/kombu/transport/redis.py", line 667, in _receive
    ret.append(self._receive_one(c))
  File "/usr/local/lib/python3.6/dist-packages/kombu/transport/redis.py", line 678, in _receive_one
    response = c.parse_response()
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2430, in parse_response
    return self._execute(connection, connection.read_response)
  File "/usr/local/lib/python3.6/dist-packages/redis/client.py", line 2415, in _execute
    connection.connect()
  File "/usr/local/lib/python3.6/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379. Connection refused.
2018-11-02 20:43:00,045 ERROR:celery.worker.consumer.consumer -- consumer: Cannot connect to redis://localhost:6379//: Error 111 connecting to localhost:6379. Connection refused..
Trying again in 2.00 seconds...
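
For anyone who wants to reproduce this test locally, a minimal sketch (assuming celery4 and a stock Redis on localhost:6379; the module name repro is made up):

from celery import Celery

app = Celery('repro', broker='redis://localhost:6379/0')

@app.task
def ping():
    return 'pong'

Run it with celery -A repro worker --loglevel=info, then stop redis-server and watch the worker log for the same "Connection to broker lost" warning followed by the "Trying again in 2.00 seconds..." retry loop.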

Shall we call this resolved, or should we try bringing down Redis in production? 😈

Mentioned in SAL (#wikimedia-releng) [2018-11-02T19:54:46Z] <Amir1> stopping redis-server on deployment-ores01, this would make ores go down in beta cluster. Testing for T181632

After celery4 this doesn't happen anymore. It recovers properly.

What great news, thanks!

Ladsgroup moved this task from Completed to Parked on the Machine-Learning-Team (Active Tasks) board.

It's interesting: without many requests it's fine, but when it's under lots of requests and Redis goes down, two things happen: 1- The PoolCounter lock won't get released, but since it has a timeout of 15 seconds, meh. 2- uwsgi can't send requests to celery. Celery itself doesn't need anything, but uwsgi needs a restart. I'm thinking it might be because of the redis cache or something else, especially since celery.errors.Timeout doesn't happen and another timeout bubbles up. Will investigate this more tomorrow.

Okay, the whole reason behind it is that poolcounter goes crazy. I didn't need to restart uwsgi; I was able to make it recover by restarting poolcounter. The Labs instance and my localhost didn't need anything and recovered just fine (because they don't use poolcounter, I guess). This would probably fix the PoolCounter craziness in case of errors: https://github.com/wikimedia/ores/pull/283
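
For context, a fix in a case like this usually boils down to always releasing the PoolCounter slot, even when scoring blows up because the broker is gone. A rough sketch of the pattern; PoolCounterClient-style acquire() and release() are hypothetical names here, not the actual ores/poolcounter API:

def score_with_pool(pool, key, do_score):
    # pool is assumed to be some PoolCounter client with acquire()/release()
    pool.acquire(key, timeout=15)
    try:
        return do_score()
    finally:
        # Release even if do_score() raises (e.g. Redis/celery is down),
        # so errors don't leave dead locks piling up in poolcounter.
        pool.release(key)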

Tested on the beta cluster and it now fully recovers in every situation \o/ Will deploy this on Monday.

Mentioned in SAL (#wikimedia-operations) [2018-11-05T12:33:32Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@096ffb3]: T208577 T181632 T208608 (duration: 22m 58s)