Page MenuHomePhabricator

celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log
Closed, ResolvedPublic0 Estimated Story Points

Description

Today three celery-ores-worker services failed at around the same time on ores100[2,4,5]. From the logs the only thing that I can see is:

Aug 06 17:36:09 ores1005 systemd[1]: Stopping Celery workers...
Aug 06 17:36:11 ores1005 celery-ores-worker[23634]: worker: Warm shutdown (MainProcess)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:  -------------- celery@ores1005 v4.1.1 (latentcall)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: ---- **** -----
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: --- * ***  * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.9 2019-07-17 20:34:30
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -- * - **** ---
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- [config]
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> app:         ores.scoring_systems.celery_queue:0x7f17b5defd68
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> transport:   redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> results:     redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - *** --- * --- .> concurrency: 90 (prefork)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: --- ***** -----
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:  -------------- [queues]
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:                 .> celery           exchange=celery(direct) key=celery
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:
Aug 06 17:36:13 ores1005 systemd[1]: Stopped Celery workers.
Aug 06 17:36:13 ores1005 systemd[1]: Started Celery workers.
Aug 06 17:36:41 ores1005 celery-ores-worker[13281]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 06 17:36:42 ores1005 celery-ores-worker[13281]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.


Aug 21 11:31:12 ores1005 celery-ores-worker[13281]: /srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: h
Aug 21 11:31:12 ores1005 celery-ores-worker[13281]:   'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:  -------------- celery@ores1005 v4.1.1 (latentcall)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: ---- **** -----
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: --- * ***  * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.9 2019-08-06 17:36:46
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -- * - **** ---
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- [config]
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> app:         ores.scoring_systems.celery_queue:0x7f6dab334ba8
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> transport:   redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> results:     redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - *** --- * --- .> concurrency: 90 (prefork)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: --- ***** -----
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:  -------------- [queues]
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:                 .> celery           exchange=celery(direct) key=celery
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Unit entered failed state.
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Failed with result 'exit-code'.
Aug 21 11:38:28 ores1005 systemd[1]: Started Celery workers.
Aug 21 11:38:40 ores1005 celery-ores-worker[25637]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 21 11:38:41 ores1005 celery-ores-worker[25637]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.

The code path doesn't seem to lead to any shutdown, and the rest seems to be the log that celery emits when shutting down (it can be seen before in the logs).

Also we got alarmed by the systemd unit failed alert, not by any celery specific ones (not sure if we don't alarm on purpose or if we are missing some monitor).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
elukey renamed this task from celery-ores-worker service failed on ores100[2,3,5] without any apparent reason or significant log to celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log.Aug 21 2019, 12:07 PM
elukey updated the task description. (Show Details)

It looks to me like all of this log output is actually from celery starting back up.

I wonder what the logs said before this.

Either way, we need to have a monitor on this.

It looks to me like all of this log output is actually from celery starting back up.

I wonder what the logs said before this.

Either way, we need to have a monitor on this.

I updated the task descriptions, systemd logs a shutdown right after the msg.. Moreover we had to manually start celery.

On ores1002, I see the following in app.log:

2019-08-21 11:31:10,673 ERROR celery.worker.pidbox: Control command error: OperationalError("\nCannot route mess
age for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.
reply.celery.pidbox') has been removed from the Redis database.\n",)
Traceback (most recent call last):
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/connection.py", line 431, in _reraise_as_library_errors
    yield
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/connection.py", line 510, in _ensured
    return fun(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/transport/virtual/base.py", line 605, in basic_publish
    message, exchange, routing_key, **kwargs
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/transport/virtual/exchange.py", line 70, in deliver
    for queue in _lookup(exchange, routing_key):
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/transport/virtual/base.py", line 714, in _lookup
    self.get_table(exchange),
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/kombu/transport/redis.py", line 829, in get_table
    raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key))
kombu.exceptions.InconsistencyError: 
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

The only time I have seen something like this before is when redis gets it's keys flushed while it is in operation.

I see the same error on ores1006. But celery is clearly still running there.

But on ores1006, the top-level error is:

redis.exceptions.TimeoutError: Timeout reading from oresrdb.svc.eqiad.wmnet:6379

While on ores1002, the top-level error is:

  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/redis/client.py", line 3164, in parse_response
    'pubsub connection not set: '
RuntimeError: pubsub connection not set: did you forget to call subscribe() or psubscribe()?

Followed by:

  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/copy.
py", line 174, in deepcopy
    rv = reductor(4)
TypeError: can't pickle _thread.lock objects

I'm guessing that what happened is that redis had a hiccup. ores100[1,3,6,7,8,9] all hit a timeout error and recovered as soon as redis was available again. But ores100[2,4,5] hit a deeper error while trying to recover and then failed during some deep copy that hit a _thread.lock object. I'm not sure if addressing T122676: Implement sentinel for ORES production Redis would have fixed this or not. But right now, I think fixing our monitoring is going to be most important.

I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?

Joe triaged this task as Medium priority.

I'm looking into what it would take to monitor a celery worker pool on a specific machine. It looks like celery has a lot of utilities that might help. See https://docs.celeryproject.org/en/latest/userguide/monitoring.html#management-command-line-utilities-inspect-control

Right now, I want to run some experiments on my dev machine to see what it looks like when celery is running/stopped.

So, I've been trying explore the behaviors of celery -A ores_celery inspect ping to see if it will work for us as a up-check. So far, I have determined that there is no real consistent behavior. Sometimes when I run this "ping" command, it will just hang indefinitely. Other times, it will report that celery didn't respond in time when it is actually online! I'm honestly not sure what the purpose of this utility is!

celery -A ores_celery inspect stats seems to have better behavior while celery is running, but it just hangs when celery is offline.

Worse, this utility loads our entire "application" into memory before it can function which means that it is taking a huge bite out of memory and using ton of CPU to deserialize all of our models. So, I don't like it. I wonder if a simple check on ps aux | grep celery | grep -v grep | wc -l is better. Maybe we can set some bounds on our expected response. We should expect # of works plus 1. So 16 workers = 17 processes returned. @Joe, what do you think of that?

@Joe, do you think counting processes using ps would make for a good check?

@Dzahn, what would it take to implement a check like the one I described above? Is there a way that we could reach the ORES config in the check so that we can update the # of workers in one spot only?

Change 540155 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Alert on aberrant amount of celery workers

https://gerrit.wikimedia.org/r/540155

Change 540155 merged by Giuseppe Lavagetto:
[operations/puppet@production] ores: Alert on aberrant amount of celery workers

https://gerrit.wikimedia.org/r/540155

First of all, apologies for losing track of this task. What you were asking was indeed pretty simple to do - and should have been in puppet already in the first place.

The change by @akosiaris introduces a check on the number of running celery workers. It should be enough to make us notice immediately if something like that goes wrong again.

That should be live in the next hour or so on all ORES workers.

Thanks @Joe and no worries. I'm happy to move this one off our backlog :) Thank you both for getting it done.

I 'll resolve this, all workers have now the check, with 91 children each, we will be alerted if this deviates too much from the configured thresholds

DannyS712 subscribed.

[batch] remove patch for review tag from resolved tasks