
celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log
Open · Normal · Public · 0 Story Points

Description

Today the celery-ores-worker service failed at around the same time on three hosts, ores100[2,4,5]. From the logs, the only thing I can see is:

Aug 06 17:36:09 ores1005 systemd[1]: Stopping Celery workers...
Aug 06 17:36:11 ores1005 celery-ores-worker[23634]: worker: Warm shutdown (MainProcess)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:  -------------- celery@ores1005 v4.1.1 (latentcall)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: ---- **** -----
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: --- * ***  * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.9 2019-07-17 20:34:30
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -- * - **** ---
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- [config]
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> app:         ores.scoring_systems.celery_queue:0x7f17b5defd68
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> transport:   redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> results:     redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - *** --- * --- .> concurrency: 90 (prefork)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: --- ***** -----
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:  -------------- [queues]
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:                 .> celery           exchange=celery(direct) key=celery
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:
Aug 06 17:36:13 ores1005 systemd[1]: Stopped Celery workers.
Aug 06 17:36:13 ores1005 systemd[1]: Started Celery workers.
Aug 06 17:36:41 ores1005 celery-ores-worker[13281]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 06 17:36:42 ores1005 celery-ores-worker[13281]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.


Aug 21 11:31:12 ores1005 celery-ores-worker[13281]: /srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: h
Aug 21 11:31:12 ores1005 celery-ores-worker[13281]:   'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:  -------------- celery@ores1005 v4.1.1 (latentcall)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: ---- **** -----
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: --- * ***  * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.9 2019-08-06 17:36:46
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -- * - **** ---
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- [config]
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> app:         ores.scoring_systems.celery_queue:0x7f6dab334ba8
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> transport:   redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> results:     redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - *** --- * --- .> concurrency: 90 (prefork)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: --- ***** -----
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:  -------------- [queues]
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:                 .> celery           exchange=celery(direct) key=celery
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Unit entered failed state.
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Failed with result 'exit-code'.
Aug 21 11:38:28 ores1005 systemd[1]: Started Celery workers.
Aug 21 11:38:40 ores1005 celery-ores-worker[25637]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 21 11:38:41 ores1005 celery-ores-worker[25637]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.

The code path doesn't seem to lead to any shutdown, and the rest looks like the banner that celery emits when shutting down (the same output can be seen earlier in the logs).

Also, we were alerted by the generic systemd "unit failed" alert, not by any celery-specific one (not sure whether we skip celery-specific alerting on purpose or are just missing a monitor).

Event Timeline

elukey created this task. Aug 21 2019, 12:06 PM
elukey renamed this task from celery-ores-worker service failed on ores100[2,3,5] without any apparent reason or significant log to celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log. Aug 21 2019, 12:07 PM
elukey updated the task description.
jbond added a subscriber: jbond. Aug 21 2019, 12:08 PM
Halfak added a subscriber: Halfak. Aug 21 2019, 1:54 PM

It looks to me like all of this log output is actually from celery starting back up.

I wonder what the logs said before this.

Either way, we need to have a monitor on this.

elukey updated the task description. Aug 21 2019, 1:59 PM

> It looks to me like all of this log output is actually from celery starting back up.
> I wonder what the logs said before this.
> Either way, we need to have a monitor on this.

I updated the task description: systemd logs a shutdown right after that message. Moreover, we had to start celery manually.

elukey updated the task description. Aug 21 2019, 2:01 PM

On ores1002, I see the following in app.log:

2019-08-21 11:31:10,673 ERROR celery.worker.pidbox: Control command error: OperationalError("\nCannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.\nProbably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.\n",)
Traceback (most recent call last):
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/connection.py", line 431, in _reraise_as_library_errors
    yield
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/connection.py", line 510, in _ensured
    return fun(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/transport/virtual/base.py", line 605, in basic_publish
    message, exchange, routing_key, **kwargs
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/transport/virtual/exchange.py", line 70, in deliver
    for queue in _lookup(exchange, routing_key):
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/transport/virtual/base.py", line 714, in _lookup
    self.get_table(exchange),
  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 829, in get_table
    raise InconsistencyError(NO_ROUTE_ERROR.format(exchange, key))
kombu.exceptions.InconsistencyError: 
Cannot route message for exchange 'reply.celery.pidbox': Table empty or key no longer exists.
Probably the key ('_kombu.binding.reply.celery.pidbox') has been removed from the Redis database.

The only time I have seen something like this before is when redis gets its keys flushed while it is in operation.
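
For what it's worth, a quick way to confirm that theory would be to look for the kombu binding key directly in Redis. A minimal sketch using the redis-py client (the connection details are copied from the broker URL above; the password placeholder is obviously not real):

import redis

# Connect to the same Redis instance celery uses as its broker.
r = redis.StrictRedis(host='oresrdb.svc.eqiad.wmnet', port=6379, db=0, password='<redacted>')

# kombu keeps exchange -> queue bindings in sets named '_kombu.binding.<exchange>'.
# If the first call returns 0, celery cannot route pidbox replies and raises the
# InconsistencyError shown above.
print(r.exists('_kombu.binding.reply.celery.pidbox'))
print(r.smembers('_kombu.binding.celery'))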

I see the same error on ores1006. But celery is clearly still running there.

Halfak added a comment (edited). Aug 21 2019, 2:31 PM

But on ores1006, the top-level error is:

redis.exceptions.TimeoutError: Timeout reading from oresrdb.svc.eqiad.wmnet:6379

While on ores1002, the top-level error is:

  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-
packages/redis/client.py", line 3164, in parse_response
    'pubsub connection not set: '
RuntimeError: pubsub connection not set: did you forget to call subscribe() or psubscribe()?

Followed by:

  File "/srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/copy.
py", line 174, in deepcopy
    rv = reductor(4)
TypeError: can't pickle _thread.lock objects

I'm guessing that what happened is that redis had a hiccup. ores100[1,3,6,7,8,9] all hit a timeout error and recovered as soon as redis was available again. But ores100[2,4,5] hit a deeper error while trying to recover and then failed during some deep copy that hit a _thread.lock object. I'm not sure if addressing T122676: Implement sentinel for ORES production Redis would have fixed this or not. But right now, I think fixing our monitoring is going to be most important.
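
The deep copy part, at least, is easy to reproduce in isolation: deepcopy falls back to the pickle machinery (the reductor(4) frame above), which refuses lock objects. A minimal sketch, not the actual celery code path:

import copy
import threading

# Stand-in for a client/connection object that holds a lock somewhere in its state.
state = {'connection_lock': threading.Lock()}

try:
    copy.deepcopy(state)
except TypeError as e:
    # On Python 3.5 this prints: can't pickle _thread.lock objects
    print(e)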

Joe added a subscriber: Joe. Aug 22 2019, 10:11 AM

I think this is a reasonable explanation, but how would you suggest we should fix our monitoring?

Joe claimed this task. Aug 22 2019, 10:11 AM
Joe triaged this task as Normal priority.

I'm looking into what it would take to monitor a celery worker pool on a specific machine. It looks like celery has a lot of utilities that might help. See https://docs.celeryproject.org/en/latest/userguide/monitoring.html#management-command-line-utilities-inspect-control

Right now, I want to run some experiments on my dev machine to see what it looks like when celery is running/stopped.
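
For those experiments, the same checks are also reachable from Python through celery's control API, which might be easier to script than the CLI. A rough sketch, assuming the ores_celery module exposes the application object under this name (the real attribute may differ):

from ores_celery import application  # hypothetical import path for the celery app

# Broadcast a ping and wait briefly for replies. When workers are up this returns
# something like [{'celery@ores1005': {'ok': 'pong'}}]; when nothing answers within
# the timeout it returns an empty list instead of hanging forever.
replies = application.control.ping(timeout=2.0)
print(replies)

Note that this still needs the full application loaded, so it shares the startup cost of the CLI utilities.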

Halfak added a comment (edited). Aug 22 2019, 2:50 PM

So, I've been trying to explore the behavior of celery -A ores_celery inspect ping to see if it will work for us as an up-check. So far, I have determined that there is no real consistent behavior. Sometimes when I run this "ping" command, it just hangs indefinitely. Other times, it reports that celery didn't respond in time when it is actually online! I'm honestly not sure what the purpose of this utility is!

celery -A ores_celery inspect stats seems to have better behavior while celery is running, but it just hangs when celery is offline.

Worse, this utility loads our entire "application" into memory before it can function, which means it takes a huge bite out of memory and uses a ton of CPU to deserialize all of our models. So, I don't like it. I wonder if a simple check on ps aux | grep celery | grep -v grep | wc -l is better. Maybe we can set some bounds on the expected response: we should expect the # of workers plus 1, so 16 workers = 17 processes returned. @Joe, what do you think of that?
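
For concreteness, a rough sketch of what that process-count check could look like as a small NRPE-style script; the expected worker count and the match pattern are assumptions, not an existing check:

#!/usr/bin/env python3
"""Hypothetical check: count celery worker processes and compare against workers + 1."""
import subprocess
import sys

EXPECTED_WORKERS = 90  # would ideally come from the ORES config instead of being hard-coded


def count_celery_processes():
    # Rough equivalent of: ps aux | grep celery | grep -v grep | wc -l
    result = subprocess.run(['pgrep', '-c', '-f', 'celery-ores-worker'],
                            stdout=subprocess.PIPE, universal_newlines=True)
    return int(result.stdout.strip() or 0)


def main():
    expected = EXPECTED_WORKERS + 1  # prefork children plus the main process
    found = count_celery_processes()
    if found == expected:
        print('OK: %d celery processes running' % found)
        return 0
    print('CRITICAL: expected %d celery processes, found %d' % (expected, found))
    return 2


if __name__ == '__main__':
    sys.exit(main())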

Halfak added a comment. Fri, Sep 6, 9:24 PM

@Joe, do you think counting processes using ps would make for a good check?

Halfak added a subscriber: Dzahn. Wed, Sep 11, 9:11 PM

@Dzahn, what would it take to implement a check like the one I described above? Is there a way that we could reach the ORES config in the check so that we can update the # of workers in one spot only?
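
One way that could work, if the check is allowed to read the same config the service uses: load the YAML and pull the worker count from it, so the number only lives in one place. The path and key names below are guesses for illustration, not the real ORES config layout:

#!/usr/bin/env python3
"""Hypothetical helper: derive the expected process count from the ORES config."""
import yaml

CONFIG_PATH = '/etc/ores/main.yaml'  # assumed location; the real config path may differ


def expected_process_count(path=CONFIG_PATH):
    with open(path) as f:
        config = yaml.safe_load(f)
    # Guessed key path; whatever key actually holds the celery concurrency goes here.
    workers = config['celery']['worker_concurrency']
    return workers + 1  # workers plus the main celery process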