Today three celery-ores-worker services failed at around the same time on ores100[2,4,5]. From the logs the only thing that I can see is:
Aug 06 17:36:09 ores1005 systemd[1]: Stopping Celery workers...
Aug 06 17:36:11 ores1005 celery-ores-worker[23634]: worker: Warm shutdown (MainProcess)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -------------- celery@ores1005 v4.1.1 (latentcall)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: ---- **** -----
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: --- * *** * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.9 2019-07-17 20:34:30
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -- * - **** ---
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- [config]
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> app: ores.scoring_systems.celery_queue:0x7f17b5defd68
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> transport: redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - ** ---------- .> results: redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: - *** --- * --- .> concurrency: 90 (prefork)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: --- ***** -----
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: -------------- [queues]
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]: .> celery exchange=celery(direct) key=celery
Aug 06 17:36:12 ores1005 celery-ores-worker[23634]:
Aug 06 17:36:13 ores1005 systemd[1]: Stopped Celery workers.
Aug 06 17:36:13 ores1005 systemd[1]: Started Celery workers.
Aug 06 17:36:41 ores1005 celery-ores-worker[13281]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 06 17:36:42 ores1005 celery-ores-worker[13281]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 21 11:31:12 ores1005 celery-ores-worker[13281]: /srv/deployment/ores/deploy-cache/revs/d08fa628aacb82529dbb4be357b68dd55c15fdee/venv/lib/python3.5/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: h
Aug 21 11:31:12 ores1005 celery-ores-worker[13281]: 'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -------------- celery@ores1005 v4.1.1 (latentcall)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: ---- **** -----
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: --- * *** * -- Linux-4.9.0-9-amd64-x86_64-with-debian-9.9 2019-08-06 17:36:46
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -- * - **** ---
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- [config]
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> app: ores.scoring_systems.celery_queue:0x7f6dab334ba8
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> transport: redis://:**@oresrdb.svc.eqiad.wmnet:6379//
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - ** ---------- .> results: redis://:**@oresrdb.svc.eqiad.wmnet:6379/
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: - *** --- * --- .> concurrency: 90 (prefork)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: --- ***** -----
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: -------------- [queues]
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]: .> celery exchange=celery(direct) key=celery
Aug 21 11:31:14 ores1005 celery-ores-worker[13281]:
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Main process exited, code=exited, status=1/FAILURE
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Unit entered failed state.
Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Failed with result 'exit-code'.
Aug 21 11:38:28 ores1005 systemd[1]: Started Celery workers.
Aug 21 11:38:40 ores1005 celery-ores-worker[25637]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Aug 21 11:38:41 ores1005 celery-ores-worker[25637]: Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
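The code path behind the smart_open deprecation warning doesn't seem to lead to any shutdown, and the rest looks like the banner that celery emits when shutting down (the same banner appears earlier in the logs, during the Aug 06 restart). So the logs show the worker exiting with status=1/FAILURE but not why.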
Also, we were alerted by the systemd "unit failed" alert, not by any celery-specific one (not sure whether we deliberately don't alert on celery or whether we are missing a monitor).
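To compare the three hosts, it may help to strip the celery banner noise and keep only what systemd itself logged. Below is a rough sketch that does this for journal output in the short format pasted above; the function names and the regex are mine, not anything from the ORES codebase, and it assumes one journal entry per line.

```python
import re
import sys

# Matches journal lines in the short format pasted above, e.g.
# "Aug 21 11:31:16 ores1005 systemd[1]: celery-ores-worker.service: Main process exited, ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def systemd_events(lines):
    """Yield (timestamp, host, message) for entries logged by systemd itself."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("proc") == "systemd":
            yield m.group("ts"), m.group("host"), m.group("msg")


def unexpected_exits(lines):
    """Return the systemd events that report the main process exiting."""
    return [e for e in systemd_events(lines) if "Main process exited" in e[2]]


if __name__ == "__main__":
    # Reads journal text from stdin and prints only the exit events.
    for ts, host, msg in unexpected_exits(sys.stdin):
        print(ts, host, msg)
```

Saved under a hypothetical name such as filter_exits.py, it could be fed with `journalctl -u celery-ores-worker --no-pager | python3 filter_exits.py` on each of the affected hosts to line up the exit timestamps.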