Systemd services rely on cron to restart
Open, Needs Triage · Public · 5 Estimated Story Points

Description

We're finding that the systemd services, specifically ww_events_stream.service and ww_events_stream_deletion.service, may periodically get "stuck" and never recover. While they are stuck, WikiWho misses every edit it's supposed to be processing, and the affected articles return the "Requested data is not currently available in WikiWho database. It will be available soon." error. Checking the status of the services reports that all is fine, yet they are noticeably not processing any edits.
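
Since the service status looks healthy even while the service is stuck, a check would have to look at progress rather than at systemd's view of the unit. A minimal sketch, assuming the stream consumer touches a heartbeat file each time it processes an event (the file path and the 15-minute threshold are hypothetical and not taken from the WikiWho deployment):

```python
import os
import time

# Hypothetical values for illustration only.
HEARTBEAT_PATH = "/tmp/ww_events_stream.heartbeat"
MAX_SILENCE_SECONDS = 15 * 60


def is_stalled(last_progress, now, max_silence=MAX_SILENCE_SECONDS):
    """Return True if no progress was recorded for more than max_silence seconds."""
    return (now - last_progress) > max_silence


def heartbeat_is_stale(path=HEARTBEAT_PATH, max_silence=MAX_SILENCE_SECONDS):
    """Treat a missing heartbeat file as stale; otherwise compare its mtime to now."""
    try:
        last_progress = os.path.getmtime(path)
    except OSError:
        return True
    return is_stalled(last_progress, time.time(), max_silence)
```

A check like this would flag a service that is still "active" but silent, which is exactly the case the status output misses.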

We're now relying on a cronjob to restart the services daily. This works for now, but ideally we'd figure out the root of the problem.
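
The stopgap could look roughly like the script below, run once a day from cron. This is a sketch only: the task does not show the actual crontab entry, so the schedule and script path in the comment are assumptions, and the unit list is limited to the two services named above.

```python
import subprocess

# Units named in the description; the real cronjob may cover other units too.
SERVICES = ["ww_events_stream.service", "ww_events_stream_deletion.service"]

# Example crontab entry (hypothetical path and time):
#   0 3 * * * /usr/bin/python3 /path/to/restart_ww_services.py


def restart_command(unit):
    """Build the systemctl invocation that restarts a single unit."""
    return ["systemctl", "restart", unit]


def restart_all(units=SERVICES, run=subprocess.run):
    """Restart each unit in turn and return the units whose restart failed."""
    failed = []
    for unit in units:
        result = run(restart_command(unit), check=False)
        if result.returncode != 0:
            failed.append(unit)
    return failed
```

Calling `restart_all()` as root restarts both units and returns an empty list when every restart succeeded.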

Event Timeline

Restricted Application added a subscriber: Aklapper.
dmaza set the point value for this task to 5. · Aug 24 2023, 5:32 PM
tstarling renamed this task from "Systemd services abrutly failing and not restarting" to "Systemd services abruptly failing and not restarting". · Aug 30 2023, 10:41 AM

nb. whilst debugging the following changes were made:

  • [celery] Set default_task_soft_time_limit to 120s
  • [celery] (Re)set user_task_soft_time_limit to 3600s
  • [flower] Set persistent to True
  • [flower] Add exception column
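
For reference, the values above written out as plain Python settings. This is a sketch under assumptions: the celery names simply mirror the bullets, and where they live in the WikiWho code base is not shown in this task. The flower lines assume a `flowerconfig.py`-style file and that the exception column was added through Flower's `tasks_columns` option. In Celery, a soft time limit raises `SoftTimeLimitExceeded` inside the running task so it can clean up, rather than killing the worker process outright.

```python
# [celery] soft time limits, in seconds (names mirror the bullets above)
default_task_soft_time_limit = 120
user_task_soft_time_limit = 3600

# [flower] flowerconfig.py-style options
persistent = True  # keep Flower's task history across restarts
# Assumed column list: Flower's default columns plus "exception"
tasks_columns = (
    "name,uuid,state,args,kwargs,result,"
    "received,started,runtime,worker,exception"
)
```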

I've noticed that worker_long has failed at least once without recovering

Hm, if worker_default/_long/_user were independent systemd services (instead of being brought up together via ww_celery) we could be a little more refined in how we (potentially) restart/check services.
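
A rough sketch of what that finer-grained handling might look like, assuming one unit per worker. The unit names are hypothetical; today the workers are started together via ww_celery.

```python
import subprocess

# Hypothetical unit names, one per Celery worker.
WORKER_UNITS = [
    "ww_worker_default.service",
    "ww_worker_long.service",
    "ww_worker_user.service",
]


def is_active(unit, run=subprocess.run):
    """`systemctl is-active --quiet` exits with 0 only when the unit is active."""
    cmd = ["systemctl", "is-active", "--quiet", unit]
    return run(cmd, check=False).returncode == 0


def restart_inactive(units=WORKER_UNITS, run=subprocess.run):
    """Restart only the units that are not active; return the restarted ones."""
    restarted = []
    for unit in units:
        if not is_active(unit, run):
            run(["systemctl", "restart", unit], check=False)
            restarted.append(unit)
    return restarted
```

This only catches workers that systemd itself considers inactive or failed. A service that is stuck but still reported as active, as in the task description, would need a progress-based check instead.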

We received several complaints of WikiWho clients not working. Sure enough, the eventstream service and celery itself went down about a week ago. I've restarted all, and as a quick stopgap measure, I'm going to add a cronjob to restart them daily.

TheresNoTime raised the priority of this task from High to Needs Triage. · Oct 23 2023, 12:30 PM

Ah thanks @MusikAnimal — I've lowered the priority of this task, now that something is ensuring it restarts the service.
Shall we leave this open for investigation?

I'd love to figure out what's actually going wrong, but I feel we've put enough time into this.

Let's leave it open on WikiWho but take it off the Kanban. I'll also retitle accordingly.

MusikAnimal renamed this task from "Systemd services abruptly failing and not restarting" to "Systemd services rely on cron to restart". · Oct 23 2023, 9:15 PM
MusikAnimal updated the task description.