
Quarry should detect a dead worker and report something better than "running" forever
Open, Medium, Public, Bug Report

Description

When anything happens to a Celery worker, Quarry currently reports the query as "running" forever, which makes it look as though a queue job is stuck or the database is letting the query run indefinitely. This is quite misleading. There must be some way to detect that the worker is gone, dead, or broken and return something better ("killed" would work, but a more explicit status would be better).

This would promote better bug reporting when things are down and shorten time to recover. It would also help users understand what's happening in general if they ran a query that fills the system RAM or hits something else that isn't immediately fixable.

Event Timeline

Bstorm triaged this task as Medium priority. Mar 26 2021, 7:51 PM
Bstorm created this task.

Hmm. Is the goal trying to find when a worker gets SIGKILL-ed? Celery does
internally detect when a worker dies, as per the logs, but I did not figure
out how to hook it so that it would report to the db.
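
As a rough illustration of what "detecting a SIGKILL-ed worker" can look like from a supervising process, here is a minimal standard-library sketch. It is not Celery's internal mechanism and not Quarry code: `classify_exit`, `run_and_classify`, and the status strings are hypothetical names chosen for the example, and it assumes a POSIX system where `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def classify_exit(returncode: int) -> str:
    """Map a child's exit code to a status string (status names are hypothetical)."""
    if returncode == 0:
        return "complete"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative code.
        sig = signal.Signals(-returncode)
        return "killed" if sig == signal.SIGKILL else f"died ({sig.name})"
    return "failed"


def run_and_classify(argv: list[str]) -> str:
    """Run a child process to completion and report how it ended."""
    proc = subprocess.run(argv)
    return classify_exit(proc.returncode)


if __name__ == "__main__":
    # A child that SIGKILLs itself, standing in for an OOM-killed worker.
    child = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    print(run_and_classify([sys.executable, "-c", child]))
```

The point is only that the parent, not the killed child, has to write the final status, since the child never gets a chance to run any cleanup code.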

At least one occurrence of this error can be reproduced with a simple query like "Select page_title, page_title from page where page_id = 1". See T265155 or look at https://quarry.wmflabs.org/query/53652

> Hmm. Is the goal trying to find when a worker gets SIGKILL-ed? Celery does
> internally detect when a worker dies, as per the logs, but I did not figure
> out how to hook it so that it would report to the db.

When it gets SIGKILL-ed, or maybe also if other things happen, but mostly when it gets SIGKILL-ed, because that's what I see happening regularly. If there's some kind of hook to place in a try/except/raise, that'd be slick in general as well, because anything that causes an exception also causes the same confusing state that SIGKILL does.
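
The try/except/raise idea could be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not Quarry's actual code: `STATUS` is a hypothetical in-memory stand-in for the status column in Quarry's database, and `report_failure` and `run_query` are made-up names.

```python
import functools

# Hypothetical in-memory stand-in for Quarry's query-status table.
STATUS: dict[int, str] = {}


def report_failure(func):
    """Record 'failed' for the query if the wrapped function raises, then re-raise."""
    @functools.wraps(func)
    def wrapper(query_id, *args, **kwargs):
        STATUS[query_id] = "running"
        try:
            result = func(query_id, *args, **kwargs)
        except Exception as exc:
            STATUS[query_id] = f"failed: {type(exc).__name__}"
            raise  # keep the original traceback for the worker's logs
        STATUS[query_id] = "complete"
        return result
    return wrapper


@report_failure
def run_query(query_id, sql):
    """Toy query runner: raises for any SQL containing 'boom'."""
    if "boom" in sql:
        raise RuntimeError("simulated database error")
    return [("Main_Page",)]
```

A wrapper like this only covers failures that surface as Python exceptions inside the worker. SIGKILL cannot be caught by the killed process, so that case still needs detection from outside the worker.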

Thanks @Wurgl, that reproducible example should help.

Overall, this is a bug that has its own workboard column, so it seems like a place to spend some time if it has been confusing people for that many years.

Workers sometimes die unexpectedly and leave the db in the "running" state described above. One method to clean this up may be to inspect the running processes on the Quarry database. As a dead worker will not have a running process, any process that is recorded as "running" and is older than the oldest process actually running in Quarry can be updated from "running" to "failed". I believe this will clean out any not-really-running processes that are older than an hour.
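
The cleanup described above might look something like the sketch below. It uses an in-memory SQLite database purely as a stand-in so the example runs on its own. The `query_run` table, its columns, and `mark_stale_as_failed` are hypothetical, and the start time of the oldest live process is assumed to have been obtained separately (for example, from the database's process list).

```python
import sqlite3


def mark_stale_as_failed(conn: sqlite3.Connection, oldest_live_start: str) -> int:
    """Flip 'running' rows that started before the oldest live process to 'failed'.

    Returns the number of rows updated. Table and column names are hypothetical.
    """
    cur = conn.execute(
        "UPDATE query_run SET status = 'failed' "
        "WHERE status = 'running' AND started_at < ?",
        (oldest_live_start,),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_run (id INTEGER PRIMARY KEY, status TEXT, started_at TEXT)"
)
conn.executemany(
    "INSERT INTO query_run VALUES (?, ?, ?)",
    [
        (1, "running", "2021-03-26 10:00:00"),   # worker died; no live process
        (2, "running", "2021-03-26 19:30:00"),   # oldest process still alive
        (3, "complete", "2021-03-26 09:00:00"),  # finished; must not be touched
    ],
)
updated = mark_stale_as_failed(conn, oldest_live_start="2021-03-26 19:30:00")
rows = conn.execute("SELECT id, status FROM query_run ORDER BY id").fetchall()
```

Comparing against the oldest live process rather than a fixed timeout avoids marking a genuinely long-running query as failed, at the cost of leaving stale rows in place while any older real query is still running.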

Aklapper changed the subtype of this task from "Task" to "Bug Report". Jul 18 2023, 12:20 PM