Queries left in "running" state for hours
Closed, Resolved (Public)

Description

At https://quarry.wmflabs.org/query/runs/all, I noticed that the latest queries have been left in the "running" state for hours without ever being marked as "completed".

Please fix this as soon as you can.

Event Timeline

Change 675203 had a related patch set uploaded (by Bstorm; author: Bstorm):
[analytics/quarry/web@master] connection handling: correct closing of connections

https://gerrit.wikimedia.org/r/675203

Basically, I added a cleanup of the connection attribute the other day without some important safeguards, and I think this is killing workers.

Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: [2021-03-26 18:06:46,822: ERROR/ForkPoolWorker-15] Task worker.run_query[9a99fc39-d4c9-4a4a-a05e-63eb08218582] raised unexpected: Error('Already closed',)
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: Traceback (most recent call last):
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:   File "/srv/quarry/quarry/web/worker.py", line 67, in run_query
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:     repl.connection = qrun.rev.query_database
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:   File "/srv/quarry/quarry/web/replica.py", line 51, in connection
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:     self._replica.close()
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 356, in close
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:     raise err.Error("Already closed")
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: pymysql.err.Error: Already closed

Change 675203 merged by jenkins-bot:
[analytics/quarry/web@master] connection handling: correct closing of connections

https://gerrit.wikimedia.org/r/675203

Mentioned in SAL (#wikimedia-cloud) [2021-03-26T19:27:37Z] <bstorm> deploying changes to the replica class and restarting things T278544

https://quarry.wmflabs.org/query/53634 completed, and I don't see any workers killed this time.

I think I stopped the workers from dying. The code should do a better job of cleaning up connections without trying to close already-closed connections now.
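
For readers who want to see the general shape of this kind of safeguard, here is a minimal sketch. It is not the merged patch (see the Gerrit change above for that). `FakeConnection` and `AlreadyClosedError` are hypothetical stand-ins for a pymysql connection and `pymysql.err.Error` so that the example runs without a database, and the `Replica` class here is illustrative rather than the real `quarry/web/replica.py`. The idea is that the `connection` setter only closes the old connection if it still looks open, and tolerates the "Already closed" error if it races with a close elsewhere.

```python
class AlreadyClosedError(Exception):
    """Stand-in for pymysql.err.Error (assumption, so the sketch runs without pymysql)."""


class FakeConnection:
    """Hypothetical connection that, like the one in the traceback, raises on a second close()."""

    def __init__(self, host):
        self.host = host
        self.open = True

    def close(self):
        if not self.open:
            raise AlreadyClosedError("Already closed")
        self.open = False


class Replica:
    """Illustrative only; not the real Quarry replica class."""

    def __init__(self, connect=FakeConnection):
        self._connect = connect  # factory that opens a connection to a host
        self._replica = None

    def _close_quietly(self):
        # Drop our reference first so a failure cannot leave a stale handle behind.
        conn, self._replica = self._replica, None
        if conn is None or not conn.open:
            return  # nothing to close, or someone already closed it
        try:
            conn.close()
        except AlreadyClosedError:
            pass  # closed between the check and the call; safe to ignore

    @property
    def connection(self):
        return self._replica

    @connection.setter
    def connection(self, host):
        # Clean up the previous connection before opening a new one.
        self._close_quietly()
        self._replica = self._connect(host)
```

With a guard like this, assigning a new host after the old connection has already been closed elsewhere no longer raises inside the task, so the worker process survives and can mark the query as completed or failed.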

Please reopen if all queries start failing to complete. That would suggest all the workers are dying. If only a single query appears stuck, that may just mean its worker ran out of memory, which is a separate issue entirely.

If there isn't a task already around for it, I'll make one to have the web detect when workers die.