At https://quarry.wmflabs.org/query/runs/all, I noticed that the latest queries have been left in the "running" state for hours without ever being marked as "completed".
Please fix this as soon as you can.
Subject | Repo | Branch | Lines +/-
---|---|---|---
connection handling: correct closing of connections | analytics/quarry/web | master | +6 -2
Change 675203 had a related patch set uploaded (by Bstorm; author: Bstorm):
[analytics/quarry/web@master] connection handling: correct closing of connections
Basically, I added a cleanup of the connection attribute the other day without some important safeguards, and I think this is killing workers.
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: [2021-03-26 18:06:46,822: ERROR/ForkPoolWorker-15] Task worker.run_query[9a99fc39-d4c9-4a4a-a05e-63eb08218582] raised unexpected: Error('Already closed',)
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: Traceback (most recent call last):
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: File "/srv/quarry/quarry/web/worker.py", line 67, in run_query
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: repl.connection = qrun.rev.query_database
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: File "/srv/quarry/quarry/web/replica.py", line 51, in connection
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: self._replica.close()
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 356, in close
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: raise err.Error("Already closed")
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: pymysql.err.Error: Already closed
Change 675203 merged by jenkins-bot:
[analytics/quarry/web@master] connection handling: correct closing of connections
Mentioned in SAL (#wikimedia-cloud) [2021-03-26T19:27:37Z] <bstorm> deploying changes to the replica class and restarting things T278544
https://quarry.wmflabs.org/query/53634 completed, and I don't see any workers killed this time.
I think I stopped the workers from dying. The code should do a better job of cleaning up connections without trying to close already-closed connections now.
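For anyone reading along, here is a rough sketch of the kind of guard described above. It is not the merged patch: `FakeConnection`, `AlreadyClosedError`, and this `Replica` class are hypothetical stand-ins so the example runs without a database, and the real code in replica.py may be structured differently. The idea is simply to check whether the old connection is still open before closing it when a new one is assigned.

```python
class AlreadyClosedError(Exception):
    """Hypothetical stand-in for pymysql.err.Error('Already closed')."""


class FakeConnection:
    """Hypothetical connection that, like the traceback shows, raises on a second close()."""

    def __init__(self, name):
        self.name = name
        self.open = True

    def close(self):
        if not self.open:
            raise AlreadyClosedError("Already closed")
        self.open = False


class Replica:
    """Sketch of a holder that swaps connections without double-closing (assumed design)."""

    def __init__(self, connect=FakeConnection):
        self._connect = connect  # factory that opens a connection for a database name
        self._replica = None

    def _close_current(self):
        # Drop the reference first so a failed close cannot leave a stale handle behind.
        conn, self._replica = self._replica, None
        # Only close connections that exist and are still open.
        if conn is not None and conn.open:
            conn.close()

    @property
    def connection(self):
        return self._replica

    @connection.setter
    def connection(self, db):
        # Clean up the previous connection safely, then open the new one.
        self._close_current()
        self._replica = self._connect(db)

    @connection.deleter
    def connection(self):
        self._close_current()
```

With this shape, assigning `repl.connection = "some_db"` after the old connection was already closed elsewhere no longer raises, so the worker task would not die on the cleanup step.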
Please reopen if all queries start failing to complete. That would suggest all the workers are dying. If just one query gets apparently stuck, that might just mean the worker ran out of memory, which is another issue entirely.
If there isn't a task already around for it, I'll make one to have the web detect when workers die.