Queries left in "running" state for hours
Closed, Resolved · Public

Assigned To

Authored By

	GTrang
	Mar 26 2021, 1:46 PM

Description

At https://quarry.wmflabs.org/query/runs/all, I noticed that the latest queries have been left in the "running" state for hours without ever being marked as "completed".

Please fix this as soon as you can.

Details

	Subject	Repo	Branch	Lines +/-
	connection handling: correct closing of connections	analytics/quarry/web	master	+6 -2


Related Objects

Mentioned In: T274071: Quarry queries forever stuck in queue
T264254: Prepare Quarry for multiinstance wiki replicas

Event Timeline

GTrang created this task. Mar 26 2021, 1:46 PM

Restricted Application added a subscriber: Aklapper. Mar 26 2021, 1:46 PM

GTrang mentioned this in T264254: Prepare Quarry for multiinstance wiki replicas. Mar 26 2021, 1:48 PM

KylieTastic subscribed. Mar 26 2021, 3:41 PM

BrandonXLF subscribed. Mar 26 2021, 3:49 PM

Alicia_Fagerving_WMSE subscribed. Mar 26 2021, 4:31 PM

Pppery subscribed. Mar 26 2021, 5:10 PM

Bdijkstra subscribed. Mar 26 2021, 6:14 PM

CommanderWaterford subscribed. Mar 26 2021, 6:28 PM

Suriname0 subscribed. Mar 26 2021, 6:46 PM

Bstorm claimed this task. Mar 26 2021, 7:05 PM

Change 675203 had a related patch set uploaded (by Bstorm; author: Bstorm):
[analytics/quarry/web@master] connection handling: correct closing of connections

https://gerrit.wikimedia.org/r/675203

gerritbot added a project: Patch-For-Review. Mar 26 2021, 7:17 PM

Basically, I added a cleanup of the connection attribute the other day without some important safeguards, and I think this is killing workers.

Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: [2021-03-26 18:06:46,822: ERROR/ForkPoolWorker-15] Task worker.run_query[9a99fc39-d4c9-4a4a-a05e-63eb08218582] raised unexpected: Error('Already closed',)
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: Traceback (most recent call last):
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:   File "/srv/quarry/quarry/web/worker.py", line 67, in run_query
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:     repl.connection = qrun.rev.query_database
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:   File "/srv/quarry/quarry/web/replica.py", line 51, in connection
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:     self._replica.close()
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:   File "/srv/quarry/venv/lib/python3.5/site-packages/pymysql/connections.py", line 356, in close
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]:     raise err.Error("Already closed")
Mar 26 18:06:46 quarry-worker-02 celery-quarry-worker[31592]: pymysql.err.Error: Already closed

Change 675203 merged by jenkins-bot:
[analytics/quarry/web@master] connection handling: correct closing of connections

https://gerrit.wikimedia.org/r/675203

Mentioned in SAL (#wikimedia-cloud) [2021-03-26T19:27:37Z] <bstorm> deploying changes to the replica class and restarting things T278544

https://quarry.wmflabs.org/query/53634 completed, and I don't see any workers killed this time.

I think I stopped the workers from dying. The code should do a better job of cleaning up connections without trying to close already-closed connections now.
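
For readers following along, here is a minimal sketch of the kind of safeguard described above. It is not the merged patch: the `Replica` class, the `FakeConnection` stand-in, and the `_close_current` helper are hypothetical names chosen for illustration. The stand-in mimics the pymysql behavior seen in the traceback (calling `close()` on an already-closed connection raises) and assumes the connection exposes an `open` flag, as pymysql connections do.

```python
class AlreadyClosedError(Exception):
    """Stand-in for pymysql.err.Error('Already closed')."""


class FakeConnection:
    """Hypothetical stand-in for a pymysql connection (illustration only)."""

    def __init__(self, host):
        self.host = host
        self.open = True

    def close(self):
        # Mirrors the behavior in the traceback: a second close() raises.
        if not self.open:
            raise AlreadyClosedError("Already closed")
        self.open = False


class Replica:
    """Hypothetical connection holder with a guarded cleanup step."""

    def __init__(self, connect=FakeConnection):
        self._connect = connect  # assumed factory: host -> connection
        self._replica = None

    @property
    def connection(self):
        return self._replica

    @connection.setter
    def connection(self, host):
        # Clean up any previous connection before opening a new one.
        self._close_current()
        self._replica = self._connect(host)

    @connection.deleter
    def connection(self):
        self._close_current()

    def _close_current(self):
        # Only close a connection that exists and is still open, so a
        # connection closed elsewhere cannot raise and kill the worker.
        conn, self._replica = self._replica, None
        if conn is not None and conn.open:
            conn.close()
```

With this guard, assigning `repl.connection` a second time after the old connection was already closed elsewhere simply replaces it instead of raising.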

Please reopen if all queries start failing to complete. That would suggest all the workers are dying. If just one query gets apparently stuck, that might just mean the worker ran out of memory, which is another issue entirely.

If there isn't a task already around for it, I'll make one to have the web detect when workers die.

Bstorm mentioned this in T274071: Quarry queries forever stuck in queue. Mar 26 2021, 7:41 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 26 2021, 8:10 PM
