Error 500 when clicking "stop query"
Open, Needs Triage · Public · BUG REPORT

Assigned To

None

Authored By

	Novem_Linguae
	Wed, Apr 10, 6:13 AM

Description

Steps to replicate the issue (include links if applicable):

Log in as Novem Linguae
Visit https://quarry.wmcloud.org/query/81904
Click "Stop Query"

What happens?:

Error 500

What should have happened instead?:

Query stops

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Had this happen a couple times last night too
About an hour later, I clicked the "stop query" button and got error 500 again. But then I clicked the "submit query" button, which had not been working before, and the query finally got unstuck.
I also see queries with status = queued getting stuck unless the user does something to resubmit the query.

Related Objects

Duplicates Merged Here: T363644: [bug] Internal server error & backed up queue
T362891: [bug] Internal Server Error when trying to Stop Query

Event Timeline

Novem_Linguae created this task. · Wed, Apr 10, 6:13 AM

Restricted Application added a subscriber: Aklapper. · Wed, Apr 10, 6:13 AM

Novem_Linguae updated the task description. · Wed, Apr 10, 7:37 AM

Novem_Linguae updated the task description. · Wed, Apr 10, 7:44 AM

vivian-rook opened https://github.com/toolforge/quarry/pull/38

I'm guessing this issue comes from the threads being in separate pods. The attached PR removes the feature, which may be the way to go.

I don't think that's the issue. We persist the db process id in the query_run table, so even a different pod is able to execute KILL <id> on the db to get the query to stop.

The issue I suspect is that *.analytics.db.svc.eqiad.wmflabs are LB endpoints behind which there could be multiple replicas (@taavi - would you be able to confirm if this is the case?). The KILL command wouldn't work if received by a different replica than the one on which the query is running. This is also the reason the automated kills after the 30-minute timeout are unreliable nowadays.
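
To make the mechanism in this comment concrete, here is a minimal sketch of stopping a query via the persisted process id, and of what the suspected replica mismatch would look like. It is not Quarry's actual code: `QueryRun`, its `connection_id` field, `DbError`, `stop_query` and `FakeReplica` are all hypothetical names for illustration. The only real-world detail relied on is that MySQL/MariaDB report error code 1094 ("Unknown thread id") when KILL targets an id that does not exist on that server.

```python
from dataclasses import dataclass

ER_NO_SUCH_THREAD = 1094  # MySQL/MariaDB error code for "Unknown thread id"


class DbError(Exception):
    """Hypothetical stand-in for a driver error carrying a server error code."""

    def __init__(self, code, message):
        super().__init__(code, message)
        self.code = code


@dataclass
class QueryRun:
    """Hypothetical query_run row; only the persisted process id matters here."""
    id: int
    connection_id: int


def stop_query(run, execute):
    """Send KILL for the persisted process id through `execute`.

    `execute` is any callable that sends one SQL statement to a replica.
    An unknown thread id is reported as "not_found" instead of propagating
    as an unhandled exception; any other error is re-raised.
    """
    try:
        execute(f"KILL {int(run.connection_id)}")
    except DbError as exc:
        if exc.code == ER_NO_SUCH_THREAD:
            return "not_found"
        raise
    return "killed"


class FakeReplica:
    """In-memory stand-in for one replica and the thread ids it is running."""

    def __init__(self, thread_ids):
        self.thread_ids = set(thread_ids)

    def execute(self, sql):
        thread_id = int(sql.split()[1])
        if thread_id not in self.thread_ids:
            raise DbError(ER_NO_SUCH_THREAD, f"Unknown thread id: {thread_id}")
        self.thread_ids.remove(thread_id)
```

With `run = QueryRun(id=1, connection_id=42)`, calling `stop_query(run, FakeReplica([42]).execute)` takes the "killed" path, while sending the same KILL to a replica that is not running thread 42 takes the "not_found" path. Whether an unhandled error of this kind is what produces the 500 here is an assumption, not something confirmed in this task.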

In T362213#9705787, @SD0001 wrote:

The issue I suspect is that *.analytics.db.svc.eqiad.wmflabs are LB endpoints behind which there could be multiple replicas (@taavi - would you be able to confirm if this is the case?).

No, not usually. (The long answer is that there's generally one analytics and one web replica per section, and while maintenance sometimes means that we send all traffic to a single server, it should not lead to a situation where a single endpoint is routed to multiple replicas at the same time.)

SD0001 merged a task: T362891: [bug] Internal Server Error when trying to Stop Query. · Thu, Apr 18, 2:45 PM

SD0001 added subscribers: Ahecht, komla, dcaro.

rook merged a task: T363644: [bug] Internal server error & backed up queue. · Mon, Apr 29, 7:54 PM

rook added a subscriber: Tom.Reding.
