
Error 500 when clicking "stop query"
Open, Needs Triage | Public | BUG REPORT

Assigned To
None
Authored By
Novem_Linguae
Wed, Apr 10, 6:13 AM
Referenced Files
F45536070: image.png
Wed, Apr 10, 6:13 AM
F45536300: image.png
Wed, Apr 10, 6:13 AM
F45536322: image.png
Wed, Apr 10, 6:13 AM

Description

Steps to replicate the issue (include links if applicable):

What happens?:

  • image.png (181×558 px, 9 KB)
  • image.png (184×335 px, 3 KB)
  • image.png (558×747 px, 28 KB)

What should have happened instead?:

  • Query stops

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

  • Had this happen a couple times last night too
  • About an hour later, I clicked the "stop query" button and got error 500 again. But then I clicked the "submit query" button, which had not been working before, and the query finally got unstuck.
  • I also see queries with status = queued getting stuck unless the user does something to resubmit the query.

Event Timeline

I'm guessing this issue comes from the threads being in separate pods. The attached PR removes the feature, which may be the way to go.

I don't think that's the issue. We persist the db process id in the query_run table, so even a different pod is able to execute KILL <id> on the db to get the query to stop.
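A minimal sketch of the mechanism described above, assuming a DB-API style connection (for example one opened with pymysql) and that the process id stored in query_run has already been read as a number. The function names `build_kill_statement` and `stop_query` are hypothetical and are not taken from the project's codebase.

```python
def build_kill_statement(process_id):
    """Return a KILL statement for a stored database process id."""
    # Coerce to int so that nothing but a number is interpolated into the SQL.
    return "KILL {:d}".format(int(process_id))


def stop_query(conn, process_id):
    """Send KILL <id> over `conn`; return True if the server accepted it."""
    cursor = conn.cursor()
    try:
        cursor.execute(build_kill_statement(process_id))
        return True
    except Exception:
        # For example, the id is unknown to the server that received the statement.
        return False
    finally:
        cursor.close()
```

Because only the numeric id is needed, any pod that can open a connection to the same database server could call `stop_query`, regardless of which pod started the query.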

The issue I suspect is that *.analytics.db.svc.eqiad.wmflabs are LB endpoints behind which there could be multiple replicas (@taavi - would you be able to confirm if this is the case?). The KILL command wouldn't work if it were received by a different replica than the one on which the query is running. This is also the reason the automated kills after the 30-minute timeout are unreliable nowadays.

> The issue I suspect is that *.analytics.db.svc.eqiad.wmflabs are LB endpoints behind which there could be multiple replicas (@taavi - would you be able to confirm if this is the case?).

No, not usually. (The long answer is that there's generally one analytics and one web replica per section, and while maintenance sometimes means that we send all traffic to a single server, it should not lead to a situation where a single endpoint is routed to multiple replicas at the same time.)
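The routing question discussed above could also be checked from the client side. The sketch below is one possible diagnostic, not something taken from the task: it asks each new connection which server it reached, using the MySQL/MariaDB expressions `@@hostname` and `CONNECTION_ID()`. The helper names `identify_backend` and `distinct_backends`, and the caller-supplied `connect` factory, are hypothetical. It assumes a DB-API connection whose cursor returns rows as tuples.

```python
def identify_backend(conn):
    """Return (hostname, connection_id) for the server behind `conn`."""
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT @@hostname, CONNECTION_ID()")
        hostname, connection_id = cursor.fetchone()
        return hostname, int(connection_id)
    finally:
        cursor.close()


def distinct_backends(connect, attempts=5):
    """Open `attempts` connections via `connect()` and collect the hostnames seen."""
    hostnames = set()
    for _ in range(attempts):
        conn = connect()
        try:
            hostname, _connection_id = identify_backend(conn)
            hostnames.add(hostname)
        finally:
            conn.close()
    return hostnames
```

If `distinct_backends` returned more than one hostname for a single endpoint, a KILL sent over a fresh connection could land on a server that is not running the query. A single hostname would be consistent with the answer above.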