Quarry is degraded/partially inaccessible
Closed, Resolved · Public · BUG REPORT

Description

  • https://quarry.wmcloud.org/query/runs/all returns a 500 Internal Server Error (HAR).
  • Some queries execute normally while others are stuck in "queued". I've found no pattern in which queries run and which get stuck. Attempting to stop a queued query returns the same 500 response as in the HAR above.

Related Objects

Status | Subtype | Assigned | Task
Resolved | BUG REPORT | Andrew |
Resolved | | Bstorm |

Event Timeline

Restricted Application added a subscriber: Aklapper.

This is probably fixed by https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/716793 but it might be only partial -- can you retest?

I tried executing four queries to test the changes (56472, 48083, 55025, and 52754). 52754 remains stuck in "queued" and 48083 is stuck in "running" after 20 minutes (the query normally completes in 70 seconds). It seems like some other queries (see https://quarry.wmcloud.org/query/runs/all) are also getting stuck in "queued".

I'm sorry this is misbehaving. I just tried re-running one of your queries and it worked:

https://quarry.wmcloud.org/query/58317

This has me still thinking that this is some kind of orphan/stuck condition for those exact queries rather than a general breakage. I'll see if I can get them unstuck!

The current theory is that this is related to database timeouts, which we will adjust shortly.

Mentioned in SAL (#wikimedia-cloud) [2021-09-03T16:45:02Z] <bstorm> set live wait_timeout variable to 28800 (the default) on the trove instance T290291
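For readers unfamiliar with the setting, here is a minimal sketch of what changing `wait_timeout` live usually looks like on a MySQL/MariaDB server. The SAL entry does not show the exact command that was run, so the statements below are an assumption based on standard SQL syntax; the helper names are hypothetical, and only the value 28800 comes from the log entry.

```python
# Sketch only: builds the SQL one would typically run to inspect and change
# wait_timeout on a live MySQL/MariaDB server. The exact command used on the
# trove instance is not shown in the log; these helpers are hypothetical.

DEFAULT_WAIT_TIMEOUT_S = 28800  # value quoted in the SAL entry (8 hours)


def show_wait_timeout_sql() -> str:
    """Statement that reports the current global wait_timeout."""
    return "SHOW GLOBAL VARIABLES LIKE 'wait_timeout';"


def set_wait_timeout_sql(seconds: int = DEFAULT_WAIT_TIMEOUT_S) -> str:
    """Statement that sets the global wait_timeout without a restart."""
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("wait_timeout must be a positive integer of seconds")
    return f"SET GLOBAL wait_timeout = {seconds};"


if __name__ == "__main__":
    print(show_wait_timeout_sql())
    print(set_wait_timeout_sql())
```

Note that `SET GLOBAL` generally applies only to connections opened after the change; sessions that already exist keep their own value. That may be part of why queries that were already stuck did not recover on their own.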

Hello again @Chlod. We've adjusted some timeouts, which were probably the cause of the queued/running-forever issue. Since those queries are now orphaned, they will probably not update their states automatically, but if you re-run them I expect things to work.
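As an illustration of what "orphaned" means here, the sketch below flags runs that have sat in "queued" or "running" longer than a cutoff. It is not Quarry's actual code: the `QueryRun` record, its fields, and `find_stale_runs` are hypothetical, and the 20-minute cutoff is simply the figure mentioned earlier in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# States in which a run is still waiting on a worker (names taken from the thread).
PENDING_STATES = ("queued", "running")


@dataclass
class QueryRun:
    """Hypothetical record for one query run; not Quarry's real schema."""
    run_id: int
    status: str
    last_update: datetime


def find_stale_runs(
    runs: List[QueryRun],
    now: datetime,
    max_age: timedelta = timedelta(minutes=20),
) -> List[QueryRun]:
    """Return runs still pending whose last update is older than max_age."""
    return [
        run
        for run in runs
        if run.status in PENDING_STATES and now - run.last_update > max_age
    ]
```

Runs flagged this way would need to be re-run by hand, as suggested above, since nothing is left to move them to a finished state.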

Works like a charm. Thanks, @Andrew!