Quarry down - web service unreachable
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • I try to run my queries on Quarry and get an error message. It's been like this for several hours. I posted about it on En.wiki WP:VPT but no response there.

What happens?:
I get this message:

Error
This web service cannot be reached. Please contact a maintainer of this project.

Maintainers can find troubleshooting instructions from our documentation on Wikitech.

proxy-5.project-proxy.eqiad1.wikimedia.cloud

What should have happened instead?:
Queries should have run normally

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):
Chrome, PC, ordinary system

Event Timeline

RhinosF1 renamed this task from "Is Quarry down?" to "Quarry down - web service unreachable". May 25 2025, 7:09 AM

04:32:28 <wmcs-alerts> FIRING: [2x] TargetDown: Job app is unreachable in project quarry instance quarry.wmcloud.org:443  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DTargetDown
04:32:39 <wmcs-alerts> FIRING: QuarryDown: Quarry application is unreachable  - https://prometheus-alerts.wmcloud.org/?q=alertname%3DQuarryDown

So it's been down just over 3.5 hours

taavi subscribed.

One of the workers froze and took down the Redis instance with it; rebooting that worker solved it. Leaving this open to figure out on Monday why that happened.

Mentioned in SAL (#wikimedia-cloud) [2025-05-25T07:56:28Z] <taavi> reboot quarry-127b-3lqizumia4xn-node-1 T395201

Filed subtasks for the issues I've found. Closing.

Looks like this is resolved. Many thanks!

It's happening again! The exact same thing is happening again.

Okay, this glitch got fixed late last night but now it's down again! What is the problem here?

Edit: @taavi, can you look at this?

Just manually restarted Redis (which was in a 'completed' state) and the stuck web pods (one was complaining of exceeding tmp usage, the other of Redis being down). Things seem to be back up for now.
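
Since both outages involved Redis becoming unavailable, a small liveness probe could help confirm whether Redis is answering before or after a restart. The following is only a rough sketch, not how Quarry is actually monitored: the function names, the default port 6379, and the host name in the usage note are assumptions for illustration. It sends an inline `PING` over a plain TCP socket and checks for the `+PONG` reply; a Redis instance that requires authentication would answer with an error instead, which this sketch reports as "not up".

```python
import socket


def is_pong(reply: bytes) -> bool:
    """Return True if the bytes are a RESP simple-string PONG reply."""
    return reply.strip() == b"+PONG"


def redis_is_up(host: str, port: int = 6379, timeout: float = 2.0) -> bool:
    """Send an inline PING and report whether Redis answers with PONG.

    Hypothetical helper: host and port are placeholders, not Quarry's
    real configuration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return is_pong(conn.recv(64))
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling `redis_is_up("redis")` (with `"redis"` standing in for whatever host name the deployment really uses) returns `False` both when the instance is unreachable and when it replies with anything other than `+PONG`.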

Okay, it's working again. Let's hope this doesn't happen again tomorrow. Thanks!