
High amount of 503/504 for swift uploads
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Upload a file and receive an error
  • 04369: finalize/189> Still waiting for server to publish uploaded file
  • 04374: FAILED: stashfailed: An unknown error occurred in storage backend "local-swift-codfw".

What happens?:
Swift is returning a lot of 503s
https://grafana.wikimedia.org/goto/LtAe0rSHR?orgId=1

What should have happened instead?:
No 503s should be happening

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Screenshot 2024-12-23 at 13.43.37.png (817×1 px, 197 KB)

Previous occurrences of this are:

Event Timeline

TheDJ raised the priority of this task from High to Unbreak Now! (Dec 23 2024, 12:41 PM)
BCornwall changed the task status from Open to In Progress. (Dec 23 2024, 12:42 PM)

Change #1106303 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Swift: Remove ms-be2075 from prod hosts

https://gerrit.wikimedia.org/r/1106303

Oh, I initially posted the wrong link and the wrong screenshot; I copied them from the wrong browser tab :D

This is the Grafana log for this particular incident.
https://grafana.wikimedia.org/goto/LtAe0rSHR?orgId=1

ms-be2075 has a data link reset a few times a minute, suggesting a bad cable:

[...]
Dec 23 12:42:54 ms-be2075 kernel: sd 0:0:24:0: Power-on or device reset occurred
Dec 23 12:43:13 ms-be2075 kernel: sd 0:0:17:0: Power-on or device reset occurred
Dec 23 12:43:31 ms-be2075 kernel: sd 0:0:10:0: Power-on or device reset occurred
Dec 23 12:44:08 ms-be2075 kernel: sd 0:0:25:0: Power-on or device reset occurred
Dec 23 12:44:20 ms-be2075 kernel: sd 0:0:13:0: Power-on or device reset occurred
[...]
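To make the "few times a minute" observation easier to check, here is a minimal sketch that tallies the reset messages per minute and per SCSI address. It uses only the five log lines quoted above; the regular expression and the `summarize_resets` helper are illustrative assumptions, not a tool used in this task.

```python
import re
from collections import Counter

# The five kernel lines quoted in the comment above (the "[...]" parts are elided there too).
LOG = """\
Dec 23 12:42:54 ms-be2075 kernel: sd 0:0:24:0: Power-on or device reset occurred
Dec 23 12:43:13 ms-be2075 kernel: sd 0:0:17:0: Power-on or device reset occurred
Dec 23 12:43:31 ms-be2075 kernel: sd 0:0:10:0: Power-on or device reset occurred
Dec 23 12:44:08 ms-be2075 kernel: sd 0:0:25:0: Power-on or device reset occurred
Dec 23 12:44:20 ms-be2075 kernel: sd 0:0:13:0: Power-on or device reset occurred
"""

# Hypothetical pattern: captures the timestamp down to the minute and the SCSI address.
RESET_RE = re.compile(
    r"^(?P<minute>\w{3} +\d+ \d{2}:\d{2}):\d{2} (?P<host>\S+) kernel: "
    r"sd (?P<dev>[\d:]+): Power-on or device reset occurred$"
)


def summarize_resets(text):
    """Return (resets per minute, set of SCSI addresses that reported a reset)."""
    per_minute = Counter()
    devices = set()
    for line in text.splitlines():
        match = RESET_RE.match(line.strip())
        if match:
            per_minute[match.group("minute")] += 1
            devices.add(match.group("dev"))
    return per_minute, devices


per_minute, devices = summarize_resets(LOG)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
print(len(devices), "distinct SCSI addresses:", sorted(devices))
```

In this excerpt every reset comes from a different SCSI address, which is consistent with a shared component such as a cable rather than a single failing disk, although five lines are too few to prove it.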
TheDJ renamed this task from High amount of 503 for swift uploads to High amount of 503/504 for swift uploads. (Dec 23 2024, 12:47 PM)
TheDJ updated the task description.

Change #1106303 merged by BCornwall:

[operations/puppet@production] Swift: Mark ms-be2075 as failed, remove from prod

https://gerrit.wikimedia.org/r/1106303

ms-be2075 will be effectively removed from the ring (weights set to 0), but there is a small snag: Swift rings enforce a minimum time between changes for data-integrity reasons, and the next window in which the change can be applied is 20:15 UTC. Unfortunately, we'll need to wait.
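The waiting period comes down to simple date arithmetic. The sketch below assumes the minimum interval is expressed in whole hours (Swift's ring builder calls this setting `min_part_hours`). The last-change time, the 12-hour interval, and the "now" value are all hypothetical: none of them is taken from this task, and they were picked only so that the result lands on the 20:15 UTC mentioned above.

```python
from datetime import datetime, timedelta, timezone


def next_ring_change(last_change, min_part_hours):
    """Earliest time at which another ring change would be allowed."""
    return last_change + timedelta(hours=min_part_hours)


def wait_remaining(now, last_change, min_part_hours):
    """Time left until the next change is allowed (zero if already allowed)."""
    remaining = next_ring_change(last_change, min_part_hours) - now
    return max(remaining, timedelta(0))


# Hypothetical inputs, not taken from the task.
last_change = datetime(2024, 12, 23, 8, 15, tzinfo=timezone.utc)
min_part_hours = 12
now = datetime(2024, 12, 23, 13, 0, tzinfo=timezone.utc)

print("next change allowed at:", next_ring_change(last_change, min_part_hours).isoformat())
print("time left to wait:", wait_remaining(now, last_change, min_part_hours))
```

Clamping the result at zero means the helper can also be called after the window has opened without returning a negative duration.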

The depool won't entirely help (writes always go to both clusters), but diverting read traffic to eqiad swift should help mitigate user impact a bit. We should restore it before US staff stop work at the end of today, though.

Ehm, is this a problem? Or a side effect of the depool taking effect after that 20:15 window?

Screenshot 2024-12-23 at 21.08.51.png (1×2 px, 221 KB)

@TheDJ That was a result of a separate issue that is now resolved (it's been quite a day for swift!)

(it's been quite a day for swift!)

@BCornwall let's just hope then that it had to get this out of its system before Christmas and now it's done :)

BCornwall claimed this task.

This should be fixed now that ms-be2075 is taken out of the ring. Thanks to @MatthewVernon for doing all the heavy lifting.