(Note: we have disabled this server, so this doesn't need to be acted on today, unless someone happens to be in the DC anyway.)
Starting on Sunday we saw Swift errors (also reported at https://phabricator.wikimedia.org/T382705), which could ultimately be tracked down to frequent disk resets on ms-be2075:
[Dec23 12:11] sd 0:0:23:0: Power-on or device reset occurred
[ +16.749465] sd 0:0:23:0: Power-on or device reset occurred
[Dec23 12:12] sd 0:0:25:0: Power-on or device reset occurred
[ +14.749575] sd 0:0:2:0: Power-on or device reset occurred
[Dec23 12:13] sd 0:0:8:0: Power-on or device reset occurred
[ +22.702457] sd 0:0:17:0: Power-on or device reset occurred
[Dec23 12:14] sd 0:0:24:0: Power-on or device reset occurred
[Dec23 12:15] sd 0:0:24:0: Power-on or device reset occurred
[Dec23 12:16] sd 0:0:25:0: Power-on or device reset occurred
[ +25.749153] sd 0:0:8:0: Power-on or device reset occurred
[Dec23 12:17] sd 0:0:5:0: Power-on or device reset occurred
[ +9.679077] sd 0:0:5:0: Power-on or device reset occurred
[ +13.499560] sd 0:0:12:0: Power-on or device reset occurred
[ +16.749480] sd 0:0:12:0: Power-on or device reset occurred
[Dec23 12:18] sd 0:0:25:0: Power-on or device reset occurred
[ +51.651277] sd 0:0:24:0: Power-on or device reset occurred
[ +0.000003] sd 0:0:25:0: Power-on or device reset occurred
[Dec23 12:19] sd 0:0:24:0: Power-on or device reset occurred
[ +30.749023] sd 0:0:24:0: Power-on or device reset occurred
[ +16.249484] sd 0:0:24:0: Power-on or device reset occurred
[Dec23 12:20] sd 0:0:16:0: Power-on or device reset occurred
[ +0.000005] sd 0:0:3:0: Power-on or device reset occurred
[Dec23 12:21] sd 0:0:19:0: Power-on or device reset occurred
[ +11.499634] sd 0:0:25:0: Power-on or device reset occurred
[Dec23 12:22] sd 0:0:17:0: Power-on or device reset occurred
[ +45.998619] sd 0:0:18:0: Power-on or device reset occurred
[ +11.999585] sd 0:0:3:0: Power-on or device reset occurred
[Dec23 12:23] sd 0:0:24:0: Power-on or device reset occurred
[ +41.748701] sd 0:0:14:0: Power-on or device reset occurred
[Dec23 12:24] sd 0:0:21:0: Power-on or device reset occurred
[ +4.366527] sd 0:0:24:0: Power-on or device reset occurred
[ +2.249937] sd 0:0:2:0: Power-on or device reset occurred
[ +30.248988] sd 0:0:9:0: Power-on or device reset occurred
[ +16.499510] sd 0:0:17:0: Power-on or device reset occurred
[Dec23 12:26] sd 0:0:15:0: Power-on or device reset occurred
[ +16.499507] sd 0:0:17:0: Power-on or device reset occurred
[ +7.793269] sd 0:0:25:0: Power-on or device reset occurred
[Dec23 12:28] sd 0:0:0:0: Power-on or device reset occurred
[ +12.287019] sd 0:0:25:0: Power-on or device reset occurred
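To see how the resets in the kernel log excerpt above are spread across disks, they can be tallied per SCSI address. This is only a minimal sketch, not something that was run for this task: the function name `count_resets` and the three-message sample are illustrative assumptions, and the regular expression only matches the exact message text shown above.

```python
import re
from collections import Counter

# Captures the SCSI address (host:channel:target:lun) of each reset message.
# Assumption: messages look exactly like the ones quoted in this task.
RESET_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def count_resets(log_text):
    """Return a Counter mapping SCSI address -> number of reset messages.

    Works whether the messages are one per line or run together on one line.
    """
    return Counter(RESET_RE.findall(log_text))


# Illustrative sample: the first three messages from the excerpt.
sample = (
    "[Dec23 12:11] sd 0:0:23:0: Power-on or device reset occurred\n"
    "[ +16.749465] sd 0:0:23:0: Power-on or device reset occurred\n"
    "[Dec23 12:12] sd 0:0:25:0: Power-on or device reset occurred\n"
)

# Print the addresses, most frequently reset first.
for address, resets in count_resets(sample).most_common():
    print(f"{resets} reset(s) on {address}")
```

Feeding the full kernel log to `count_resets` instead of the sample would show whether the resets cluster on a few targets or are spread across the whole enclosure.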
There are no errors flagged in SEL:
racadm>>racadm getsel
Record: 1
Date/Time: 11/22/2023 19:43:45
Source: system
Severity: Ok
Description: Log cleared.
So it seems likely that the disks themselves are fine and that this is caused by broken power supply connections to the disks. Maybe we can start by reseating all connectors (or swapping them, if we have sufficient replacements around)?