Write the description below
Paged, from alertmanager:
Icinga/Persistent high iowait
summary: 8.155 ge 5
3 minutes ago
instance: labstore1004
team: wmcs
@cluster: wikimedia.org
Paged, from alertmanager:
Icinga/Persistent high iowait
summary: 8.155 ge 5
3 minutes ago
instance: labstore1004
team: wmcs
@cluster: wikimedia.org
DRBD seems up to date:
root@labstore1004:~# cat /proc/drbd version: 8.4.7 (api:1/proto:86-101) srcversion: F7D2F0C9036CD0E796D5958 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r----- ns:0 nr:0 dw:16 dr:8859890 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r----- ns:1130241492 nr:0 dw:1366066536 dr:974700686 al:2416751 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r----- ns:977540684 nr:0 dw:1472572156 dr:1075083826 al:91779529 bm:0 lo:25 pe:0 ua:0 ap:25 ep:1 wo:f oos:0
Though there's lots of errors like this in dmesg:
[Fri Sep 3 05:13:27 2021] block drbd4: Remote failed to finish a request within 42240ms > ko-count (7) * timeout (60 * 0.1s) [Fri Sep 3 05:13:27 2021] drbd tools: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) [Fri Sep 3 05:13:27 2021] block drbd4: new current UUID 77FE0AE3F4F3C5FF:A6D38D85296346F3:35621C544D8B2025:35611C544D8B2025 [Fri Sep 3 05:13:27 2021] drbd tools: ack_receiver terminated [Fri Sep 3 05:13:27 2021] drbd tools: Terminating drbd_a_tools [Fri Sep 3 05:13:27 2021] drbd tools: Connection closed [Fri Sep 3 05:13:27 2021] drbd tools: conn( Timeout -> Unconnected ) [Fri Sep 3 05:13:27 2021] drbd tools: receiver terminated [Fri Sep 3 05:13:27 2021] drbd tools: Restarting receiver thread [Fri Sep 3 05:13:27 2021] drbd tools: receiver (re)started [Fri Sep 3 05:13:27 2021] drbd tools: helper command: /sbin/drbdadm fence-peer tools [Fri Sep 3 05:13:27 2021] drbd tools: conn( Unconnected -> WFConnection ) [Fri Sep 3 05:13:27 2021] drbd tools: helper command: /sbin/drbdadm fence-peer tools exit code 0 (0x0) [Fri Sep 3 05:13:27 2021] drbd tools: fence-peer helper broken, returned 0 [Fri Sep 3 05:13:43 2021] drbd tools: Handshake successful: Agreed network protocol version 101 [Fri Sep 3 05:13:43 2021] drbd tools: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME. [Fri Sep 3 05:13:43 2021] drbd tools: Peer authenticated using 20 bytes HMAC [Fri Sep 3 05:13:43 2021] drbd tools: conn( WFConnection -> WFReportParams ) [Fri Sep 3 05:13:43 2021] drbd tools: Starting ack_recv thread (from drbd_r_tools [52541]) [Fri Sep 3 05:13:43 2021] block drbd4: drbd_sync_handshake: [Fri Sep 3 05:13:43 2021] block drbd4: self 77FE0AE3F4F3C5FF:A6D38D85296346F3:35621C544D8B2025:35611C544D8B2025 bits:24190 flags:0 [Fri Sep 3 05:13:43 2021] block drbd4: peer A6D38D85296346F2:0000000000000000:35621C544D8B2024:35611C544D8B2025 bits:0 flags:0 [Fri Sep 3 05:13:43 2021] block drbd4: uuid_compare()=1 by rule 70 [Fri Sep 3 05:13:43 2021] block drbd4: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) [Fri Sep 3 05:13:43 2021] block drbd4: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 6324(2), total 6324; compression: 100.0% [Fri Sep 3 05:13:43 2021] block drbd4: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 6324(2), total 6324; compression: 100.0% [Fri Sep 3 05:13:43 2021] block drbd4: helper command: /sbin/drbdadm before-resync-source minor-4 [Fri Sep 3 05:13:44 2021] block drbd4: helper command: /sbin/drbdadm before-resync-source minor-4 exit code 0 (0x0) [Fri Sep 3 05:13:44 2021] block drbd4: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) [Fri Sep 3 05:13:44 2021] block drbd4: Began resync as SyncSource (will sync 96812 KB [24203 bits set]). [Fri Sep 3 05:13:44 2021] block drbd4: updated sync UUID 77FE0AE3F4F3C5FF:A6D48D85296346F3:A6D38D85296346F3:35621C544D8B2025 [Fri Sep 3 05:13:50 2021] block drbd4: Resync done (total 6 sec; paused 0 sec; 16132 K/sec) [Fri Sep 3 05:13:50 2021] block drbd4: 2 % had equal checksums, eliminated: 2472K; transferred 94340K total 96812K [Fri Sep 3 05:13:50 2021] block drbd4: updated UUIDs 77FE0AE3F4F3C5FF:0000000000000000:A6D48D85296346F3:A6D38D85296346F3 [Fri Sep 3 05:13:50 2021] block drbd4: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) [Fri Sep 3 05:13:50 2021] drbd tools: helper command: /sbin/drbdadm unfence-peer tools [Fri Sep 3 05:13:50 2021] drbd tools: helper command: /sbin/drbdadm unfence-peer tools exit code 0 (0x0)
It might be relevant also that the time and the kernel time are different (dmesg entries are in the future):
root@labstore1004:~# dmesg -T ... [Fri Sep 3 05:13:50 2021] drbd tools: helper command: /sbin/drbdadm unfence-peer tools exit code 0 (0x0) root@labstore1004:~# date Fri Sep 3 05:05:27 UTC 2021
From the graphs (https://grafana-rw.wikimedia.org/d/000000568/labstore1004-1005-1006-1007?orgId=1) it seems that labstore1005, the pair for drbd4:
root@labstore1004:~# grep -B 1 drbd4 /etc/drbd.d/* /etc/drbd.d/tools.res- on labstore1004 { /etc/drbd.d/tools.res: device /dev/drbd4; -- /etc/drbd.d/tools.res- on labstore1005 { /etc/drbd.d/tools.res: device /dev/drbd4;
Had some trouble at ~02:30 UTC. From the journalctl on labstore1004 that's when the above errors began.
Looking labstore1005
On labstore1005 side:
Sep 03 02:29:42 labstore1005 kernel: drbd tools: meta connection shut down by peer. Sep 03 02:29:42 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) Sep 03 02:29:52 labstore1005 kernel: drbd misc: meta connection shut down by peer. Sep 03 02:29:52 labstore1005 kernel: drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
On dmesg there's many suspicious logs about the scisi driver, though those started yesterday, it might be related:
[Thu Sep 2 09:06:28 2021] drbd tools: meta connection shut down by peer. [Thu Sep 2 09:06:28 2021] drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) [Thu Sep 2 09:06:29 2021] megaraid_sas 0000:03:00.0: Iop2SysDoorbellIntfor scsi0 [Thu Sep 2 09:06:30 2021] megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0. [Thu Sep 2 09:06:30 2021] megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0. [Thu Sep 2 09:06:31 2021] drbd misc: meta connection shut down by peer. [Thu Sep 2 09:06:31 2021] drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) [Thu Sep 2 09:06:40 2021] megaraid_sas 0000:03:00.0: Waiting for FW to come to ready state [Thu Sep 2 09:06:49 2021] megaraid_sas 0000:03:00.0: FW now in Ready state [Thu Sep 2 09:06:49 2021] megaraid_sas 0000:03:00.0: Current firmware maximum commands: 928 LDIO threshold: 0 [Thu Sep 2 09:06:49 2021] megaraid_sas 0000:03:00.0: FW supports sync cache : No [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: Init cmd success [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: firmware type : Extended VD(240 VD)firmware [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: controller type : MR(1024MB) [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR) : Enabled [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: Secure JBOD support : No [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4980 [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: Reset successful for scsi0. [Thu Sep 2 09:06:50 2021] megaraid_sas 0000:03:00.0: 67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal error and was reset
it seems that the hardware is starting to fail.