Page MenuHomePhabricator

Icinga/Persistent high iowait - labstore1004
Closed, DuplicatePublic

Description

Write the description below

Paged, from alertmanager:

Icinga/Persistent high iowait
summary: 8.155 ge 5
3 minutes ago
instance: labstore1004
team: wmcs
@cluster: wikimedia.org

Event Timeline

dcaro triaged this task as High priority.Sep 3 2021, 5:07 AM
dcaro created this task.

DRBD seems up to date:

root@labstore1004:~# cat /proc/drbd
version: 8.4.7 (api:1/proto:86-101)
srcversion: F7D2F0C9036CD0E796D5958

 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r-----
    ns:0 nr:0 dw:16 dr:8859890 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r-----
    ns:1130241492 nr:0 dw:1366066536 dr:974700686 al:2416751 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate B r-----
    ns:977540684 nr:0 dw:1472572156 dr:1075083826 al:91779529 bm:0 lo:25 pe:0 ua:0 ap:25 ep:1 wo:f oos:0

Though there's lots of errors like this in dmesg:

[Fri Sep  3 05:13:27 2021] block drbd4: Remote failed to finish a request within 42240ms > ko-count (7) * timeout (60 * 0.1s)
[Fri Sep  3 05:13:27 2021] drbd tools: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
[Fri Sep  3 05:13:27 2021] block drbd4: new current UUID 77FE0AE3F4F3C5FF:A6D38D85296346F3:35621C544D8B2025:35611C544D8B2025
[Fri Sep  3 05:13:27 2021] drbd tools: ack_receiver terminated
[Fri Sep  3 05:13:27 2021] drbd tools: Terminating drbd_a_tools
[Fri Sep  3 05:13:27 2021] drbd tools: Connection closed
[Fri Sep  3 05:13:27 2021] drbd tools: conn( Timeout -> Unconnected )
[Fri Sep  3 05:13:27 2021] drbd tools: receiver terminated
[Fri Sep  3 05:13:27 2021] drbd tools: Restarting receiver thread
[Fri Sep  3 05:13:27 2021] drbd tools: receiver (re)started
[Fri Sep  3 05:13:27 2021] drbd tools: helper command: /sbin/drbdadm fence-peer tools
[Fri Sep  3 05:13:27 2021] drbd tools: conn( Unconnected -> WFConnection )
[Fri Sep  3 05:13:27 2021] drbd tools: helper command: /sbin/drbdadm fence-peer tools exit code 0 (0x0)
[Fri Sep  3 05:13:27 2021] drbd tools: fence-peer helper broken, returned 0
[Fri Sep  3 05:13:43 2021] drbd tools: Handshake successful: Agreed network protocol version 101
[Fri Sep  3 05:13:43 2021] drbd tools: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
[Fri Sep  3 05:13:43 2021] drbd tools: Peer authenticated using 20 bytes HMAC
[Fri Sep  3 05:13:43 2021] drbd tools: conn( WFConnection -> WFReportParams )
[Fri Sep  3 05:13:43 2021] drbd tools: Starting ack_recv thread (from drbd_r_tools [52541])
[Fri Sep  3 05:13:43 2021] block drbd4: drbd_sync_handshake:
[Fri Sep  3 05:13:43 2021] block drbd4: self 77FE0AE3F4F3C5FF:A6D38D85296346F3:35621C544D8B2025:35611C544D8B2025 bits:24190 flags:0
[Fri Sep  3 05:13:43 2021] block drbd4: peer A6D38D85296346F2:0000000000000000:35621C544D8B2024:35611C544D8B2025 bits:0 flags:0
[Fri Sep  3 05:13:43 2021] block drbd4: uuid_compare()=1 by rule 70
[Fri Sep  3 05:13:43 2021] block drbd4: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
[Fri Sep  3 05:13:43 2021] block drbd4: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 6324(2), total 6324; compression: 100.0%
[Fri Sep  3 05:13:43 2021] block drbd4: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 6324(2), total 6324; compression: 100.0%
[Fri Sep  3 05:13:43 2021] block drbd4: helper command: /sbin/drbdadm before-resync-source minor-4
[Fri Sep  3 05:13:44 2021] block drbd4: helper command: /sbin/drbdadm before-resync-source minor-4 exit code 0 (0x0)
[Fri Sep  3 05:13:44 2021] block drbd4: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
[Fri Sep  3 05:13:44 2021] block drbd4: Began resync as SyncSource (will sync 96812 KB [24203 bits set]).
[Fri Sep  3 05:13:44 2021] block drbd4: updated sync UUID 77FE0AE3F4F3C5FF:A6D48D85296346F3:A6D38D85296346F3:35621C544D8B2025
[Fri Sep  3 05:13:50 2021] block drbd4: Resync done (total 6 sec; paused 0 sec; 16132 K/sec)
[Fri Sep  3 05:13:50 2021] block drbd4: 2 % had equal checksums, eliminated: 2472K; transferred 94340K total 96812K
[Fri Sep  3 05:13:50 2021] block drbd4: updated UUIDs 77FE0AE3F4F3C5FF:0000000000000000:A6D48D85296346F3:A6D38D85296346F3
[Fri Sep  3 05:13:50 2021] block drbd4: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
[Fri Sep  3 05:13:50 2021] drbd tools: helper command: /sbin/drbdadm unfence-peer tools
[Fri Sep  3 05:13:50 2021] drbd tools: helper command: /sbin/drbdadm unfence-peer tools exit code 0 (0x0)

It might be relevant also that the time and the kernel time are different (dmesg entries are in the future):

root@labstore1004:~# dmesg -T
...
[Fri Sep  3 05:13:50 2021] drbd tools: helper command: /sbin/drbdadm unfence-peer tools exit code 0 (0x0)

root@labstore1004:~# date
Fri Sep  3 05:05:27 UTC 2021

From the graphs (https://grafana-rw.wikimedia.org/d/000000568/labstore1004-1005-1006-1007?orgId=1) it seems that labstore1005, the pair for drbd4:

root@labstore1004:~# grep -B 1 drbd4 /etc/drbd.d/*
/etc/drbd.d/tools.res-  on labstore1004 {
/etc/drbd.d/tools.res:    device    /dev/drbd4;
--
/etc/drbd.d/tools.res-  on labstore1005 {
/etc/drbd.d/tools.res:    device    /dev/drbd4;

Had some trouble at ~02:30 UTC. From the journalctl on labstore1004 that's when the above errors began.
Looking labstore1005

On labstore1005 side:

Sep 03 02:29:42 labstore1005 kernel: drbd tools: meta connection shut down by peer.
Sep 03 02:29:42 labstore1005 kernel: drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 03 02:29:52 labstore1005 kernel: drbd misc: meta connection shut down by peer.
Sep 03 02:29:52 labstore1005 kernel: drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )

On dmesg there's many suspicious logs about the scisi driver, though those started yesterday, it might be related:

[Thu Sep  2 09:06:28 2021] drbd tools: meta connection shut down by peer.
[Thu Sep  2 09:06:28 2021] drbd tools: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[Thu Sep  2 09:06:29 2021] megaraid_sas 0000:03:00.0: Iop2SysDoorbellIntfor scsi0
[Thu Sep  2 09:06:30 2021] megaraid_sas 0000:03:00.0: Found FW in FAULT state, will reset adapter scsi0.
[Thu Sep  2 09:06:30 2021] megaraid_sas 0000:03:00.0: resetting fusion adapter scsi0.
[Thu Sep  2 09:06:31 2021] drbd misc: meta connection shut down by peer.
[Thu Sep  2 09:06:31 2021] drbd misc: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
[Thu Sep  2 09:06:40 2021] megaraid_sas 0000:03:00.0: Waiting for FW to come to ready state
[Thu Sep  2 09:06:49 2021] megaraid_sas 0000:03:00.0: FW now in Ready state
[Thu Sep  2 09:06:49 2021] megaraid_sas 0000:03:00.0: Current firmware maximum commands: 928     LDIO threshold: 0
[Thu Sep  2 09:06:49 2021] megaraid_sas 0000:03:00.0: FW supports sync cache    : No
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: Init cmd success
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: firmware type     : Extended VD(240 VD)firmware
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: controller type   : MR(1024MB)
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)      : Enabled
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: Secure JBOD support       : No
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: Jbod map is not supported megasas_setup_jbod_map 4980
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: Reset successful for scsi0.
[Thu Sep  2 09:06:50 2021] megaraid_sas 0000:03:00.0: 67083 (683889164s/0x0020/CRIT) - Controller encountered a fatal error and was reset

it seems that the hardware is starting to fail.