
db2068 storage crash
Closed, Resolved · Public

Description

Can't log in via SSH:

ssh db2068.codfw.wmnet
-bash: /etc/profile: Input/output error
-bash: /usr/bin/tput: Input/output error
[10075532.509947] sd 0:1:0:0: rejecting I/O to offline device
[10075532.537719] sd 0:1:0:0: rejecting I/O to offline device
db2068 login:
[10075537.246063] sd 0:1:0:0: rejecting I/O to offline device
db2068 login:
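
The "Input/output error" text printed by bash is the standard message for errno 5 (EIO), the same code that shows up as "I/O error -5" and "error 5" in the kernel log quoted further down in this task. A minimal sketch for looking the code up, assuming a Python interpreter is available on some healthy host (the variable names are illustrative only):

```python
import errno
import os

# The kernel reports the failure as "-5"; errno tables use the positive value.
code = 5

# Symbolic name of the error code.
name = errno.errorcode[code]
print(name)

# Human-readable message; the exact wording depends on the platform's C library.
print(os.strerror(code))
```

On Linux the message is expected to match what bash printed, but the wording comes from the C library, so it may differ elsewhere.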

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 20 2017, 6:12 AM

Mentioned in SAL (#wikimedia-operations) [2017-11-20T06:15:56Z] <marostegui> Reboot db2068 - T180927

From the syslog servers:

Nov 20 00:44:05 db2068 kernel: [10055954.989428] hpsa 0000:02:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
Nov 20 00:44:20 db2068 kernel: [10055970.157686] hpsa 0000:02:00.0: Controller lockup detected: 0x00140000 after 30
Nov 20 00:44:20 db2068 kernel: [10055970.157726] hpsa 0000:02:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000
Nov 20 00:44:20 db2068 kernel: [10055970.157728] hpsa 0000:02:00.0: Controller lockup detected during reset wait
Nov 20 00:44:20 db2068 kernel: [10055970.157731] hpsa 0000:02:00.0: scsi 0:1:0:0: reset logical  failed Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1
Nov 20 00:44:20 db2068 kernel: [10055970.157740] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157742] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157743] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157744] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157746] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157747] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157747] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157748] hpsa 0000:02:00.0: failed 62 commands in fail_all
Nov 20 00:44:20 db2068 kernel: [10055970.157749] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157750] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157751] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157752] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157753] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157753] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157754] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157755] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157756] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157756] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157757] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157758] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157759] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157759] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157760] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157761] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157761] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157762] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157763] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157764] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157764] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157765] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157766] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157767] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157767] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157768] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157769] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157770] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157770] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157771] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157772] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157773] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157774] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157774] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157775] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157776] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157777] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157778] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157779] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157779] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157780] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157781] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157782] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157782] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157783] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157784] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157784] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157785] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157786] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157787] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157787] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157788] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157789] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157790] sd 0:1:0:0: Device offlined - not ready after error recovery
Nov 20 00:44:20 db2068 kernel: [10055970.157797] sd 0:1:0:0: [sda] tag#60 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.157801] sd 0:1:0:0: [sda] tag#60 CDB: Write(16) 8a 00 00 00 00 00 02 56 4d 68 00 00 00 08 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.157803] blk_update_request: I/O error, dev sda, sector 39210344
Nov 20 00:44:20 db2068 kernel: [10055970.187271] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1444647 (offset 0 size 0 starting block 4901294)
Nov 20 00:44:20 db2068 kernel: [10055970.187274] Buffer I/O error on device sda1, logical block 4900909
Nov 20 00:44:20 db2068 kernel: [10055970.216484] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:20 db2068 kernel: [10055970.241156] sd 0:1:0:0: [sda] killing request
Nov 20 00:44:20 db2068 kernel: [10055970.241159] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:20 db2068 kernel: [10055970.265954] sd 0:1:0:0: [sda] tag#59 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.265955] sd 0:1:0:0: [sda] tag#59 CDB: Write(16) 8a 00 00 00 00 00 01 e9 7f f0 00 00 00 08 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.265956] blk_update_request: I/O error, dev sda, sector 32079856
Nov 20 00:44:20 db2068 kernel: [10055970.266044] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:20 db2068 kernel: [10055970.266108] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:20 db2068 kernel: [10055970.266172] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:20 db2068 kernel: [10055970.266231] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:20 db2068 kernel: [10055970.394359] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1446123 (offset 0 size 0 starting block 4009983)
Nov 20 00:44:20 db2068 kernel: [10055970.394362] Buffer I/O error on device sda1, logical block 4009598
Nov 20 00:44:20 db2068 kernel: [10055970.423731] sd 0:1:0:0: [sda] tag#58 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.423732] sd 0:1:0:0: [sda] tag#58 CDB: Write(16) 8a 00 00 00 00 00 01 7c c3 d8 00 00 00 08 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.423732] blk_update_request: I/O error, dev sda, sector 24953816
Nov 20 00:44:20 db2068 kernel: [10055970.453148] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1444953 (offset 0 size 0 starting block 3119228)
Nov 20 00:44:20 db2068 kernel: [10055970.453163] Buffer I/O error on device sda1, logical block 3118843
Nov 20 00:44:20 db2068 kernel: [10055970.482250] sd 0:1:0:0: [sda] tag#57 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.482251] sd 0:1:0:0: [sda] tag#57 CDB: Write(16) 8a 00 00 00 00 00 01 6c 51 f0 00 00 00 08 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.482251] blk_update_request: I/O error, dev sda, sector 23876080
Nov 20 00:44:20 db2068 kernel: [10055970.511787] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1444623 (offset 0 size 0 starting block 2984511)
Nov 20 00:44:20 db2068 kernel: [10055970.511789] Buffer I/O error on device sda1, logical block 2984126
Nov 20 00:44:20 db2068 kernel: [10055970.540868] sd 0:1:0:0: [sda] tag#56 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.540869] sd 0:1:0:0: [sda] tag#56 CDB: Write(16) 8a 00 00 00 00 00 e3 9a 0a 10 00 00 00 10 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.540869] blk_update_request: I/O error, dev sda, sector 3818523152
Nov 20 00:44:20 db2068 kernel: [10055970.571249] sd 0:1:0:0: [sda] tag#55 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.571250] sd 0:1:0:0: [sda] tag#55 CDB: Write(16) 8a 00 00 00 00 00 db e1 44 40 00 00 00 20 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.571250] blk_update_request: I/O error, dev sda, sector 3688973376
Nov 20 00:44:20 db2068 kernel: [10055970.601571] sd 0:1:0:0: [sda] tag#54 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.601571] sd 0:1:0:0: [sda] tag#54 CDB: Write(16) 8a 00 00 00 00 00 db e1 44 20 00 00 00 10 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.601573] blk_update_request: I/O error, dev sda, sector 3688973344
Nov 20 00:44:20 db2068 kernel: [10055970.631973] sd 0:1:0:0: [sda] tag#53 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.631974] sd 0:1:0:0: [sda] tag#53 CDB: Write(16) 8a 00 00 00 00 00 db e1 44 00 00 00 00 10 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.631975] blk_update_request: I/O error, dev sda, sector 3688973312
Nov 20 00:44:20 db2068 kernel: [10055970.662570] sd 0:1:0:0: [sda] tag#52 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.662571] sd 0:1:0:0: [sda] tag#52 CDB: Write(16) 8a 00 00 00 00 00 d4 93 5e 00 00 00 00 10 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.662572] blk_update_request: I/O error, dev sda, sector 3566427648
Nov 20 00:44:20 db2068 kernel: [10055970.692808] sd 0:1:0:0: [sda] tag#51 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Nov 20 00:44:20 db2068 kernel: [10055970.692809] sd 0:1:0:0: [sda] tag#51 CDB: Write(16) 8a 00 00 00 00 00 cf a6 f0 00 00 00 00 10 00 00
Nov 20 00:44:20 db2068 kernel: [10055970.692809] blk_update_request: I/O error, dev sda, sector 3483824128
Nov 20 00:44:20 db2068 kernel: [10055970.723833] Buffer I/O error on dev dm-0, logical block 514972388, lost async page write
Nov 20 00:44:20 db2068 kernel: [10055970.761836] Buffer I/O error on dev dm-0, logical block 514972384, lost async page write
Nov 20 00:44:20 db2068 kernel: [10055970.799846] Buffer I/O error on dev dm-0, logical block 82135950, lost async page write
Nov 20 00:44:20 db2068 kernel: [10055970.837345] Buffer I/O error on dev sda1, logical block 5877442, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055970.874223] Buffer I/O error on dev sda1, logical block 5866077, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055970.911242] Buffer I/O error on dev sda1, logical block 5865498, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055970.948297] Buffer I/O error on dev sda1, logical block 5865477, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055970.985984] Buffer I/O error on dev sda1, logical block 5767189, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055971.024397] Buffer I/O error on dev sda1, logical block 5767168, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055971.062549] Buffer I/O error on dev sda1, logical block 4892481, lost async page write
Nov 20 00:44:21 db2068 kernel: [10055971.101622] EXT4-fs error (device sda1): ext4_find_entry:1463: inode #443233: comm cron: reading directory lblock 0
Nov 20 00:44:21 db2068 kernel: [10055971.101635] XFS (dm-0): metadata I/O error: block 0xde038210 ("xfs_buf_iodone_callback_error") error 5 numblks 16
Nov 20 00:44:21 db2068 kernel: [10055971.101650] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101672] Aborting journal on device sda1-8.
Nov 20 00:44:23 db2068 kernel: [10055971.101705] EXT4-fs error (device sda1) in ext4_reserve_inode_write:5418: Journal has aborted
Nov 20 00:44:23 db2068 kernel: [10055971.101711] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101715] EXT4-fs (sda1): previous I/O error to superblock detected
Nov 20 00:44:23 db2068 kernel: [10055971.101718] JBD2: Error -5 detected when updating journal superblock for sda1-8.
Nov 20 00:44:23 db2068 kernel: [10055971.101726] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101729] EXT4-fs (sda1): Remounting filesystem read-only
Nov 20 00:44:23 db2068 kernel: [10055971.101777] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101780] EXT4-fs warning (device sda1): ext4_end_bio:314: I/O error -5 writing to inode 1446057 (offset 0 size 0 starting block 7710845)
Nov 20 00:44:23 db2068 kernel: [10055971.101781] Buffer I/O error on device sda1, logical block 7710460
Nov 20 00:44:23 db2068 kernel: [10055971.101801] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101802] JBD2: Detected IO errors while flushing file data on sda1-8
Nov 20 00:44:23 db2068 kernel: [10055971.101809] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101815] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101822] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101829] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101835] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101841] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101845] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101850] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101854] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101858] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101862] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101867] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101871] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101875] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101879] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101883] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101889] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101891] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101893] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101894] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101898] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101908] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101912] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101916] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101920] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101924] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101928] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101932] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101936] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101941] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101945] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101949] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101966] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101970] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101973] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101977] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101981] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101985] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101988] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101991] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101995] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.101999] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102003] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102007] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102011] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102015] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102018] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102022] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102026] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102030] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102039] XFS (dm-0): metadata I/O error: block 0x2be8c800 ("xfs_trans_read_buf_map") error 5 numblks 8
Nov 20 00:44:23 db2068 kernel: [10055971.102041] XFS (dm-0): xfs_do_force_shutdown(0x1) called from line 315 of file /build/linux-OExn4L/linux-4.9.30/fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffc09065c5
Nov 20 00:44:23 db2068 kernel: [10055971.102052] XFS (dm-0): metadata I/O error: block 0xcef52e00 ("xlog_iodone") error 5 numblks 512
Nov 20 00:44:23 db2068 kernel: [10055971.102054] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1200 of file /build/linux-OExn4L/linux-4.9.30/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08f5b66
Nov 20 00:44:23 db2068 kernel: [10055971.102082] XFS (dm-0): Log I/O Error Detected.  Shutting down filesystem
Nov 20 00:44:23 db2068 kernel: [10055971.102083] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
Nov 20 00:44:23 db2068 kernel: [10055971.102087] XFS (dm-0): metadata I/O error: block 0xcef53000 ("xlog_iodone") error 5 numblks 512
Nov 20 00:44:23 db2068 kernel: [10055971.102087] XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1200 of file /build/linux-OExn4L/linux-4.9.30/fs/xfs/xfs_log.c.  Return address = 0xffffffffc08f5b66
Nov 20 00:44:23 db2068 kernel: [10055971.102308] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.102314] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.106616] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.106622] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.111263] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.143399] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.235546] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055971.317403] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:23 db2068 kernel: [10055973.753263] EXT4-fs (sda1): previous I/O error to superblock detected
Nov 20 00:44:23 db2068 kernel: [10055973.792499] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:24 db2068 kernel: [10055973.826897] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:24 db2068 kernel: [10055973.857363] EXT4-fs error (device sda1): ext4_read_inode_bitmap:223: comm cron: Cannot read inode bitmap - block_group = 54, inode_bitmap = 1773442
Nov 20 00:44:24 db2068 kernel: [10055973.929626] EXT4-fs (sda1): previous I/O error to superblock detected
Nov 20 00:44:24 db2068 kernel: [10055973.965457] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:24 db2068 kernel: [10055974.568052] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:24 einsteinium icinga: SERVICE ALERT: db2068;MariaDB disk space;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Nov 20 00:44:24 db2068 kernel: [10055974.596529] EXT4-fs warning (device sda1): htree_dirblock_to_tree:962: inode #1469664: lblock 0: comm prometheus-node: error -5 reading directory block
Nov 20 00:44:25 db2068 kernel: [10055974.893447] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055974.919220] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055974.945258] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055974.970513] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 einsteinium icinga: SERVICE ALERT: db2068;Check whether ferm is active by checking the default input chain;OK;SOFT;2;OK ferm input default policy is set
Nov 20 00:44:25 db2068 kernel: [10055974.995963] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.021460] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.046314] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.071786] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.098470] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.124391] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.149885] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.174866] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.200280] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.394448] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:25 db2068 kernel: [10055975.643833] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:26 db2068 kernel: [10055975.893418] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:26 db2068 kernel: [10055976.143704] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:28 tegmen icinga: SERVICE ALERT: db2068;MariaDB Slave Lag: s7;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Nov 20 00:44:28 tegmen icinga: SERVICE ALERT: db2068;Disk space;UNKNOWN;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Nov 20 00:44:30 db2068 kernel: [10055980.064791] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.092570] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.120089] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.147507] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.177476] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.208844] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.239763] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.271876] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.303931] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.644380] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.670953] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.698545] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:30 db2068 kernel: [10055980.805219] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:31 db2068 kernel: [10055981.143889] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:31 db2068 kernel: [10055981.393932] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:44:31 db2068 kernel: [10055981.644061] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:46:01 db2068 kernel: [10056071.260407] sd 0:1:0:0: rejecting I/O to offline device
Nov 20 00:46:01 db2068 kernel: [10056071.285196] EXT4-fs error (device sda1): ext4_find_entry:1463: inode #443233: comm cron: reading directory lblock 0
Nov 20 00:46:01 db2068 kernel: [10056071.333952] EXT4-fs (sda1): previous I/O error to superblock detected
Nov 20 00:46:01 db2068 kernel: [10056071.364190] sd 0:1:0:0: rejecting I/O to offline device
<snip>
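
The failed writes in the log above each come as a pair: a "CDB: Write(16)" line and a "blk_update_request ... sector N" line. A minimal sketch of how the two relate, assuming the standard WRITE(16) layout (opcode 0x8a, an 8-byte big-endian logical block address in bytes 2-9, a 4-byte transfer length in bytes 10-13). The function name `decode_write16` is made up for this example and is not part of any tool used in this task:

```python
def decode_write16(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    return {
        # Bytes 2..9: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13: number of blocks to transfer, big-endian.
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# CDB copied from the tag#60 line in the log above.
tag60 = decode_write16("8a 00 00 00 00 00 02 56 4d 68 00 00 00 08 00 00")
print(tag60)  # {'lba': 39210344, 'blocks': 8}
```

The decoded address 39210344 is the same number the kernel reports in "I/O error, dev sda, sector 39210344", so the CDB lines add only the transfer length to what the blk_update_request lines already say.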
Marostegui closed this task as Resolved. · Nov 20 2017, 6:27 AM
Marostegui claimed this task.

A reboot fixed it: MySQL started fine and is now catching up.
The RAID also looks fine.
Closing this for now. If this happens again, we'll probably need a RAID controller replacement.

jcrespo reopened this task as Open. · Nov 20 2017, 10:59 AM
jcrespo added a project: ops-codfw.
jcrespo added a subscriber: jcrespo.

Maybe related: T102236

We need a BIOS upgrade and the HW logs.

Restricted Application added a project: Operations. · Nov 20 2017, 10:59 AM
Marostegui reassigned this task from Marostegui to Papaul. · Nov 20 2017, 11:01 AM
Marostegui added a subscriber: Papaul.

@Papaul can you help us with the BIOS upgrade?
@jcrespo there were no HW logs from the crash itself; there are only the typical ones from AFTER the crash, which don't say much:

</system1/log1>hpiLO-> show record9

status=0
status_tag=COMMAND COMPLETED
Mon Nov 20 11:01:00 2017



/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=11/20/2017
    time=06:19
    description=POST Error: 1719-A controller failure event occurred prior to this power-up
  Verbs
    cd version exit show set


</system1/log1>hpiLO-> show record10

status=0
status_tag=COMMAND COMPLETED
Mon Nov 20 11:01:02 2017



/system1/log1/record10
  Targets
  Properties
    number=10
    severity=Caution
    date=11/20/2017
    time=06:19
    description=POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.
  Verbs
    cd version exit show set
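
The two `show record` outputs above share the same shape: a "Properties" section made of `key=value` lines. A minimal sketch for turning one such record into a dictionary, assuming only the layout visible in the pasted text. `parse_ilo_record` is a hypothetical helper, not an iLO API, and the sample is copied from record9 above:

```python
def parse_ilo_record(text: str) -> dict:
    """Collect the key=value lines of one pasted iLO log record."""
    props = {}
    for line in text.splitlines():
        # Split on the first "=" only, so "=" inside a value would be preserved.
        key, sep, value = line.strip().partition("=")
        if sep and key:
            props[key] = value
    return props


SAMPLE = """\
/system1/log1/record9
  Targets
  Properties
    number=9
    severity=Caution
    date=11/20/2017
    time=06:19
    description=POST Error: 1719-A controller failure event occurred prior to this power-up
  Verbs
    cd version exit show set
"""

record = parse_ilo_record(SAMPLE)
print(record["severity"], record["date"], record["time"])
print(record["description"])
```

Lines without an "=" (section names such as "Targets" and the verb list) are skipped, which is enough for comparing severity, time and description across records.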

From the "health log":

4	 Critical	Drive Array	11/20/2017 00:33	06/10/2015 16:05	2	Drive Array Controller Failure (Slot 0)
Marostegui moved this task from Triage to In progress on the DBA board. · Nov 20 2017, 1:04 PM

Change 392431 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2068 for maintenance

https://gerrit.wikimedia.org/r/392431

Change 392431 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2068 for maintenance

https://gerrit.wikimedia.org/r/392431

Mentioned in SAL (#wikimedia-operations) [2017-11-20T15:45:08Z] <jynus> shutting down db2068 for maintenance after depool T180927

The iLO is up to date. I need to update the storage and BIOS on the system, but the Service Pack disk that I have is old. There is a new Service Pack ISO on the HP site that I need to download; the file is about 6.57 GB and I am on MiFi. I will download the file once I am home and update the system tomorrow.
http://h17007.www1.hpe.com/us/en/enterprise/servers/products/service_pack/spp/index.aspx?version=2017.04.0

Change 392609 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2068.yaml: Update socket location

https://gerrit.wikimedia.org/r/392609

Change 392609 merged by Marostegui:
[operations/puppet@production] db2068.yaml: Update socket location

https://gerrit.wikimedia.org/r/392609

Sounds good, as MySQL has been down since yesterday. I will turn off the system now so you can do it without being blocked on us to turn it off when you arrive on-site.
Thanks!

Mentioned in SAL (#wikimedia-operations) [2017-11-21T09:28:15Z] <marostegui> Shutdown db2068 for maintenance - T180927

Papaul reassigned this task from Papaul to Marostegui. · Nov 21 2017, 4:53 PM

Firmware update complete

Thanks @Papaul - I will start MySQL, let it run overnight and, if all goes fine, close this.
If this breaks again, we can contact the vendor and see how to proceed.

Marostegui closed this task as Resolved. · Nov 22 2017, 6:55 AM

Change 391835 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Setup s8 replica set on codfw

https://gerrit.wikimedia.org/r/391835

Change 391835 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Setup s8 replica set on codfw

https://gerrit.wikimedia.org/r/391835