
2024-09-10: hardware error on cloudvirt2004-dev
Open, Medium, Public

Description

The node had a hardware issue and completely lost access to its hard drive, rendering it unusable and requiring a hard reboot via IPMI.

Found out because the VMs hosted on it had stopped running (shown as unowned in the Horizon logs).

The node was still running, just unable to read from or write to the hard drive: it threw many I/O errors whenever you tried anything and failed to find some binaries (e.g. reboot).

After the hard reboot it came back online OK: dmesg is clean, and the journal only has logs up to 5 minutes before the VMs it hosted stopped (nothing suspicious before that on a first pass).

Event Timeline

dcaro triaged this task as Medium priority. Sep 10 2024, 3:22 PM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-11T07:41:28Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt2004-dev.codfw.wmnet' (T374467)

Yep, this morning it woke up with more memory corruption errors. I'm draining it while we wait for the memory replacement:

[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4                                                                                                                                                               
[Wed Sep 11 03:38:30 2024] mce: Uncorrected hardware memory error in user-access at 5805fddf00                                                                                                                                                                          
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]: event severity: recoverable                                                                                                                                                                                             
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:  Error 0, type: recoverable                                                                                                                                                                                             
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:  fru_text: B7                                                                                                                                                                                                           
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   section_type: memory error                                                                                                                                                                                            
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)                                                                                                                                                      
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address: 0x0000005805fda280                                                                                                                                                                                  
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address_mask: 0xffffffffffffffc0                                                                                                                                                                             
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   node:1 card:3 module:0 rank:1 bank:14 device:0 row:48175 column:656                                                                                                                                                   
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   error_type: 3, multi-bit ECC                                                                                                                                                                                          
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000                                                                                                                                                                        
[Wed Sep 11 03:38:30 2024] mce: [Hardware Error]: Machine check events logged                                                                                                                                                                                           
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:  Error 1, type: recoverable                                                                                                                                                                                             
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:  fru_text: B7                                                                                                                                                                                                           
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   section_type: memory error                                                                                                                                                                                            
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)                                                                                                                                                      
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address: 0x0000005805fdb600                                                                                                                                                                                  
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address_mask: 0xffffffffffffffc0                                                                                                                                                                             
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   node:1 card:3 module:0 rank:1 bank:14 device:0 row:48175 column:736                                                                                                                                                   
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   error_type: 3, multi-bit ECC                                                                                                                                                                                          
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   DIMM location: not present. DMI handle: 0x0000                                                                                                                                                                        
[Wed Sep 11 03:38:30 2024] Memory failure: 0x5805fdd: Sending SIGBUS to puppet:323524 due to hardware memory corruption                                                                                                                                                 
[Wed Sep 11 03:38:30 2024] mce: [Hardware Error]: CPU 1: Machine Check Exception: 7 Bank 1: bd80000000100134
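For reading a log like the one above, here is a minimal Python sketch that pulls out the `fru_text` labels and turns each `physical_address` into a page frame number. The function name `summarize_memory_errors` is hypothetical, and the 12-bit shift assumes 4 KiB pages; it is not part of any existing tooling. With that assumption, the mce address 5805fddf00 maps to page 0x5805fdd, which matches the "Memory failure: 0x5805fdd" line.

```python
import re

# Matches "physical_address: 0x..." but not "physical_address_mask: 0x...".
ADDR_RE = re.compile(r"physical_address:\s*(0x[0-9a-fA-F]+)")
# Matches the FRU label reported by the firmware, e.g. "fru_text: B7".
FRU_RE = re.compile(r"fru_text:\s*(\S+)")

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def summarize_memory_errors(log_text):
    """Return (set of fru_text labels, sorted list of affected page frame numbers)."""
    frus = set(FRU_RE.findall(log_text))
    pfns = sorted({int(addr, 16) >> PAGE_SHIFT for addr in ADDR_RE.findall(log_text)})
    return frus, pfns


if __name__ == "__main__":
    # A few lines copied from the log above.
    sample = """\
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:  fru_text: B7
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address: 0x0000005805fda280
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address_mask: 0xffffffffffffffc0
[Wed Sep 11 03:38:30 2024] {3}[Hardware Error]:   physical_address: 0x0000005805fdb600
"""
    frus, pfns = summarize_memory_errors(sample)
    print(sorted(frus), [hex(p) for p in pfns])
    # The mce address from the log, shifted the same way:
    print(hex(int("5805fddf00", 16) >> PAGE_SHIFT))
```

Both error records report the same `fru_text` (B7) and the same row (48175), which is consistent with a single faulty module.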

Mentioned in SAL (#wikimedia-cloud-feed) [2024-09-11T07:49:29Z] <wmbot~dcaro@urcuchillay> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt2004-dev.codfw.wmnet' (T374467)

The DIMM has been replaced. Now we have to double-check that everything looks OK and put the node back in the pool.
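One possible way to do part of that double-check is to read the kernel's EDAC error counters after the node has been up for a while. This is only a sketch: it assumes an EDAC driver is loaded and exposes `ce_count` / `ue_count` under `/sys/devices/system/edac/mc/`, and the helper name `read_edac_counts` is hypothetical. It is not the team's documented procedure.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}} for each mcN found.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict if the directory is missing (e.g. EDAC not loaded).
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

Non-zero `ue_count` values after the replacement would be a reason to hold off on putting the node back in the pool.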