When I walked out of eqsin I noticed that cp5006/5007 had an amber blinking light that wasn't there when I walked in.
I checked all cables and Icinga but couldn't find anything wrong.
Looking at dmesg, they both have:
ayounsi@cp5007:~$ sudo dmesg | grep Error
[ 3.503337] ERST: Error Record Serialization Table (ERST) support is initialized.
[4929877.901512] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929877.901515] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929877.901516] {1}[Hardware Error]: event severity: corrected
[4929877.901518] {1}[Hardware Error]: Error 0, type: corrected
[4929877.901519] {1}[Hardware Error]: fru_text: A5
[4929877.901520] {1}[Hardware Error]: section_type: memory error
[4929877.901521] {1}[Hardware Error]: error_status: 0x0000000000000400
[4929877.901522] {1}[Hardware Error]: physical_address: 0x0000002b1d8edfc0
[4929877.901526] {1}[Hardware Error]: node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22987 column: 888
[4929877.901527] {1}[Hardware Error]: error_type: 2, single-bit ECC
[4929915.523101] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929915.523104] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929915.523105] {2}[Hardware Error]: event severity: corrected
[4929915.523107] {2}[Hardware Error]: Error 0, type: corrected
[4929915.523108] {2}[Hardware Error]: fru_text: A5
[4929915.523109] {2}[Hardware Error]: section_type: memory error
[4929915.523110] {2}[Hardware Error]: error_status: 0x0000000000000400
[4929915.523111] {2}[Hardware Error]: physical_address: 0x0000002bec92de80
[4929915.523114] {2}[Hardware Error]: node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22228 column: 888
[4929915.523115] {2}[Hardware Error]: error_type: 2, single-bit ECC
[4929915.523125] mce: [Hardware Error]: Machine check events logged
[4929928.059079] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929928.059081] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929928.059082] {3}[Hardware Error]: event severity: corrected
[4929928.059084] {3}[Hardware Error]: Error 0, type: corrected
[4929928.059085] {3}[Hardware Error]: fru_text: A5
[4929928.059086] {3}[Hardware Error]: section_type: memory error
[4929928.059088] {3}[Hardware Error]: error_status: 0x0000000000000400
[4929928.059089] {3}[Hardware Error]: physical_address: 0x0000000bf402df80
[4929928.059092] {3}[Hardware Error]: node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888
[4929928.059093] {3}[Hardware Error]: error_type: 2, single-bit ECC
[4929928.059102] mce: [Hardware Error]: Machine check events logged
[4929942.246950] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[4929942.246952] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[4929942.246953] {4}[Hardware Error]: event severity: corrected
[4929942.246954] {4}[Hardware Error]: Error 0, type: corrected
[4929942.246956] {4}[Hardware Error]: fru_text: A5
[4929942.246957] {4}[Hardware Error]: section_type: memory error
[4929942.246958] {4}[Hardware Error]: error_status: 0x0000000000000400
[4929942.246959] {4}[Hardware Error]: physical_address: 0x0000000bf402df80
[4929942.246963] {4}[Hardware Error]: node: 0 card: 0 module: 1 rank: 0 bank: 1 row: 22352 column: 888
[4929942.246964] {4}[Hardware Error]: error_type: 2, single-bit ECC
[4930172.678164] mce: [Hardware Error]: Machine check events logged
So maybe the amber light comes from the hosts cycling between hitting a memory error and the hardware self-correcting it?
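In case it helps with triage, here is a rough Python sketch for tallying the corrected errors per module from dmesg text. It assumes `fru_text` is the DIMM slot label and that the `{N}` record number is unique within one dmesg buffer; the `summarize` helper and the `SAMPLE` excerpt (taken from records 3 and 4 above) are only illustrative, not an existing tool.

```python
import re
from collections import Counter

# Matches e.g. "{1}[Hardware Error]: fru_text: A5"
FRU_RE = re.compile(r"\{(\d+)\}\[Hardware Error\]: fru_text: (\S+)")
# Matches e.g. "{1}[Hardware Error]: physical_address: 0x0000002b1d8edfc0"
ADDR_RE = re.compile(
    r"\{(\d+)\}\[Hardware Error\]: physical_address: (0x[0-9a-fA-F]+)"
)


def summarize(dmesg_text):
    """Return (error count per fru_text, distinct physical addresses per fru_text).

    Assumption: each {N} record number appears with exactly one fru_text.
    Works whether the log is one entry per line or flattened onto one line.
    """
    fru_by_record = dict(FRU_RE.findall(dmesg_text))
    counts = Counter(fru_by_record.values())
    addresses = {}
    for record, addr in ADDR_RE.findall(dmesg_text):
        fru = fru_by_record.get(record, "unknown")
        addresses.setdefault(fru, set()).add(addr)
    return counts, addresses


# Illustrative excerpt: only the fru_text and physical_address lines of
# records 3 and 4 from the cp5007 output.
SAMPLE = (
    "[4929928.059085] {3}[Hardware Error]: fru_text: A5\n"
    "[4929928.059089] {3}[Hardware Error]: physical_address: 0x0000000bf402df80\n"
    "[4929942.246956] {4}[Hardware Error]: fru_text: A5\n"
    "[4929942.246959] {4}[Hardware Error]: physical_address: 0x0000000bf402df80\n"
)

counts, addresses = summarize(SAMPLE)
for fru in sorted(counts):
    print(fru, counts[fru], sorted(addresses.get(fru, ())))
```

Feeding it the full `sudo dmesg` output from each host should show whether every corrected error lands on the same module, and whether the same physical address keeps recurring.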
I'm in Singapore until Sunday morning, please let me know if there is anything I can do onsite.