
Degraded RAID on elastic2049
Closed, Resolved · Public · 2 Estimated Story Points

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2049. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      3066771456 blocks super 1.2 512k chunks
      
md0 : active raid1 sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
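
The [_U] state on md0 means the first member of the mirror (sda1) has dropped out while sdb1 is still active. A minimal sketch of how to inspect the array further, assuming standard mdadm tooling on the host:

$ sudo mdadm --detail /dev/md0          # per-member state, shows which device failed or was removed
$ cat /proc/mdstat                      # raw kernel view of all md arrays
$ sudo dmesg | grep -iE 'ata|md0|sda'   # kernel log around the time the member dropped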

Event Timeline

This is still an ongoing issue. I tried reimaging to Bullseye, but the installer cannot detect any hard drives.

DC Ops, are you able to take a look?

The installer error message said:

No root file system

No root file system is defined.

Please correct this from the partitioning menu.

This can have two possible reasons:
1- You have not created a valid Linux partition
2- A Linux partition exists, but you have not defined the root partition ("/")
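
Since the installer cannot detect any hard drives (per the comment above), the partitioner has nothing to offer a root partition on, which produces exactly this error. A quick sanity check from a debian-installer shell console, as a sketch (device names are examples only):

$ cat /proc/partitions              # empty apart from ram/loop devices if no disks were detected
$ ls /dev/sd*                       # errors out if no SATA/SCSI disks exist
$ dmesg | grep -iE 'ata|sd[a-z]'    # look for link-down or probe errors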

@Papaul the iDRAC does not detect any hard drives.

I checked under "Storage" in the Web UI, and it says " RAC0501: There are no physical disks to be displayed. 1. Check if the host system is powered off or shutdown. 2. Check if the physical disks are inserted into the enclosure or attached to the backplane."

According to the system information, the host is powered on. Let me know if you have any other suggestions on next steps.

@bking the iDRAC will only detect hard drives under Storage in the Web UI if the system has a HW RAID controller.

I can check on the physical disks when I am on site tomorrow.
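
For completeness: on hosts that do have a supported controller, the physical disks can also be queried through racadm rather than the Web UI. A sketch only; the iDRAC address and credentials are placeholders:

$ racadm -r <idrac-address> -u <user> -p <password> storage get pdisks   # remote, from a management host
$ sudo racadm storage get pdisks                                         # local, if the Dell racadm tools are installed on the host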

From the other ticket these are the messages that were coming in on dmesg before the reimage was attempted:

[Tue Jul  5 17:45:27 2022] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[Tue Jul  5 17:45:27 2022] ata2: irq_stat 0x00000040, connection status changed
[Tue Jul  5 17:45:27 2022] ata2: SError: { CommWake DevExch }
[Tue Jul  5 17:45:27 2022] ata2: hard resetting link
[Tue Jul  5 17:45:28 2022] ata2: SATA link down (SStatus 0 SControl 300)
[Tue Jul  5 17:45:28 2022] ata2: EH complete

@EBernhardson thanks, the "SATA link down" line is helpful; it tells me I need to check the connection from the main board to the disks. I will look into it once on site.
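
A sketch of checks that could be run from the still-running OS before the onsite visit, assuming sdb is the surviving disk and smartmontools is installed (the mapping of the dropped ata2 link to host1 is an assumption and should be verified on the machine):

$ sudo smartctl -H /dev/sdb                                  # SMART health of the surviving disk
$ echo '- - -' | sudo tee /sys/class/scsi_host/host1/scan    # ask the kernel to rescan the SATA host behind the dropped link
$ sudo dmesg --follow | grep -i ata2                         # watch whether the link comes back or drops again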

Change 813974 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: promote new master

https://gerrit.wikimedia.org/r/813974

Change 813974 merged by Ryan Kemper:

[operations/puppet@production] elastic: promote new master

https://gerrit.wikimedia.org/r/813974

Mentioned in SAL (#wikimedia-operations) [2022-07-15T06:08:37Z] <ryankemper> T311939 Updated list of masters for psi-codfw search to elastic2027.codfw.wmnet:9700,elastic2029.codfw.wmnet:9700,elastic2054.codfw.wmnet:9700

Following the method in https://phabricator.wikimedia.org/T294805#7701855, set the new codfw psi seeds:

With:

ryankemper@mwmaint1002:~/elastic$ cat psi_codfw_masters.lst
elastic2027.codfw.wmnet:9700
elastic2029.codfw.wmnet:9700
elastic2054.codfw.wmnet:9700

Ran:

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9243/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
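
To verify that the new seeds were applied, the settings can be read back from each cluster endpoint. A sketch (jq is optional, and the exact path under the persistent settings depends on how push_cross_cluster_conf.py writes the remote cluster configuration):

curl -s 'https://search.svc.codfw.wmnet:9243/_cluster/settings?filter_path=persistent' | jq .
curl -s 'https://search.svc.codfw.wmnet:9443/_cluster/settings?filter_path=persistent' | jq .
curl -s 'https://search.svc.codfw.wmnet:9643/_cluster/settings?filter_path=persistent' | jq .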

I looked into this yesterday and today. It looks like we are having some HW issues on this server, and unfortunately the server is out of warranty, so we cannot get support from Dell. I will look into this again later.

Gehel added subscribers: Papaul, Gehel.

We have enough spare capacity in that cluster, and this server should be scheduled for refresh next year. Let's not spend more time investigating this, but decommission the server instead.

Change 817346 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic2049

https://gerrit.wikimedia.org/r/817346

There is a decom task for this server, so we can resolve this.
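
For reference, decommissioning is normally driven by the standard SRE cookbook from a cluster management host; a sketch only, with the task ID left as a placeholder (take the exact invocation and task reference from the decom task itself):

$ sudo cookbook sre.hosts.decommission elastic2049.codfw.wmnet -t <decom-task-id>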