
Degraded RAID on elastic2049
Closed, Resolved · Public · 2 Estimated Story Points

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host elastic2049. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid0 sda2[0] sdb2[1]
      3066771456 blocks super 1.2 512k chunks
      
md0 : active raid1 sdb1[1]
      29279232 blocks super 1.2 [2/1] [_U]
      
unused devices: <none>
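
The [_U] state on md0 means the first member of the mirror (sda1) has dropped out while sdb1 is still active. A minimal sketch of how to inspect the array further, assuming standard mdadm tooling on the host:

$ sudo mdadm --detail /dev/md0          # per-member state, shows which device failed or was removed
$ cat /proc/mdstat                      # raw kernel view of all md arrays
$ sudo dmesg | grep -iE 'ata|md0|sda'   # kernel log around the time the member dropped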

Event Timeline

This is still an ongoing issue. I tried reimaging to Bullseye, but the installer cannot detect any hard drives.

DC Ops, are you able to take a look?

The installer error message said:

No root file system

No root file system is defined.

Please correct this from the partitioning menu.

This can have two possible reasons:
1- You have not created a valid Linux partition
2- A Linux partition exists, but you have not defined the root partition ("/")
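
Since the installer cannot detect any hard drives (per the comment above), the partitioner has nothing to offer a root partition on, which produces exactly this error. A quick sanity check from a debian-installer shell console, as a sketch (device names are examples only):

$ cat /proc/partitions              # empty apart from ram/loop devices if no disks were detected
$ ls /dev/sd*                       # errors out if no SATA/SCSI disks exist
$ dmesg | grep -iE 'ata|sd[a-z]'    # look for link-down or probe errors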

@Papaul the iDRAC does not detect any hard drives.

I checked under "Storage" in the Web UI, and it says " RAC0501: There are no physical disks to be displayed. 1. Check if the host system is powered off or shutdown. 2. Check if the physical disks are inserted into the enclosure or attached to the backplane."

According to the system information, the host is powered on. Let me know if you have any other suggestions on next steps.

@bking the iDRAC will only detect hard drives under Storage in the Web UI if the system has a HW RAID controller.

I can check on the physical disks when I am on site tomorrow.
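
For completeness: on hosts that do have a supported controller, the physical disks can also be queried through racadm rather than the Web UI. A sketch only; the iDRAC address and credentials are placeholders:

$ racadm -r <idrac-address> -u <user> -p <password> storage get pdisks   # remote, from a management host
$ sudo racadm storage get pdisks                                         # local, if the Dell racadm tools are installed on the host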

From the other ticket these are the messages that were coming in on dmesg before the reimage was attempted:

[Tue Jul  5 17:45:27 2022] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[Tue Jul  5 17:45:27 2022] ata2: irq_stat 0x00000040, connection status changed
[Tue Jul  5 17:45:27 2022] ata2: SError: { CommWake DevExch }
[Tue Jul  5 17:45:27 2022] ata2: hard resetting link
[Tue Jul  5 17:45:28 2022] ata2: SATA link down (SStatus 0 SControl 300)
[Tue Jul  5 17:45:28 2022] ata2: EH complete

@EBernhardson thanks, the "SATA link down" line is helpful; it tells me I need to check the connection from the main board to the disks. I will look into it once on site.
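
A sketch of checks that could be run from the still-running OS before the onsite visit, assuming sdb is the surviving disk and smartmontools is installed (the mapping of the dropped ata2 link to host1 is an assumption and should be verified on the machine):

$ sudo smartctl -H /dev/sdb                                  # SMART health of the surviving disk
$ echo '- - -' | sudo tee /sys/class/scsi_host/host1/scan    # ask the kernel to rescan the SATA host behind the dropped link
$ sudo dmesg --follow | grep -i ata2                         # watch whether the link comes back or drops again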

Change 813974 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: promote new master

https://gerrit.wikimedia.org/r/813974

Change 813974 merged by Ryan Kemper:

[operations/puppet@production] elastic: promote new master

https://gerrit.wikimedia.org/r/813974

Mentioned in SAL (#wikimedia-operations) [2022-07-15T06:08:37Z] <ryankemper> T311939 Updated list of masters for psi-codfw search to elastic2027.codfw.wmnet:9700,elastic2029.codfw.wmnet:9700,elastic2054.codfw.wmnet:9700

Following the method in https://phabricator.wikimedia.org/T294805#7701855, set the new codfw psi seeds:

With:

ryankemper@mwmaint1002:~/elastic$ cat psi_codfw_masters.lst
elastic2027.codfw.wmnet:9700
elastic2029.codfw.wmnet:9700
elastic2054.codfw.wmnet:9700

Ran:

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9243/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
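
To verify that the new seeds were applied, the settings can be read back from each cluster endpoint. A sketch (jq is optional, and the exact path under the persistent settings depends on how push_cross_cluster_conf.py writes the remote cluster configuration):

curl -s 'https://search.svc.codfw.wmnet:9243/_cluster/settings?filter_path=persistent' | jq .
curl -s 'https://search.svc.codfw.wmnet:9443/_cluster/settings?filter_path=persistent' | jq .
curl -s 'https://search.svc.codfw.wmnet:9643/_cluster/settings?filter_path=persistent' | jq .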

I looked into this yesterday and today. It looks like we are having some HW issues on this server, and unfortunately the server is out of warranty, so we cannot get support from Dell. I will look into this again later.

Gehel added subscribers: Papaul, Gehel.

We have enough spare capacity in that cluster, and this server should be scheduled for refresh next year. Let's not spend more time investigating this, but decommission the server instead.

Change 817346 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic2049

https://gerrit.wikimedia.org/r/817346

There is a decom task for this server, so we can resolve this.
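
For reference, decommissioning is normally driven by the standard SRE cookbook from a cluster management host; a sketch only, with the task ID left as a placeholder (take the exact invocation and task reference from the decom task itself):

$ sudo cookbook sre.hosts.decommission elastic2049.codfw.wmnet -t <decom-task-id>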