
es2021 (B3) lost power supply redundancy
Closed, Resolved, Public

Description

Update: Since 2022-10-11:

es2021 IPMI Sensor Status 	CRITICAL 	2022-10-18 08:11:15 	6d 16h 5m 29s 	3/3 	Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical]

@Papaul can you check the status of es2021's power supplies? It lives in B3 and we just got this alert:

<+icinga-wm> PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures

Thanks!
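
For anyone following along, the PSU state can also be checked from the host itself with ipmitool (a minimal sketch; it assumes ipmitool is installed on es2021 and that you have root there):

# ipmitool sdr type "Power Supply"
# ipmitool sel elist | tail -n 20

The "PS Redundancy" sensor should go back to reading "Fully Redundant" once both supplies are healthy again.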

Event Timeline

Marostegui renamed this task from es2021 (B3) now power supply redudancy to es2021 (B3) lost power supply redundancy. Aug 4 2022, 8:55 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-08-05T13:27:09Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool hosts with fragile power supply (T314559 T314628)', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json
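
For reference, the depool itself comes down to something like the following on the cumin host (a sketch only; the actual diff is in the paste linked above, and the commit message mirrors the SAL entry):

# dbctl instance es2021 depool
# dbctl config commit -m "Depool hosts with fragile power supply (T314559 T314628)"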

This is complete

@Ladsgroup @Marostegui for reference, these are the two one-liners I am using on cumin2002 to check the latest 2 million rows of each table:

# mysql.py -BN -h es1021 -e "SELECT table_schema FROM information_schema.tables WHERE table_name = 'blobs_cluster26' ORDER BY table_schema" | while read db; do mysql.py -BN -h es1021 $db -e "SELECT '$db', max(blob_id) FROM blobs_cluster26"; done | tee tables_to_check.txt

# grep -v NULL tables_to_check.txt | while read db rows; do echo -e "\n== $db ==\n"; db-compare $db blobs_cluster26 blob_id es1021 es2021 --step=100 --from-value=$(($rows - 2000000)) || break ; done
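
The same two commands, split over several lines with comments for readability (here # marks a shell comment, not the root prompt; hosts, tables and tools are unchanged from the one-liners above):

# 1) Find every wiki that has a blobs_cluster26 table and record its highest blob_id.
mysql.py -BN -h es1021 -e "SELECT table_schema FROM information_schema.tables WHERE table_name = 'blobs_cluster26' ORDER BY table_schema" \
  | while read db; do
      mysql.py -BN -h es1021 "$db" -e "SELECT '$db', max(blob_id) FROM blobs_cluster26"
    done | tee tables_to_check.txt

# 2) For each wiki that returned a value, compare the last ~2 million blob_ids between es1021 and es2021.
grep -v NULL tables_to_check.txt | while read db rows; do
    echo -e "\n== $db ==\n"
    db-compare "$db" blobs_cluster26 blob_id es1021 es2021 --step=100 --from-value=$(($rows - 2000000)) || break
done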

I chose 2 million rows as that should cover well beyond 2 days of data, while not checking the full 90 million enwiki rows or the 300 million wikidata rows from previous years.

Mentioned in SAL (#wikimedia-operations) [2022-08-09T08:24:18Z] <jynus> starting data check using es1021 and es2021, expect increased read traffic T314559

After 38 hours of checking and 7 million rows* compared against eqiad's es1021, I can confidently say that data was in a good state after the crash.

* Sorry, there were 7 million SELECT operations, which meant between 100 million and 700 million rows.
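
For the curious, the range follows from the --step=100 setting (my reading of the numbers; the exact per-chunk counts depend on how dense the blob_id ranges are):

# Each SELECT compares one chunk of at most 100 consecutive blob_ids (--step=100):
echo $(( 7000000 * 100 ))   # 700000000, the upper bound if every chunk were fully populated
# Gaps in the id ranges shrink the chunks, hence a lower bound closer to 100 million.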

jcrespo moved this task from Done to Refine on the DBA board.

@Papaul, this is recurring; my guess is that the cable is a poor fit, so it came loose again. Assuming that is the case (or if you can provide further insight), maybe requesting a cable or power unit replacement would be preferred, or securing it with a tie? Maybe it is a fluke because another nearby host is regularly serviced; only you would know! 0:-) Let me know what we can do to prevent it from reoccurring.

Given the importance of the host, please let us know before handling it, so we can depool it and stop the server first to prevent another accidental power loss; shutting it down takes very little time.

Replaced both power cords and upgraded the iDRAC. The system is back online.

Thank you, Papaul, that seems to have fixed it.

0d 0h 17m 5s 	1/3 	Sensor Type(s) Temperature, Power_Supply Status: OK