
es2021 (B3) lost power supply redundancy
Closed, Resolved, Public

Description

Update: Since 2022-10-11:

es2021 IPMI Sensor Status 	CRITICAL 	2022-10-18 08:11:15 	6d 16h 5m 29s 	3/3 	Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical]

@Papaul can you check the status of es2021's power supplies? It lives in B3 and we just got this alert:

<+icinga-wm> PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures

Thanks!
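
For anyone following along, the PSU state can also be checked from the host itself with ipmitool (a minimal sketch; it assumes ipmitool is installed on es2021 and that you have root there):

# ipmitool sdr type "Power Supply"
# ipmitool sel elist | tail -n 20

The "PS Redundancy" sensor should go back to reading "Fully Redundant" once both supplies are healthy again.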

Event Timeline

Marostegui renamed this task from es2021 (B3) now power supply redudancy to es2021 (B3) lost power supply redundancy. Aug 4 2022, 8:55 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-08-05T13:27:09Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool hosts with fragile power supply (T314559 T314628)', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json
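
For reference, the depool itself comes down to something like the following on the cumin host (a sketch only; the actual diff is in the paste linked above, and the commit message mirrors the SAL entry):

# dbctl instance es2021 depool
# dbctl config commit -m "Depool hosts with fragile power supply (T314559 T314628)"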

This is complete

@Ladsgroup @Marostegui for reference, these are the two one-liners I am using on cumin2002 to check the latest 2 million rows of each table:

# mysql.py -BN -h es1021 -e "SELECT table_schema FROM information_schema.tables WHERE table_name = 'blobs_cluster26' ORDER BY table_schema" | while read db; do mysql.py -BN -h es1021 $db -e "SELECT '$db', max(blob_id) FROM blobs_cluster26"; done | tee tables_to_check.txt

# grep -v NULL tables_to_check.txt | while read db rows; do echo -e "\n== $db ==\n"; db-compare $db blobs_cluster26 blob_id es1021 es2021 --step=100 --from-value=$(($rows - 2000000)) || break ; done
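
The same two commands, split over several lines with comments for readability (here # marks a shell comment, not the root prompt; hosts, tables and tools are unchanged from the one-liners above):

# 1) Find every wiki that has a blobs_cluster26 table and record its highest blob_id.
mysql.py -BN -h es1021 -e "SELECT table_schema FROM information_schema.tables WHERE table_name = 'blobs_cluster26' ORDER BY table_schema" \
  | while read db; do
      mysql.py -BN -h es1021 "$db" -e "SELECT '$db', max(blob_id) FROM blobs_cluster26"
    done | tee tables_to_check.txt

# 2) For each wiki that returned a value, compare the last ~2 million blob_ids between es1021 and es2021.
grep -v NULL tables_to_check.txt | while read db rows; do
    echo -e "\n== $db ==\n"
    db-compare "$db" blobs_cluster26 blob_id es1021 es2021 --step=100 --from-value=$(($rows - 2000000)) || break
done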

I chose 2 million rows as that should cover well beyond 2 days of data, while not checking the full 90 million enwiki rows or the 300 million wikidata rows from previous years.

Mentioned in SAL (#wikimedia-operations) [2022-08-09T08:24:18Z] <jynus> starting data check using es1021 and es2021, expect increased read traffic T314559

After 38 hours of checking and 7 million rows* compared against eqiad's es1021, I can confidently say that data was in a good state after the crash.

* Sorry, there were 7 million SELECT operations, which meant between 100 million and 700 million rows.
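
For the curious, the range follows from the --step=100 setting (my reading of the numbers; the exact per-chunk counts depend on how dense the blob_id ranges are):

# Each SELECT compares one chunk of at most 100 consecutive blob_ids (--step=100):
echo $(( 7000000 * 100 ))   # 700000000, the upper bound if every chunk were fully populated
# Gaps in the id ranges shrink the chunks, hence a lower bound closer to 100 million.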

jcrespo moved this task from Done to Refine on the DBA board.

@Papaul, this is recurring; my guess is that the cable is a poor fit, so it came loose again. Assuming that is the case (or if you can provide further insight), maybe requesting a cable or power unit replacement would be preferred, or securing it with a tie? Maybe it is a fluke because another nearby host is regularly serviced; only you would know! 0:-) Let me know what we can do to prevent it from reoccurring.

Given the importance of the host, please let us know before handling it, so we can depool it and stop the server first to prevent another accidental power loss; shutting it down takes very little time.

Replaced both power cords and upgraded the iDRAC. The system is back online.

Thank you, Papaul, that seems to have fixed it.

0d 0h 17m 5s 	1/3 	Sensor Type(s) Temperature, Power_Supply Status: OK