es2021 (B3) lost power supply redundancy
Closed, ResolvedPublic

Description

@Papaul can you check the status of the es2021 power supplies? It lives in B3 and we just got this alert:

<+icinga-wm> PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures

Thanks!

Event Timeline

Marostegui renamed this task from es2021 (B3) now power supply redudancy to es2021 (B3) lost power supply redundancy.Thu, Aug 4, 8:55 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-08-05T13:27:09Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool hosts with fragile power supply (T314559 T314628)', diff saved to https://phabricator.wikimedia.org/P32292 and previous config saved to /var/cache/conftool/dbconfig/20220805-132709-ladsgroup.json

This is complete

@Ladsgroup @Marostegui for reference, these are the two one-liners I am using on cumin2002 to check the latest 2 million rows of each table:

# mysql.py -BN -h es1021 -e "SELECT table_schema FROM information_schema.tables WHERE table_name = 'blobs_cluster26' ORDER BY table_schema" | while read db; do mysql.py -BN -h es1021 $db -e "SELECT '$db', max(blob_id) FROM blobs_cluster26"; done | tee tables_to_check.txt

# grep -v NULL tables_to_check.txt | while read db rows; do echo -e "\n== $db ==\n"; db-compare $db blobs_cluster26 blob_id es1021 es2021 --step=100 --from-value=$(($rows - 2000000)) || break ; done

I chose 2 million rows as that should cover well beyond 2 days of data, while not checking the full 90 million enwiki rows or the 300 million wikidata rows from previous years.
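For clarity, the `--from-value` arithmetic in the second one-liner can be sketched in plain bash. The 2 000 000 offset is the one chosen above; the `rows` value here is a hypothetical stand-in for the real `max(blob_id)` produced by the first one-liner:

```shell
# Hypothetical max(blob_id) for one table (in the real run this comes
# from tables_to_check.txt, written by the first one-liner).
rows=91234567

# Offset chosen in the task: only compare the latest 2 million rows.
offset=2000000

# Starting blob_id handed to db-compare via --from-value.
from_value=$((rows - offset))
echo "$from_value"
```

So for a table with ~91 million rows, only blob_ids above ~89 million are compared, rather than the whole table.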

Mentioned in SAL (#wikimedia-operations) [2022-08-09T08:24:18Z] <jynus> starting data check using es1021 and es2021, expect increased read traffic T314559

After 38 hours of checking and 7 million rows compared against eqiad's es1021, I can confidently say the data was in a good state after the crash.

* Sorry, there were 7 million SELECT operations, which meant between 100 million and 700 million rows
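The upper bound in that correction follows from the `--step=100` passed to db-compare: each SELECT covers up to 100 consecutive blob_ids, so (a rough sketch, assuming every SELECT scans the full step):

```shell
selects=7000000   # SELECT operations issued during the check
step=100          # --step value from the db-compare one-liner

# Upper bound on rows compared: one full step per SELECT.
max_rows=$((selects * step))
echo "$max_rows"
```

The lower end of the quoted range reflects that many steps return fewer than 100 rows (gaps in blob_id, end-of-table batches).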