
IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers
Closed, Resolved · Public

Description

15:21:47 <+icinga-wm> PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
15:23:25 <+icinga-wm> PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
15:23:28 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
15:23:33 <+icinga-wm> PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
15:23:49 <+icinga-wm> PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
15:24:05 <+icinga-wm> PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
15:25:03 <+icinga-wm> PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
15:25:19 <+icinga-wm> PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
15:41:45 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:42:59 <+icinga-wm> PROBLEM - IPMI Sensor Status on lvs5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:44:59 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:47:21 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:48:35 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:52:18 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:52:26 <+icinga-wm> PROBLEM - IPMI Sensor Status on ganeti5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:52:51 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:53:26 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:53:51 <+icinga-wm> PROBLEM - IPMI Sensor Status on lvs5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:54:56 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:56:04 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:56:25 <+icinga-wm> PROBLEM - IPMI Sensor Status on ganeti5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
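
For reference, the PSU state these checks alert on can also be read directly on an affected host over IPMI. A minimal sketch, assuming ipmitool is available on the host (the Icinga check reads the same sensors):

# Sketch: inspect PSU sensors and recent events on an affected host, e.g. cp5015
sudo ipmitool sdr type "Power Supply"   # current power-supply sensor states
sudo ipmitool sel elist | tail -n 20    # recent System Event Log entries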

Event Timeline

RhinosF1 triaged this task as High priority. (Jul 3 2021, 2:57 PM)

Thanks for the ping, this seems to be a single rack problem - https://netbox.wikimedia.org/dcim/racks/77/ (rack 603)

19:17  <elukey> XioNoX: around?
19:20  <XioNoX> elukey: what's up?
19:21  <XioNoX> looks like mgmt router is down so mgmt network is unreachable
19:21 +<icinga-wm> PROBLEM - DNS on ganeti5001.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.113 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
19:21  <elukey> it seems a PS redundancy failure for one rack, but I didn't get the asw1-eqsin down alert                                                          
19:22  <elukey> PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%                                                                              
19:22  <XioNoX> UTC: SATURDAY, 03 JUL 14:00 - SATURDAY, 03 JUL 22:00 for the Equinix maintenance
19:22  <elukey> ah right before the PS failure, lovely timing
19:22  <elukey> okok
19:22  <XioNoX> elukey: probably because we lost asw1 mgmt before it could alert
19:22  <XioNoX> checking the CR
19:23  <XioNoX> @cr3-eqsin> show system alarms
19:23  <XioNoX> 2021-07-03 14:19:26 UTC  Major  PEM 1 Not Powered
19:23  <elukey> I asked to Traffic to check later on, Valentin should be home later and will verify if we have to do anything or not for the PS failure (in theory no but not sure)
19:23  <elukey> can you translate? :D
19:24  <XioNoX> elukey: unless Equinix unplugs the wrong power feed we should be good                                                                              
19:24  <XioNoX> elukey: see "SERVICE IMPACTING MAINTENANCE Scheduled Customer Outage in 1 hour: Shutdown Maintenance of PDU, ACB and LV Switchboard at L6 A2 at SG3 [5-206136314890]"
19:24  <XioNoX> basically equinix is doing power maintenance on the power feeds
19:25  <elukey> ah okok now I get it, I didn't think about checking these things in the mailing list                                                               
19:25  <elukey> going to update https://phabricator.wikimedia.org/T286113
19:25  <XioNoX> after they're done we will need to check that everything came back up too

So nothing is on fire: Equinix is performing scheduled power maintenance, and we'll need to check later that everything comes back up fine.

Info about a similar use case (credits to Arzhel): https://phabricator.wikimedia.org/T206861#4664474

Things to decide:

  1. Do we need to depool eqsin?
  2. Do we need to contact Equinix for any follow up?

Change 703031 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Depool eqsin

https://gerrit.wikimedia.org/r/703031

> Thanks for the ping, this seems to be a single rack problem - https://netbox.wikimedia.org/dcim/racks/77/ (rack 603)

It seems that rack 604 is also affected, so both eqsin racks. I created a code change to depool eqsin as a precautionary measure, but I'd need the Traffic team's approval first.
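
For context, the depool itself (change 703031) is typically a one-line state flip in the gdnsd admin_state file in operations/dns. A sketch of what such a change might look like, assuming the usual geoip map name (not confirmed from the change itself):

# operations/dns: admin_state (sketch; map name is an assumption)
geoip/generic-map/eqsin => DOWN

Repooling later is the reverse edit (removing the line or setting the state back to UP).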

Mentioned in SAL (#wikimedia-operations) [2021-07-03T17:46:19Z] <elukey> depool eqsin due to loss of power redundancy (equinix maintenance) - T286113

Change 703031 merged by Elukey:

[operations/dns@master] Depool eqsin

https://gerrit.wikimedia.org/r/703031

Arzhel and I decided to depool eqsin. The maintenance window behind the PS redundancy failure seems to be:

UTC:	SATURDAY, 03 JUL 14:00 - SATURDAY, 03 JUL 22:00

Next steps:

  • check varnish traffic on the other DCs; we should be ok but better to double check (see the query sketch after this list)
  • wait for the PS redundancy issue to be fixed, then check that all PSUs are back up, etc.
  • read https://phabricator.wikimedia.org/T206861#4664474 to avoid the same mistake again :D
  • repool eqsin
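
For the first item, a rough way to eyeball whether the other sites absorbed the eqsin traffic is a per-site request-rate query against Prometheus. A sketch only, since both the metric name and the endpoint here are assumptions (the traffic Grafana dashboards are the usual place to look):

# Sketch: per-site cache request rate (metric and endpoint names are assumptions)
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
  --data-urlencode 'query=sum by (site) (rate(varnish_requests_total[5m]))'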

The problem seems to be fixed; just to be sure, I checked the PEM status on cr3-eqsin and asw1-eqsin:

elukey@cr3-eqsin> show chassis environment pem 
PEM 0 status:
  State                      Online
  Airflow                    Front to Back                           
  Temperature                OK   50 degrees C / 122 degrees F       
  Temperature                OK   52 degrees C / 125 degrees F       
  Firmware version           00.06                                   
  Fan Sensor                 17040 RPM                               
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       8             88       13     
PEM 1 status:
  State                      Online
  Airflow                    Front to Back                           
  Temperature                OK   49 degrees C / 120 degrees F       
  Temperature                OK   51 degrees C / 123 degrees F       
  Firmware version           00.06                                   
  Fan Sensor                 15510 RPM                               
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       8             88       13 

elukey@asw1-eqsin> show chassis environment pem | match State 
  State                      Online
  State                      Online
  State                      Online
  State                      Online

I also connected to cp5002's mgmt serial console and checked racadm getsel:

-------------------------------------------------------------------------------
Record:      27
Date/Time:   07/03/2021 15:10:18
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   07/03/2021 22:05:38
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   07/03/2021 22:05:43
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------

All good, +1 to repool in my opinion.

Only nit: cp5006's mgmt is still not reachable; we should follow up.
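
For that follow-up, a quick reachability check against cp5006's iDRAC might look like the sketch below (mgmt FQDN pattern assumed from the other eqsin hosts in this task, and the root user is an assumption; once it answers, the same racadm getsel check as above applies):

# Sketch: verify cp5006's mgmt interface is back (FQDN and user are assumptions)
ping -c 3 cp5006.mgmt.eqsin.wmnet
ssh root@cp5006.mgmt.eqsin.wmnet racadm getsel | tail -n 20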

Mentioned in SAL (#wikimedia-operations) [2021-07-04T08:02:40Z] <elukey> repool eqsin after equinix maintenance - T286113

elukey claimed this task.