15:21:47 <+icinga-wm> PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
15:23:25 <+icinga-wm> PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
15:23:28 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
15:23:33 <+icinga-wm> PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
15:23:49 <+icinga-wm> PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
15:24:05 <+icinga-wm> PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
15:25:03 <+icinga-wm> PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
15:25:19 <+icinga-wm> PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
15:41:45 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5015 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:42:59 <+icinga-wm> PROBLEM - IPMI Sensor Status on lvs5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:44:59 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5009 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:47:21 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:48:35 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:52:18 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5003 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:52:26 <+icinga-wm> PROBLEM - IPMI Sensor Status on ganeti5001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:52:51 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:53:26 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:53:51 <+icinga-wm> PROBLEM - IPMI Sensor Status on lvs5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:54:56 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5013 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:56:04 <+icinga-wm> PROBLEM - IPMI Sensor Status on cp5006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
15:56:25 <+icinga-wm> PROBLEM - IPMI Sensor Status on ganeti5002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Depool eqsin | operations/dns | master | +1 -0
Related Objects
- Mentioned Here
- T206861: Power incident in eqsin
Event Timeline
Thanks for the ping, this seems to be a single rack problem - https://netbox.wikimedia.org/dcim/racks/77/ (rack 603)
19:17 <elukey> XioNoX: around?
19:20 <XioNoX> elukey: what's up?
19:21 <XioNoX> looks like mgmt router is down so mgmt network is unreachable
19:21 +<icinga-wm> PROBLEM - DNS on ganeti5001.mgmt is CRITICAL: DNS CRITICAL - expected 0.0.0.0 but got 10.132.129.113 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
19:21 <elukey> it seems a PS redundancy failure for one rack, but I didn't get the asw1-eqsin down alert
19:22 <elukey> PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
19:22 <XioNoX> UTC: SATURDAY, 03 JUL 14:00 - SATURDAY, 03 JUL 22:00 for the Equinix maintenance
19:22 <elukey> ah right before the PS failure, lovely timing
19:22 <elukey> okok
19:22 <XioNoX> elukey: probably because we lost asw1 mgmt before it could alert
19:22 <XioNoX> checking the CR
19:23 <XioNoX> @cr3-eqsin> show system alarms
19:23 <XioNoX> 2021-07-03 14:19:26 UTC Major PEM 1 Not Powered
19:23 <elukey> I asked Traffic to check later on, Valentin should be home later and will verify if we have to do anything or not for the PS failure (in theory no but not sure)
19:23 <elukey> can you translate? :D
19:24 <XioNoX> elukey: unless Equinix unplugs the wrong power feed we should be good
19:24 <XioNoX> elukey: see "SERVICE IMPACTING MAINTENANCE Scheduled Customer Outage in 1 hour: Shutdown Maintenance of PDU, ACB and LV Switchboard at L6 A2 at SG3 [5-206136314890]"
19:24 <XioNoX> basically equinix is doing power maintenance on the power feeds
19:25 <elukey> ah okok now I get it, I didn't think about checking these things in the mailing list
19:25 <elukey> going to update https://phabricator.wikimedia.org/T286113
19:25 <XioNoX> after they're done we will need to check that everything came back up too
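For reference, a minimal sketch of the two Junos checks quoted above, runnable on cr2-eqsin/cr3-eqsin (the user/prompt is illustrative):

user@cr3-eqsin> show system alarms
user@cr3-eqsin> show chassis environment pem

The first lists the active alarms (e.g. "PEM 1 Not Powered"); the second shows the state of each power entry module, one per power feed.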
So nothing is on fire: Equinix is performing scheduled maintenance, and we'll need to check later on that everything comes back up fine.
Info about a similar use case (credits to Arzhel): https://phabricator.wikimedia.org/T206861#4664474
Things to decide:
- Do we need to depool eqsin?
- Do we need to contact Equinix for any follow up?
Change 703031 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/dns@master] Depool eqsin
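For context, depooling a full site in operations/dns is normally a one-line edit to the admin_state file that GeoDNS reads. A hedged sketch of what the +1 line in this change most likely looks like, based on the depool procedure documented on Wikitech (the exact map name and syntax are an assumption, not copied from the patch):

geoip/generic-map/eqsin => DOWN

Once merged and deployed to the authoritative DNS servers, user traffic is steered away from eqsin to the other edge sites.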
It seems that rack 604 is also affected, so both eqsin racks are impacted. I created a code change to depool eqsin as a precautionary measure, but I'd need the Traffic team's approval first.
Mentioned in SAL (#wikimedia-operations) [2021-07-03T17:46:19Z] <elukey> depool eqsin due to loss of power redundancy (equinix maintenance) - T286113
Arzhel and I decided to depool eqsin; the maintenance window behind the PS redundancy failure seems to be:
UTC: SATURDAY, 03 JUL 14:00 - SATURDAY, 03 JUL 22:00
Next steps:
- check Varnish traffic on the other DCs; we should be ok, but better to double check
- wait for the PS redundancy issue to be fixed, then check that all PSes are back up (see the sketch after this list)
- read https://phabricator.wikimedia.org/T206861#4664474 to avoid the same mistake again :D
- repool eqsin
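As a hedged sketch of the "check that all PSes are back up" step (not taken from the task): on the Dell hosts this can be done from the OS with ipmitool, or from the iDRAC with racadm as shown further below; cp5002 is just an example host.

# Power-supply sensor readings: both PSUs should report "Presence detected" and no failure
sudo ipmitool sdr type "Power Supply"
# Recent system event log entries: after the maintenance they should include
# "The power supplies are redundant"
sudo ipmitool sel elist | tail -n 20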
The problem seems to be fixed, and just to be sure:
elukey@cr3-eqsin> show chassis environment pem
PEM 0 status:
  State                      Online
  Airflow                    Front to Back
  Temperature                OK      50 degrees C / 122 degrees F
  Temperature                OK      52 degrees C / 125 degrees F
  Firmware version           00.06
  Fan Sensor                 17040 RPM
  DC Output                  Voltage(V)  Current(A)  Power(W)  Load(%)
                             11.00       8           88        13
PEM 1 status:
  State                      Online
  Airflow                    Front to Back
  Temperature                OK      49 degrees C / 120 degrees F
  Temperature                OK      51 degrees C / 123 degrees F
  Firmware version           00.06
  Fan Sensor                 15510 RPM
  DC Output                  Voltage(V)  Current(A)  Power(W)  Load(%)
                             11.00       8           88        13

elukey@asw1-eqsin> show chassis environment pem | match State
  State                      Online
  State                      Online
  State                      Online
  State                      Online
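In other words, both power entry modules on cr3-eqsin (PEM 0 and PEM 1) are back Online, and all four PEMs listed on asw1-eqsin report Online as well, so both power feeds are restored on the network gear.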
Also connected to cp5002's mgmt serial and checked racadm getsel:
-------------------------------------------------------------------------------
Record:      27
Date/Time:   07/03/2021 15:10:18
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   07/03/2021 22:05:38
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   07/03/2021 22:05:43
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
All good, +1 to repool in my opinion.
Mentioned in SAL (#wikimedia-operations) [2021-07-04T08:02:40Z] <elukey> repool eqsin after equinix maintenance - T286113