
Power incident in eqsin
Closed, ResolvedPublic

Description

We lost one of the two power feeds in eqsin at 14:16 UTC today.
No user impact (everything critical has redundant power).
The site was depooled at 14:23 UTC to be on the safe side: https://gerrit.wikimedia.org/r/c/operations/dns/+/466890
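For context, the depool is a DNS-level change: geo-routed traffic for eqsin gets answered with other sites' addresses. gdnsd (the authoritative DNS daemon used here) supports marking a datacenter down via its admin_state file; a hypothetical sketch, where the map name is an assumption:

```
# gdnsd admin_state: force the eqsin datacenter DOWN in the geoip map,
# so resolvers are answered with other sites' addresses instead.
geoip/generic-map/eqsin => DOWN
```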

Notification about this scheduled outage was sent on Sept 7th, and then again 1h30min before the outage.
Maintenance window is:
UTC: FRIDAY, 12 OCT 14:00 - SATURDAY, 13 OCT 00:00

We should check the maint-announce emails before repooling, to be sure power is back to normal.

Event Timeline

ayounsi triaged this task as Normal priority.Oct 12 2018, 2:32 PM
ayounsi created this task.
Restricted Application added a project: Operations. · View Herald TranscriptOct 12 2018, 2:32 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ayounsi updated the task description. (Show Details)Oct 12 2018, 2:32 PM

Mentioned in SAL (#wikimedia-operations) [2018-10-14T16:34:03Z] <volans> forcing a puppet run on all eqsin hosts with batch 1 to clear most of the alarms - T206861
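The batch-1 puppet run above limits blast radius by touching one host at a time. A rough Python sketch of that batching pattern (the runner is a stand-in for illustration, not cumin's actual API):

```python
from typing import Callable, Iterable, List

def run_batched(hosts: Iterable[str], action: Callable[[str], None],
                batch: int = 1) -> List[List[str]]:
    """Run `action` over hosts in fixed-size batches; return the batches executed."""
    hosts = list(hosts)
    batches = [hosts[i:i + batch] for i in range(0, len(hosts), batch)]
    for group in batches:
        for host in group:
            action(host)  # e.g. trigger a puppet agent run over SSH
    return batches

# Stand-in action: just record which hosts were touched.
touched = []
batches = run_batched(
    ["cp5007.eqsin.wmnet", "cp5009.eqsin.wmnet", "cp5011.eqsin.wmnet"],
    touched.append, batch=1)
print(batches)
```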

faidon added a subscriber: faidon.Oct 14 2018, 4:34 PM

Update: a few hours later power seemingly came back, so at 2018-10-13 03:07 UTC @BBlack repooled eqsin (logged in SAL).

Unfortunately, power never came back to cr1-eqsin's PEM 0, asw1-eqsin's PEM 0, and the sole feed of mr1-eqsin (possibly among others?), although @BBlack remembers seeing both feeds up on servers via IPMI.

Today (2018-10-14) at ~14:15 UTC the second power feed went down, also on scheduled maintenance (starting at 14:00 UTC). That took down -at least- cr1-eqsin's PEM 1, and since PEM 0 was already down, cr1-eqsin went down entirely and the whole site went dark. This paged, and multiple people responded. At 14:16 @BBlack depooled the site. This resulted in an outage for most of APAC for 10 minutes (non-obeying DNS resolvers aside).

At 14:38 I contacted Equinix's NOC to report a double power failure and got an autoresponse with an assigned request ID, 9-170629874152. At 16:02 I got a response from the service desk for a $0.00 order, 1-170629192166, and another email 5 minutes later indicated that I would "be contacted by a technician in 30 to 45 minutes".

At 16:25 we started getting recovery alerts/pages. Preliminary investigation shows that cr1-eqsin's PEM 0 is up, as is mr1-eqsin. cr1-eqsin's PEM 1 is still down, but that is expected, as we are still in that power feed's maintenance window.

The site is still depooled and will remain so until a) said maintenance window is over, and b) we do a more thorough investigation of the root causes.

Volans added a subscriber: Volans.Oct 14 2018, 4:37 PM

Once service recovered, we found that these hosts had only ~5 minutes of uptime:

dns5001.wikimedia.org
lvs5001.eqsin.wmnet
bast5001.wikimedia.org
cp5011.eqsin.wmnet
cp5009.eqsin.wmnet
cp5007.eqsin.wmnet

Looking at https://netbox.wikimedia.org/dcim/racks/77/ , together with the network devices, this seems to match the top half of the rack losing power on both feeds.
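One quick way to flag hosts that rebooted during the incident is to compare their uptime against the event window. A minimal sketch; the helper and the sample uptime values are hypothetical, chosen to match the ~5-minute uptimes reported above:

```python
def rebooted_recently(uptime_seconds: float, threshold_seconds: float = 600) -> bool:
    """Return True if the host's uptime suggests a reboot within the last ~10 minutes."""
    return uptime_seconds < threshold_seconds

# Hypothetical uptimes (seconds), as they might be collected via SSH/cumin.
uptimes = {
    "dns5001.wikimedia.org": 300,
    "lvs5001.eqsin.wmnet": 310,
    "cp5011.eqsin.wmnet": 290,
    "cp5005.eqsin.wmnet": 864000,  # unaffected host, ~10 days of uptime
}

affected = sorted(h for h, up in uptimes.items() if rebooted_recently(up))
print(affected)
```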

Response from Equinix:

With regards to this Trouble ticket, we went onsite and observed the following,
R0604 A Feed is still on live and all equipment are still powered up
R0603- A Feed in-rack breaker is being tripped and we have re-set the breaker and all equipment have been powered up.
Kindly verify at your end. Thank you.

Verified, and that explains everything but these two points:

  • How come the bottom half of R0603 remained online, as reported by @Volans?
  • Who is monitoring those PDUs? I believe we own them, but I don't think we are monitoring them. Is Equinix?
Volans added a subscriber: MoritzMuehlenhoff.EditedOct 14 2018, 5:20 PM

On bast5001 and dns5001, ferm failed to start at reboot due to a failed DNS resolution. Subsequent puppet runs didn't restart it; I had to start it manually. The hosts ran for 55 minutes without ferm rules applied. CC @MoritzMuehlenhoff
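ferm failing at boot because DNS wasn't resolvable yet is a classic unit-ordering problem. One possible mitigation is a systemd drop-in that delays ferm until the network and name resolution are up; a sketch only, assuming a systemd-managed ferm unit (whether this fits the actual ferm packaging here is an assumption):

```ini
# /etc/systemd/system/ferm.service.d/ordering.conf (hypothetical drop-in)
[Unit]
# Wait for the network to be configured and name resolution to be available
# before starting ferm, since its rules may reference hostnames.
After=network-online.target nss-lookup.target
Wants=network-online.target
```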

Volans added a comment.EditedOct 14 2018, 5:50 PM

Current status recap:

  • Maintenance on one power feed is still ongoing; all servers are reported up and running, with no Icinga alarms other than the loss of power redundancy.
  • cr1-eqsin: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms
  • asw1-eqsin: JNX_ALARMS WARNING - 0 red alarms, 2 yellow alarms
  • The RIPE Atlas anchor is still down, as it's attached to the power feed that is still under maintenance and doesn't have dual power.
  • I've opened 2 subtasks for things to follow up that I had to manually fix.

It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.

faidon renamed this task from 1 power feed down in eqsin to Power incident in eqsin.Oct 14 2018, 7:57 PM

It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.

That's great! Note that the maintenance window was listed as being:

UTC: SUNDAY, 14 OCT 14:00 - MONDAY, 15 OCT 00:00

...so that's still 4 hours away. It's possible they're done with all of it early, but we should wait regardless.
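Since the window was announced in UTC, computing how much of it remains is straightforward. A small sketch, with the window end taken from the notice above and the "now" value roughly matching when this comment was written:

```python
from datetime import datetime, timezone

# Maintenance window end from the notice: MONDAY, 15 OCT 00:00 UTC
window_end = datetime(2018, 10, 15, 0, 0, tzinfo=timezone.utc)

def hours_remaining(now: datetime) -> float:
    """Hours left until the maintenance window closes (negative if already over)."""
    return (window_end - now).total_seconds() / 3600

# At ~20:00 UTC on Oct 14, four hours of the window remain.
print(hours_remaining(datetime(2018, 10, 14, 20, 0, tzinfo=timezone.utc)))
```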

Confirmed that all the network devices are back to a healthy state. We also received a completion notice, so it should be safe to repool the site.

  • How come the bottom half of R0603 remained online, as reported by @Volans?

My guess is that the top and bottom half of the PDUs are on different circuit breakers (similar to https://www.sjmfg.com.sg/SJviet/vietnamres/12C13%204C19%2032A(1).jpg )

  • Who is monitoring those PDUs? I believe we own them, but I don't think we are monitoring them. Is Equinix?

As far as I know, the PDUs are non-manageable (cf. http://www.sjmfg.com.sg/product_thrupower.html )

Change 467289 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Repool eqsin after power maintenance

https://gerrit.wikimedia.org/r/467289

Change 467289 merged by Ayounsi:
[operations/dns@master] Repool eqsin after power maintenance

https://gerrit.wikimedia.org/r/467289

Mentioned in SAL (#wikimedia-operations) [2018-10-15T08:49:40Z] <XioNoX> repool eqsin - T206861

ema moved this task from Triage to Hardware on the Traffic board.Oct 22 2018, 8:41 AM
greg added a subscriber: greg.Nov 27 2018, 7:23 PM

Just checking: this task is in the "active situation" column of the Wikimedia-Incident project and has been open for a while. I see there are sub-tasks that look like follow-ups. Should the active situation task (this one) be closed so it's no longer listed in that column? (it would make my life easier as I review that column on my phabricator home dashboard ;) )

Seems reasonable to close this; the event itself is long over. There are still risks of a follow-up event, but those will go away as we close out the actionables. Maybe add the incident tag and move to the follow-up column for T206951? (the other is already there)

greg closed this task as Resolved.Nov 27 2018, 7:29 PM
greg assigned this task to ayounsi.

Done, thanks!