
Power incident in eqsin
Closed, ResolvedPublic

Description

We lost one of the two power feeds in eqsin at 14:16 UTC today.
No user impact (everything critical has redundant power).
The site was depooled at 14:23 UTC to be on the safe side: https://gerrit.wikimedia.org/r/c/operations/dns/+/466890
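For context, the depool is a DNS-level change: geo-routed traffic for eqsin gets answered with other sites' addresses. gdnsd (the authoritative DNS daemon used here) supports marking a datacenter down via its admin_state file; a hypothetical sketch, where the map name is an assumption:

```
# gdnsd admin_state: force the eqsin datacenter DOWN in the geoip map,
# so resolvers are answered with other sites' addresses instead.
geoip/generic-map/eqsin => DOWN
```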

Notification about this scheduled outage was sent on Sept 7th, and then again 1h30min before the outage.
Maintenance window is:
UTC: FRIDAY, 12 OCT 14:00 - SATURDAY, 13 OCT 00:00

We should check the maint-announce emails before repooling, to be sure power is back to normal.

Event Timeline

ayounsi triaged this task as Normal priority.Oct 12 2018, 2:32 PM
ayounsi created this task.
Restricted Application added a project: Operations. · View Herald TranscriptOct 12 2018, 2:32 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ayounsi updated the task description. (Show Details)Oct 12 2018, 2:32 PM

Mentioned in SAL (#wikimedia-operations) [2018-10-14T16:34:03Z] <volans> forcing a puppet run on all eqsin hosts with batch 1 to clear most of the alarms - T206861
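The batch-1 puppet run above limits blast radius by touching one host at a time. A rough Python sketch of that batching pattern (the runner is a stand-in for illustration, not cumin's actual API):

```python
from typing import Callable, Iterable, List

def run_batched(hosts: Iterable[str], action: Callable[[str], None],
                batch: int = 1) -> List[List[str]]:
    """Run `action` over hosts in fixed-size batches; return the batches executed."""
    hosts = list(hosts)
    batches = [hosts[i:i + batch] for i in range(0, len(hosts), batch)]
    for group in batches:
        for host in group:
            action(host)  # e.g. trigger a puppet agent run over SSH
    return batches

# Stand-in action: just record which hosts were touched.
touched = []
batches = run_batched(
    ["cp5007.eqsin.wmnet", "cp5009.eqsin.wmnet", "cp5011.eqsin.wmnet"],
    touched.append, batch=1)
print(batches)
```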

faidon added a subscriber: faidon.Oct 14 2018, 4:34 PM

Update: a few hours later power seemingly came back, so at 2018-10-13 03:07 UTC @BBlack repooled eqsin (logged in SAL).

Unfortunately, power never came back to cr1-eqsin's PEM 0, asw1-eqsin's PEM 0, and the sole feed of mr1-eqsin (possibly among others?), although @BBlack remembers seeing both feeds up on servers via IPMI.

Today (2018-10-14) at ~14:15 UTC the second power feed went down, also on scheduled maintenance (starting at 14:00 UTC). That took down -at least- cr1-eqsin's PEM 1, and since PEM 0 was already down, cr1-eqsin went down entirely and the whole site went dark. This paged, and multiple people responded. At 14:16 @BBlack depooled the site. This resulted in an outage for most of APAC for 10 minutes (non-obeying DNS resolvers aside).

At 14:38 I contacted Equinix's NOC to report a double power failure and got an autoresponse with an assigned request ID, 9-170629874152. At 16:02 I got a response from the service desk for a $0.00 order, 1-170629192166, and another email 5 minutes later indicated that I would "be contacted by a technician in 30 to 45 minutes".

At 16:25 we started getting recovery alerts/pages. Preliminary investigation shows that cr1-eqsin's PEM 0 is up, as is mr1-eqsin. cr1-eqsin's PEM 1 is still down, but that is expected, as we are still in that power feed's maintenance window.

The site is still depooled and will remain so until a) said maintenance window is over, and b) we do a more thorough investigation of the root causes.

Volans added a subscriber: Volans.Oct 14 2018, 4:37 PM

Once service recovered, we found that these hosts had only ~5 minutes of uptime:

dns5001.wikimedia.org
lvs5001.eqsin.wmnet
bast5001.wikimedia.org
cp5011.eqsin.wmnet
cp5009.eqsin.wmnet
cp5007.eqsin.wmnet

Looking at https://netbox.wikimedia.org/dcim/racks/77/ , together with the network devices, this seems to match the top half of the rack losing power on both feeds.
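One quick way to flag hosts that rebooted during the incident is to compare their uptime against the event window. A minimal sketch; the helper and the sample uptime values are hypothetical, chosen to match the ~5-minute uptimes reported above:

```python
def rebooted_recently(uptime_seconds: float, threshold_seconds: float = 600) -> bool:
    """Return True if the host's uptime suggests a reboot within the last ~10 minutes."""
    return uptime_seconds < threshold_seconds

# Hypothetical uptimes (seconds), as they might be collected via SSH/cumin.
uptimes = {
    "dns5001.wikimedia.org": 300,
    "lvs5001.eqsin.wmnet": 310,
    "cp5011.eqsin.wmnet": 290,
    "cp5005.eqsin.wmnet": 864000,  # unaffected host, ~10 days of uptime
}

affected = sorted(h for h, up in uptimes.items() if rebooted_recently(up))
print(affected)
```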

Response from Equinix:

With regards to this Trouble ticket, we went onsite and observed the following,
R0604 A Feed is still on live and all equipment are still powered up
R0603- A Feed in-rack breaker is being tripped and we have re-set the breaker and all equipment have been powered up.
Kindly verify at your end. Thank you.

Verified, and that explains everything but these two points:

  • How come the bottom half of R0603 remained online, as reported by @Volans?
  • Who is monitoring those PDUs? I believe we own them, but I don't think we are monitoring them. Is Equinix?
Volans added a subscriber: MoritzMuehlenhoff.EditedOct 14 2018, 5:20 PM

On bast5001 and dns5001, ferm failed to start at reboot due to a failed DNS resolution. Subsequent puppet runs didn't restart it; I had to start it manually. The hosts ran for 55 minutes without ferm rules applied. CC @MoritzMuehlenhoff
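ferm failing at boot because DNS wasn't resolvable yet is a classic unit-ordering problem. One possible mitigation is a systemd drop-in that delays ferm until the network and name resolution are up; a sketch only, assuming a systemd-managed ferm unit (whether this fits the actual ferm packaging here is an assumption):

```ini
# /etc/systemd/system/ferm.service.d/ordering.conf (hypothetical drop-in)
[Unit]
# Wait for the network to be configured and name resolution to be available
# before starting ferm, since its rules may reference hostnames.
After=network-online.target nss-lookup.target
Wants=network-online.target
```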

Volans added a comment.EditedOct 14 2018, 5:50 PM

Current status recap:

  • Maintenance on one power feed is still ongoing; all servers are reported up and running, with no Icinga alarms other than the loss of power redundancy.
  • cr1-eqsin: JNX_ALARMS CRITICAL - 2 red alarms, 0 yellow alarms
  • asw1-eqsin: JNX_ALARMS WARNING - 0 red alarms, 2 yellow alarms
  • The RIPE Atlas anchor is still down, as it's attached to the power feed that is still under maintenance and doesn't have dual power.
  • I've opened 2 subtasks for things to follow up that I had to manually fix.

It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.

faidon renamed this task from 1 power feed down in eqsin to Power incident in eqsin.Oct 14 2018, 7:57 PM

It seems that the power has been restored, all the outstanding alarms have recovered and also the RIPE Atlas is back online.

That's great! Note that the maintenance window was listed as being:

UTC: SUNDAY, 14 OCT 14:00 - MONDAY, 15 OCT 00:00

...so that's still 4 hours away. It's possible they're done with all of it early, but we should wait regardless.
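Since the window was announced in UTC, computing how much of it remains is straightforward. A small sketch, with the window end taken from the notice above and the "now" value roughly matching when this comment was written:

```python
from datetime import datetime, timezone

# Maintenance window end from the notice: MONDAY, 15 OCT 00:00 UTC
window_end = datetime(2018, 10, 15, 0, 0, tzinfo=timezone.utc)

def hours_remaining(now: datetime) -> float:
    """Hours left until the maintenance window closes (negative if already over)."""
    return (window_end - now).total_seconds() / 3600

# At ~20:00 UTC on Oct 14, four hours of the window remain.
print(hours_remaining(datetime(2018, 10, 14, 20, 0, tzinfo=timezone.utc)))
```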

Confirmed that all the network devices are back to a healthy state. We also received a completion notice, so it should be safe to repool the site.

  • How come the bottom half of R0603 remained online, as reported by @Volans?

My guess is that the top and bottom half of the PDUs are on different circuit breakers (similar to https://www.sjmfg.com.sg/SJviet/vietnamres/12C13%204C19%2032A(1).jpg )

  • Who is monitoring those PDUs? I believe we own them, but I don't think we are monitoring them. Is Equinix?

As far as I know, the PDUs are non-manageable (cf. http://www.sjmfg.com.sg/product_thrupower.html )

Change 467289 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Repool eqsin after power maintenance

https://gerrit.wikimedia.org/r/467289

Change 467289 merged by Ayounsi:
[operations/dns@master] Repool eqsin after power maintenance

https://gerrit.wikimedia.org/r/467289

Mentioned in SAL (#wikimedia-operations) [2018-10-15T08:49:40Z] <XioNoX> repool eqsin - T206861

ema moved this task from Triage to Hardware on the Traffic board.Oct 22 2018, 8:41 AM
greg added a subscriber: greg.Nov 27 2018, 7:23 PM

Just checking: this task is in the "active situation" column of the Wikimedia-Incident project and has been open for a while. I see there are sub-tasks that look like follow-ups. Should the active situation task (this one) be closed so it's no longer listed in that column? (it would make my life easier as I review that column on my phabricator home dashboard ;) )

Seems reasonable to close this; the event itself is long over. There are still risks of a follow-up event, but those will go away as we close out the actionables. Maybe add the incident tag and move to the follow-up column for T206951? (the other is already there)

greg closed this task as Resolved.Nov 27 2018, 7:29 PM
greg assigned this task to ayounsi.

Done, thanks!