asw-c4-eqiad hardware fault?
Closed, Resolved, Public


Today (2015-03-24) at ~09:30 UTC, we lost asw-c4. At 09:47 I hit "enter" on the console server and the switch emitted "Rebooting...", rebooted itself and ultimately came back online.

This is exactly the same symptom on the exact same switch as Incident-20141130-eqiad-C4, so this starts to look like too much of a coincidence and possibly a hardware fault.

We should investigate further and possibly replace the switch.

Event Timeline

faidon raised the priority of this task to High.
faidon updated the task description.
faidon added a subscriber: faidon.
faidon set Security to None.

@Cmjohnson, how much time do you think it would take you to unrack C4, rack a replacement and move all the connections?

The steps for this are:

What's surprisingly missing from the list is that we also need to make sure the replacement switch runs the same JunOS version beforehand, right before we reset it to factory defaults.

(in any case, suffice to say, this needs further planning)

The racking portion will only take 15 minutes; that covers removing the old switch and putting the new one in its place. The current JunOS version on asw-c5 is JUNOS 11.4R6.5. I checked the Juniper support page and can download the correct version if needed and install it prior to swapping the switch. I will update the task with the replacement switch's OS version.

It is worth noting that the current JunOS versions are 12.3.

I will switch the hadoop active namenode from analytics1001 to analytics1002 (which is in another rack) before this replacement action is taken.

It may have been a typo, but @Cmjohnson reported the version number for asw-c5 instead of c4, so I have confirmed that every switch in that row is running 11.4R6.5.

Restoring the backup switch to factory config.

The current OS of the backup switch is:
Hostname: asw-a-eqiad
Model: ex4200-48t
JUNOS Base OS boot [10.4R3.4]
JUNOS Base OS Software Suite [10.4R3.4]
JUNOS Kernel Software Suite [10.4R3.4]
JUNOS Crypto Software Suite [10.4R3.4]
JUNOS Online Documentation [10.4R3.4]
JUNOS Enterprise Software Suite [10.4R3.4]
JUNOS Packet Forwarding Engine Enterprise Software Suite [10.4R3.4]
JUNOS Routing Software Suite [10.4R3.4]
JUNOS Web Management [10.4R3.4]
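A quick way to sanity-check output like the above before the swap is to parse it and compare it against the version the stack runs. The following is only a rough sketch: the function names `parse_show_version` and `review` are made up for illustration, and treating a leftover hostname as a sign that the config was not zeroed is a heuristic, not something Junos guarantees.

```python
import re

# Output pasted from the backup switch, as reported in this task.
SHOW_VERSION = """\
Hostname: asw-a-eqiad
Model: ex4200-48t
JUNOS Base OS boot [10.4R3.4]
JUNOS Base OS Software Suite [10.4R3.4]
JUNOS Kernel Software Suite [10.4R3.4]
JUNOS Crypto Software Suite [10.4R3.4]
JUNOS Online Documentation [10.4R3.4]
JUNOS Enterprise Software Suite [10.4R3.4]
JUNOS Packet Forwarding Engine Enterprise Software Suite [10.4R3.4]
JUNOS Routing Software Suite [10.4R3.4]
JUNOS Web Management [10.4R3.4]
"""


def parse_show_version(text):
    """Extract hostname, model and per-package versions from the pasted text."""
    info = {"hostname": None, "model": None, "packages": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Hostname:"):
            info["hostname"] = line.split(":", 1)[1].strip()
        elif line.startswith("Model:"):
            info["model"] = line.split(":", 1)[1].strip()
        else:
            # Package lines look like "JUNOS <name> [<version>]".
            m = re.match(r"^(JUNOS .+?) \[([^\]]+)\]$", line)
            if m:
                info["packages"][m.group(1)] = m.group(2)
    return info


def review(info, target_version):
    """Return a list of human-readable concerns (empty list means none found)."""
    concerns = []
    if info["hostname"]:
        # Heuristic only: a configured hostname hints the switch was not zeroed.
        concerns.append("hostname still set: %s" % info["hostname"])
    versions = set(info["packages"].values())
    if versions != {target_version}:
        concerns.append(
            "packages at %s, expected %s" % (sorted(versions), target_version)
        )
    return concerns


if __name__ == "__main__":
    for concern in review(parse_show_version(SHOW_VERSION), "11.4R6.5"):
        print(concern)
```

For the output above, this flags both the leftover hostname and the version mismatch against 11.4R6.5.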

@Cmjohnson, can you plug the switch into a console port somewhere? Also, if you're absolutely sure it's zeroed (the above still has a "hostname", so it doesn't look like it is!), plug in the ethernet management port somewhere as well.

Other than that, I've pinged affected folks and asked for the best window for a downtime.

Services definitely affected would be: EventLogging, Phabricator, Graphite, Logstash, udp2log, stat1003, webperf, releases, labsdb1006/7, osmium. Hosts under clustered services that would be affected but not incur a downtime would be rdb1001, analytics1001, ganeti1001/2, rcs1001, lead.

We've agreed on a maintenance window for replacing the switch: Wednesday, May 6th, 13:00 UTC.

The current version of asw-c4-eqiad is 11.4R6.5, but the download on the Juniper site is 11.4R6.6.
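Since the version strings in this task are written both with and without a space ("11.4 R6.5" vs "11.4R6.5"), a small normalising comparison helps avoid mistaking two spellings for two releases, or the reverse. This is a sketch under the assumption that versions follow the simple `major.minor` + letter + `build.spin` shape seen here; `JunosVersion`, `parse_junos_version` and `same_release` are illustrative names, not part of any Juniper tooling.

```python
import re
from typing import NamedTuple


class JunosVersion(NamedTuple):
    major: int
    minor: int
    release_type: str  # e.g. "R" in 11.4R6.5
    build: int
    spin: int


def parse_junos_version(text):
    """Parse strings such as '11.4R6.5' or '11.4 R6.5' into a JunosVersion."""
    m = re.fullmatch(r"\s*(\d+)\.(\d+)\s*([A-Z])\s*(\d+)(?:\.(\d+))?\s*", text)
    if m is None:
        raise ValueError("unrecognised JunOS version: %r" % text)
    return JunosVersion(
        int(m.group(1)),
        int(m.group(2)),
        m.group(3),
        int(m.group(4)),
        int(m.group(5) or 0),
    )


def same_release(a, b):
    """True only when both strings denote exactly the same release."""
    return parse_junos_version(a) == parse_junos_version(b)


if __name__ == "__main__":
    print(same_release("11.4 R6.5", "11.4R6.5"))  # same release, different spelling
    print(same_release("11.4R6.5", "11.4R6.6"))   # stack version vs. site download
```

Only equality is checked; ordering across release types is deliberately left out, because the sketch makes no claim about how Juniper ranks them.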

11.4R6.5 for both EX4200 and EX4500 is now available on our server.

Upgrade has been completed

  • JUNOS 11.4R6.5 built 2012-11-28 20:02:31 UTC

S/N is:
Item      Version  Part number  Serial number  Description
Chassis                         BP0211500170   EX4200-48T

OK, I also ran "request virtual-chassis mode mixed" on the switch and rebooted it, so that it can handle being added to a mixed VC.

The steps for tomorrow would be:

  1. @Cmjohnson and myself ping each other on IRC at 13:00 UTC to coordinate the go/no-go
  2. Current C4 switch is powered-off and subsequently unplugged
  3. "set virtual-chassis member 4 serial-number BP0211500170" is executed & committed on asw-c-eqiad (the remaining stack)
  4. The new C4 switch gets connected to the rest of the stack using only one of the VCPs
  5. The new C4 switch is powered on
  6. The second VCP gets connected to the stack
  7. The ports are moved from the old switch to the new one.
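To avoid a typo in the serial number during the window, the command from step 3 could be generated rather than typed by hand. A minimal sketch, assuming member IDs 0 to 9 are the valid range for this stack and that serial numbers are plain alphanumeric strings; `preprovision_command` is a hypothetical helper, and it only builds the string, it does not talk to the switch.

```python
def preprovision_command(member_id, serial_number):
    """Build the config statement that binds a VC member ID to a serial number."""
    # Assumption: the stack accepts member IDs 0 through 9.
    if not isinstance(member_id, int) or not 0 <= member_id <= 9:
        raise ValueError("member id out of range: %r" % (member_id,))
    serial = serial_number.strip().upper()
    # Assumption: serial numbers contain only letters and digits.
    if not serial.isalnum():
        raise ValueError("suspicious serial number: %r" % (serial_number,))
    return "set virtual-chassis member %d serial-number %s" % (member_id, serial)


if __name__ == "__main__":
    # Values taken from this task: member 4, serial of the replacement switch.
    print(preprovision_command(4, "BP0211500170"))
```

The printed line can then be pasted into configuration mode on asw-c-eqiad and committed as described in step 3.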

I'll perform step (2) to minimize downtime while Chris plugs/racks. I'll also be checking logs during each of those steps.

Switch replacement is done. End-to-end downtime for the affected hosts was ~20 minutes, starting at 13:01 UTC, well within our window.