
cr3-eqsin disk 1 failure
Closed, ResolvedPublic

Description

Today at ~11:22 UTC, the primary disk on the new cr3-eqsin failed, causing the router to reboot onto the backup disk, which didn't have any configuration.

Everything failed over cleanly to cr2-eqsin.

It's currently only reachable over console.

root> show system alarms 
2 alarms currently active
Alarm time               Class  Description
2020-07-05 11:29:50 UTC  Minor  VMHost RE 0 Disk 1 Missing
2020-07-05 11:28:45 UTC  Minor  VMHost 0 Boot from alternate disk

We need to figure out:

  • where exactly the fault is (disk, bus, software, etc.) via JTAC
  • if we can/should bring it back online from the backup disk
  • vmhost snapshot - T257153

Next steps:

  • Open JTAC ticket
  • Discuss whether we should load its config from the RANCID backup (see the sketch after this list)
    • I'd say yes; worst case it re-fails (cleanly) the same way, but in the meantime it brings us back to proper redundancy
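
Roughly, restoring the saved config would look like this on the Junos side (a sketch; exactly how we get the RANCID dump onto the box, or whether we paste it vs. load it from a file, is still to be decided):

root@cr3-eqsin> configure
Entering configuration mode

[edit]
root@cr3-eqsin# load override terminal
(paste the RANCID dump here, end with Ctrl-D)

[edit]
root@cr3-eqsin# show | compare

[edit]
root@cr3-eqsin# commit confirmed 5

commit confirmed gives an automatic rollback after 5 minutes in case the load takes management access down; a plain commit afterwards makes it permanent.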

Related Objects

Status: Resolved · Assigned: ayounsi

Event Timeline

ayounsi triaged this task as High priority. Jul 5 2020, 7:16 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Opened JTAC case 2020-0705-0136

Note that there is no mention of hard drives in Juniper's MX204 documentation, and they don't appear in the FRU list either, so fixing this might mean replacing the whole chassis.

The doc about that specific error is also not very verbose:

Check if there is missing or a defective disk. Insert healthy disk.

JTAC asked for some more troubleshooting commands; the interesting one is:

root> show vmhost hardware    
Compute cluster: rainier-re-cc

 Compute node: rainier-re-cn                   
  Hardware inventory:
   Item       Capacity             Part number               Serial number             Description
   DIMM 0     16384 MB             M4R0-AGS1BCRG-B553        [REDACTED]                DDR4 2133 MHz
   DIMM 1     16384 MB             M4R0-AGS1BCRG-B553        [REDACTED]                DDR4 2133 MHz
   Disk1      100 GB               SFSA100GM3AA4TO           [REDACTED]                SLIM SATA SSD

The same command on a healthy MX204 shows a Disk2.

Change 609571 had a related patch set uploaded (by Ayounsi; owner: Giuseppe Lavagetto):
[operations/dns@master] Depool eqsin

https://gerrit.wikimedia.org/r/609571

Change 609571 merged by Ayounsi:
[operations/dns@master] Depool eqsin

https://gerrit.wikimedia.org/r/609571
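
For context, this kind of depool is a small change in the operations/dns repo; the actual diff is in the Gerrit change above. Assuming the usual gdnsd admin_state mechanism and that the geoip map is still called generic-map (both assumptions on my part), it boils down to something like:

# operations/dns admin_state, illustrative excerpt (paths assumed)
# Force the eqsin datacenter DOWN in the geoip map so user traffic is steered to the other sites
geoip/generic-map/eqsin => DOWN

Once pushed to the authoritative DNS servers it takes effect as resolver caches expire.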

Mentioned in SAL (#wikimedia-operations) [2020-07-06T06:55:24Z] <XioNoX> depool eqsin for cr3-eqsin reboot/investigation - T257154

root@cr3-eqsin> request vmhost snapshot recovery 
warning: Existing data on the target may be lost
Proceed ? [yes,no] (no) yes 

warning: Proceeding with vmhost snapshot
Current root details,           Device sda, Label: jrootp_S, Partition: sda3
Snapshot admin context from current boot disk to target disk ...
Proceeding with snapshot on primary disk
Mounting device in preparation for snapshot...
Mounting /dev/disk/by-label/efi_P failed - return code - 32
Vmhost snapshot aborted
Software snapshot failed
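
The mount of /dev/disk/by-label/efi_P presumably fails because that label lives on the missing primary disk. For reference, the commands I'd use to double-check the disk/snapshot state from the CLI (standard vmhost commands; no output shown here since the box only has its backup disk at the moment):

root@cr3-eqsin> show vmhost snapshot
root@cr3-eqsin> show vmhost hardware | match Disk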

I also rebooted cr3-eqsin, without success, recorded the whole boot process output for JTAC, and asked them to escalate the case for better timezone coverage.

I brought the router back online by loading the Homer config as well as the RANCID one. Unfortunately this means we lost the MD5 passwords of the peering sessions (only a handful), as they are not present in Homer and are redacted in RANCID.
Edit: found all but one of them after some email archeology.
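
Re-adding a recovered session password is a one-liner per neighbor; roughly (group name and neighbor address below are placeholders):

[edit]
root@cr3-eqsin# set protocols bgp group IX neighbor 192.0.2.1 authentication-key "<md5-secret>"
root@cr3-eqsin# commit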

Change 609627 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Revert "Depool eqsin"

https://gerrit.wikimedia.org/r/609627

Change 609627 merged by Ayounsi:
[operations/dns@master] Revert "Depool eqsin"

https://gerrit.wikimedia.org/r/609627

JTAC says that we need to RMA the whole chassis. I added @RobH to the email thread to take care of shipping.

RobH mentioned this in Unknown Object (Task). Jul 6 2020, 7:42 PM
RobH added a subtask: Unknown Object (Task). Jul 6 2020, 8:02 PM
RobH added a project: ops-eqsin.
RobH moved this task from Backlog to Hardware Failure / Repair on the ops-eqsin board.

RMA shipment info:

UPS: 75095182
https://www.upspostsaleslogistics.com/cfw/tracking.screen

RMA # R200300857

Delivery is set for tomorrow via Equinix ticket 1-199245818233.

Delivery has occurred, but it seems SG3 requires a smart hands ticket to deliver to our rack (they disregard the deliver-to-cage checkbox on incoming shipments for SG3). I've entered a new smart hands request, 1-199299716748, for delivery of the replacement parts to our cage, and noted that they need to KEEP the shipping materials for our reuse when sending back the defective part.

IRC discussion with Arzhel resulted in scheduling this work for 2020-07-16 @ 21:30 Singapore / 06:30 Pacific / 15:30 CET.

I've emailed Jin (Arzhel CC'd) to confirm this scheduling.

Change 613143 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool eqsin for router replacement

https://gerrit.wikimedia.org/r/613143

Change 613143 merged by Ayounsi:
[operations/dns@master] Depool eqsin for router replacement

https://gerrit.wikimedia.org/r/613143

Mentioned in SAL (#wikimedia-operations) [2020-07-16T13:23:27Z] <XioNoX> depool eqsin for cr3 replacement - T257154

Mentioned in SAL (#wikimedia-operations) [2020-07-16T13:30:08Z] <XioNoX> deactivate BGP groups IX/Transit/PyBal on cr3-eqsin - T257154
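
Deactivating those groups is plain Junos config; roughly (the exact group names as configured on the box may differ):

[edit]
root@cr3-eqsin# deactivate protocols bgp group IX
root@cr3-eqsin# deactivate protocols bgp group Transit
root@cr3-eqsin# deactivate protocols bgp group PyBal
root@cr3-eqsin# commit comment "drain cr3-eqsin before replacement - T257154"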

Mentioned in SAL (#wikimedia-operations) [2020-07-16T14:09:30Z] <XioNoX> upgrade junos on cr3-eqsin - T257154
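
On the MX204 (vmhost-based RE) the upgrade goes through the vmhost variant of the install command; a sketch with a placeholder image name:

root@cr3-eqsin> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-<version>.tgz
root@cr3-eqsin> request vmhost reboot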

Router is back online.
One weird thing: after I rebooted the router following the upgrade, it came back up with the same issue as described here (no primary disk).
I asked Jin to do a hard power down (unplugging the power cords after shutting it down). Everything came back up nicely afterwards.

Everything here is done!

RobH changed the status of subtask Unknown Object (Task) from Open to Stalled. Jul 21 2020, 3:37 PM
RobH closed subtask Unknown Object (Task) as Resolved. Jul 23 2020, 2:06 PM