
cr2-eqsin: fan failure
Closed, Resolved · Public

Description

cr2-eqsin> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2020-11-09 11:21:27 UTC  Major  Fan Tray 2 Pair 0 Outer Fan stopped due to rotor failure
cr2-eqsin> show chassis environment 
Class Item                           Status     Measurement
[...]
Fans        Fan Tray 2 Fan 1               Failed
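A check like the one above can be scripted. The following is a minimal sketch, assuming the `show chassis environment` output has already been captured as text (for example over ssh) and that columns are separated by two or more spaces. The function name `failed_items` and the healthy "OK" row in the sample are illustrative assumptions, not part of the original output.

```python
import re
from typing import List

# Sample text: the "Failed" row is copied from the output above; the "OK" row
# is a made-up healthy fan added only for contrast.
SAMPLE = """\
Class Item                           Status     Measurement
Fans        Fan Tray 2 Fan 1               Failed
            Fan Tray 2 Fan 0               OK         Spinning at normal speed
"""


def failed_items(output: str) -> List[str]:
    """Return the item names whose Status column reads 'Failed'."""
    failed = []
    for line in output.splitlines():
        # Columns are assumed to be separated by runs of 2+ spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if "Failed" in fields:
            idx = fields.index("Failed")
            if idx >= 1:
                # The item name is the column just before the status.
                failed.append(fields[idx - 1])
    return failed


if __name__ == "__main__":
    print(failed_items(SAMPLE))
```

Taking the column immediately before the status works both for rows that carry the class name ("Fans") and for continuation rows that leave it blank.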

We need to initiate an RMA with Juniper and have remote hands do the swap.

The fan itself was not at fault: two replacement fans did not fix the issue, but those fans worked in our other router without problems. Juniper sent us a new router, so we need to swap out cr2-eqsin's chassis, migrate all the connections, and send back the defective router.

DC Ops checklist for router swap:

  • - new router received by Jin
  • - Jin's new quote approved in Coupa
  • - onsite work scheduled, coordinated with Arzhel
  • - 3 hour onsite window begins
  • - new router racked in temp home for power up and programming
  • - old router removed from the rack (wiped first by netops)
  • - new router moved to old router spot after old is removed
  • - netops updated so they can push new router into service while Jin is onsite
  • - 3 hour onsite window for router swap ends
  • - post work, Jin calls the Juniper Care return number to schedule pickup of the old router (they don't include return tags for Singapore, but have us call to schedule pickup with the RMA department of JCare).

NetOps Checklist for router swap:

  • - connect the new router temporarily for OS upgrade/software config
  • - depool site
  • - shutdown old router
  • - swap router
  • - power on new router
  • - repool site
  • - wipe old router - please do this before we unrack it!

Event Timeline

ayounsi triaged this task as High priority. Nov 9 2020, 11:29 AM
ayounsi created this task.
Restricted Application added a project: SRE. Nov 9 2020, 11:29 AM
Restricted Application added a subscriber: Aklapper.
wiki_willy assigned this task to RobH. Nov 16 2020, 5:18 PM
RobH mentioned this in Unknown Object (Task). Nov 16 2020, 5:36 PM
RobH added a comment. Nov 16 2020, 5:45 PM

I've opened up a Juniper case via their case management tool on https://casemanager.juniper.net

2020-1116-0428

They should email me with details and how to set up the tracking/shipment.

RobH added a comment. Dec 2 2020, 1:59 AM

Jin went ahead and swapped the old fan for the new one. All documentation on https://www.juniper.net/documentation/en_US/release-independent/junos/topics/topic-map/mx204-maintain-cooling-components.html#id-maintaining-the-mx204-fan-module seems to indicate that the new fan should simply be detected and spin up after the swap.

Opened a new case (2020-1201-0731); they are dispatching another fan to Jin directly.

RobH added a comment. Dec 8 2020, 6:32 PM

Jin's received the new fan and we're scheduling for him to go on-site to swap on 2020-12-10 @ 0200 UTC / 2020-12-10 @ 0900 Singapore / 2020-12-09 @ 1800 Pacific.

@ayounsi: Juniper requested that if this new fan also doesn't work, we test the new fan in our other router. Is this ok with you for us to try out?

Basically the work checklist for next visit is:

  • test new fan in the failed fan slot in cr2-eqsin. If that fixes it, skip down to "Ship back" section.
  • if fan doesn't work in cr2-eqsin, swap the 'new' fan into cr3-eqsin to see if it works there
    • if the fan works in cr3 but not cr2, we have a bad fan controller board (not fan module) in cr2-eqsin, which is a larger issue for swap
    • if the fan doesn't work in cr3 or cr2, then Juniper has sent us 2 bad replacement fans; this is unlikely.
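The checklist above is a two-step fault isolation: test the fan in the failed slot, then in the other router. A minimal sketch of that decision logic follows; the function name `diagnose` and the result strings are hypothetical labels chosen for illustration, not anything Juniper or the team defined.

```python
from typing import Optional


def diagnose(works_in_cr2: bool, works_in_other: Optional[bool] = None) -> str:
    """Map the two fan tests from the checklist to a conclusion.

    works_in_cr2:   did the new fan spin up in the failed slot of cr2-eqsin?
    works_in_other: did the same fan spin up in the other router?
                    Only needed when the first test fails.
    """
    if works_in_cr2:
        # First test passed: the original fan module was the problem.
        return "fixed: ship back the old fan"
    if works_in_other is None:
        # Second test has not been run yet.
        return "inconclusive: test the fan in the other router"
    if works_in_other:
        # Fan is good, slot is bad: points at the fan controller board.
        return "bad fan controller board in cr2-eqsin"
    # Fan fails everywhere: the replacement itself is suspect.
    return "bad replacement fan"


if __name__ == "__main__":
    print(diagnose(works_in_cr2=False, works_in_other=True))
```

Swapping a single component between a known-good and a suspect chassis separates a faulty part from a faulty slot with at most two tests.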

@ayounsi: Juniper requested that if this new fan also doesn't work, we test the new fan in our other router. Is this ok with you for us to try out?

Sure. Make sure cr2-eqsin is up before messing with cr3 :)

RobH added a comment. Dec 8 2020, 9:16 PM

@ayounsi: Juniper requested that if this new fan also doesn't work, we test the new fan in our other router. Is this ok with you for us to try out?

Sure. Make sure cr2-eqsin is up before messing with cr3 :)

Uh.. is it depooled or should we have depooled cr2-eqsin? All we planned to do was hot swap the fan.

I mean make sure there are no (other than the fan) alerts about cr2-eqsin before doing anything with cr3.
As we saw, hot swap is fine, no need to depool.

RobH added a comment. Dec 8 2020, 9:25 PM

Ahh understood!

Mentioned in SAL (#wikimedia-operations) [2020-12-10T00:26:20Z] <robh> cr2-eqsin bad fan being swapped via T267544

RobH added a comment. Dec 10 2020, 12:52 AM

Summary update:

  • Jin installed the second replacement fan from Juniper into cr2-eqsin; the red LED stayed red (didn't change to green), and my software check via ssh still showed the fan in a failed state.
  • Jin swapped the replacement fan into a fan bay on cr3-eqsin (after I advised him to do so) and it immediately cleared to green on cr3-eqsin.
  • I've emailed into the support case 2020-1201-0731 to determine next steps (either a new fan controller board if that is user serviceable, or a new mx204)
    • we'll have the new part dispatched to Jin directly as it reduces issues confirming delivery.
Volans added a subscriber: Volans. Dec 13 2020, 8:46 AM

FWIW we're getting one email every hour from rancid about this. Is there any quick way to prevent/disable them by any chance?

RobH added a subscriber: CDanis. Dec 14 2020, 4:04 PM

I am not sure about how to disable rancid alerts, perhaps either @ayounsi or @CDanis knows how?

Updating this task:

  • new parts have not been dispatched yet, I was off thursday-friday, and followed up on this just this AM (sent confirmation of shipping info for new chassis.)
  • juniper is going to send an entire new mx204
  • jin will let us know when it arrives to his address
RobH reassigned this task from RobH to ayounsi. Dec 17 2020, 6:12 PM

Jin has the new router, but we're going to wait until Arzhel returns in January to swap this out. I'm planning for the first or second week of January, and advised Juniper of such (so they won't expect the defective return until then.)

@ayounsi: When you return, please review this summary. The fan swaps did not fix cr2-eqsin, as its fan controller interface is at fault. Juniper has sent us a new chassis for the mx204, so we'll need to swap out the defective cr2-eqsin for the new one. As this is a larger scale swap, it was decided to wait for your return. I've requested Jin to standby, so we're just going to need your preferred date/time for this work.

FYI: Jin's regular time to show is 8-9am Singapore (5-6pm Pacific). I coordinate all work with him via google hangout messages, and he is quite responsive. You can schedule this work at any time however, just anything outside of normal Singapore working hours we will need to clear with Jin. I'm happy to coordinate this with him, or put you in direct contact, either works for me!

The Netops steps are:
1/ connect the new router temporarily for OS upgrade/software config
2/ depool site
3/ shutdown old router
4/ swap router
5/ power on new router
6/ repool site
7/ wipe old router

I expect all of it to last 3h max. Which could be split in 2 visits if it helps. It's also hard to do it without me being around.

But then there are also the DCops steps: inventory, returning the faulty device, etc... @RobH Do you think you need to be around for them or they can be written down?
If the latter I think January 7th at 7am UTC (3pm local) https://everytimezone.com/s/1923ae11 could be a good time.
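Since the work is coordinated across UTC, Singapore and US Pacific time, a quick conversion helps cross-check a proposed slot. This is a small sketch using fixed offsets rather than a time zone database: Singapore is UTC+8 year-round, and US Pacific is assumed to be on standard time (UTC-8) because the date falls in January. The helper name `local_times` is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, so the sketch runs without a tz database.
SGT = timezone(timedelta(hours=8), "SGT")    # Singapore, no DST
PST = timezone(timedelta(hours=-8), "PST")   # US Pacific, winter only (assumption)

# The slot proposed above: January 7th 2021, 7am UTC.
PROPOSED = datetime(2021, 1, 7, 7, 0, tzinfo=timezone.utc)


def local_times(when_utc: datetime) -> dict:
    """Render one UTC instant in each zone relevant to the visit."""
    fmt = "%Y-%m-%d %H:%M"
    return {
        "UTC": when_utc.strftime(fmt),
        "Singapore": when_utc.astimezone(SGT).strftime(fmt),
        "Pacific": when_utc.astimezone(PST).strftime(fmt),
    }


if __name__ == "__main__":
    for zone, stamp in local_times(PROPOSED).items():
        print(f"{zone:10s} {stamp}")
```

For the proposed slot this gives 15:00 in Singapore, matching the "3pm local" above, and 23:00 Pacific on the previous day.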

RobH updated the task description. Mon, Jan 4, 4:40 PM

Change 654742 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool eqsin for router replacement

https://gerrit.wikimedia.org/r/654742

Change 654742 merged by Ayounsi:
[operations/dns@master] Depool eqsin for router replacement

https://gerrit.wikimedia.org/r/654742

Mentioned in SAL (#wikimedia-operations) [2021-01-07T07:19:38Z] <XioNoX> depool eqsin for router replacement - T267544

Mentioned in SAL (#wikimedia-operations) [2021-01-07T08:14:07Z] <XioNoX> shutdown cr2-eqsin - T267544

Mentioned in SAL (#wikimedia-operations) [2021-01-07T08:57:52Z] <XioNoX> re-enable BGP on cr2-eqsin - T267544

Mentioned in SAL (#wikimedia-operations) [2021-01-07T09:17:21Z] <XioNoX> re-pool eqsin - T267544
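The four SAL entries above give a timeline for the maintenance. As a rough sketch, the time spent in each phase and overall can be computed from the logged timestamps; the `EVENTS` list simply copies those timestamps, and the helper names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps and messages copied from the SAL entries above.
EVENTS = [
    ("2021-01-07T07:19:38Z", "depool eqsin"),
    ("2021-01-07T08:14:07Z", "shutdown cr2-eqsin"),
    ("2021-01-07T08:57:52Z", "re-enable BGP on cr2-eqsin"),
    ("2021-01-07T09:17:21Z", "re-pool eqsin"),
]


def parse(ts: str) -> datetime:
    """Parse a SAL-style UTC timestamp such as 2021-01-07T07:19:38Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def phase_durations(events):
    """Yield (start label, end label, elapsed) for consecutive events."""
    for (t0, a), (t1, b) in zip(events, events[1:]):
        yield a, b, parse(t1) - parse(t0)


def total_duration(events) -> timedelta:
    """Elapsed time between the first and last logged event."""
    return parse(events[-1][0]) - parse(events[0][0])


if __name__ == "__main__":
    for start, end, elapsed in phase_durations(EVENTS):
        print(f"{start} -> {end}: {elapsed}")
    print("total:", total_duration(EVENTS))
```

From depool to re-pool this comes to 1:57:43, inside the 3h window estimated earlier in the task.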

ayounsi closed this task as Resolved. Thu, Jan 7, 9:18 AM

All done.