update RE-S-X6-64G-S in cr[12]-codfw
Open, Normal · Public · 0 Story Points

Description

This task tracks the installation of four new RE-S-X6-64G-S modules, swapped in two per router, in cr[12]-codfw.

This work will require coordination with the netops team for scheduling and execution.

  • Receive in and update procurement task T218794
  • Schedule work with @ayounsi and update this checklist with other steps.

Window steps:

  • Downtime alerting (Icinga/LibreNMS)
  • Ensure VRRP master is on the other node
  • Tune OSPF cost to drain transport links terminating on that device
  • Drain local BGP peers (graceful shutdown + deactivate)
  • Fail LVS over to the ones connected to the other router
  • Ensure RE is connected to serial console
  • From the doc, do "Removing the Routing Engine" and "Installing the Routing Engine RE-S-X6-64G" for the backup routing engine
  • Warn on IRC that it's going to be bumpy
  • From the doc, do "Verifying and Configuring the Upgraded Routing Engine as the Master". All FPCs reboot after this step.
  • Verify device is healthy (logs, OSPF/BGP sessions, alarms, alerting)
  • Do the 5 steps above for the other RE (now backup)
  • Roll back BGP/OSPF/LVS/VRRP changes
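
The ordering of the window steps, including the repeat of the five per-RE steps for the second routing engine, can be sketched as a flat runbook. This is only an illustrative sketch: the names `PREPARE`, `PER_RE`, `ROLLBACK` and `build_runbook` are hypothetical, the step text is paraphrased from the checklist above, and nothing here talks to a router.

```python
# Hypothetical sketch: expand the window checklist into one ordered runbook.
# Step wording is paraphrased from the checklist; names are illustrative only.

PREPARE = [
    "Downtime alerting (Icinga/LibreNMS)",
    "Ensure VRRP master is on the other node",
    "Tune OSPF cost to drain transport links terminating on that device",
    "Drain local BGP peers (graceful shutdown + deactivate)",
    "Fail LVS over to the ones connected to the other router",
]

# The five steps that are done once per routing engine.
PER_RE = [
    "Ensure RE is connected to serial console",
    "Remove the old RE and install the RE-S-X6-64G",
    "Warn on IRC that it's going to be bumpy",
    "Make the upgraded RE the master (all FPCs reboot)",
    "Verify device is healthy (logs, OSPF/BGP sessions, alarms, alerting)",
]

ROLLBACK = ["Roll back BGP/OSPF/LVS/VRRP changes"]


def build_runbook(router):
    """Return the full ordered list of steps for one router."""
    steps = [f"{router}: {s}" for s in PREPARE]
    # First pass handles the backup RE; the second pass handles the
    # other RE, which has become the backup after the master switch.
    for role in ("backup RE", "other RE (now backup)"):
        steps += [f"{router} [{role}]: {s}" for s in PER_RE]
    steps += [f"{router}: {s}" for s in ROLLBACK]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(build_runbook("cr1-codfw"), 1):
        print(f"{number:2d}. {step}")
```

Keeping the per-RE steps in a single list makes it harder to skip one of them on the second pass, which is where the "5 steps above" wording in the checklist is easiest to misread.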

Related Objects

Status: Open
Assigned: ayounsi

Event Timeline

RobH triaged this task as Normal priority. (Jun 24 2019, 3:59 PM)
RobH created this task.
Restricted Application added a project: Operations. (Jun 24 2019, 3:59 PM)
Restricted Application added a subscriber: Aklapper.
RobH added a parent task: Unknown Object (Task). (Jun 24 2019, 3:59 PM)
ayounsi updated the task description. (Jun 25 2019, 7:04 AM)

@ayounsi please pick a day and time sometime next week (Mon - Wed) and let me know.

Thanks

Scheduled for the 31st at 15:00 UTC (1h total).

Papaul updated the task description. (Wed, Jul 31, 2:49 PM)

Mentioned in SAL (#wikimedia-operations) [2019-07-31T15:03:51Z] <XioNoX> power down re1:cr1-codfw (backup) - T226422

The new backup routing engine is not coming online.
Rolling back to the old one is not working either.
Opened JTAC Service Request ID: 2019-0731-0446.

First JTAC suggestion is to re-seat the SCB. We didn't do that today as the doc wasn't clear if it could be done with the router online.
JTAC is looking into the logs.

Also noticed the following while looking at the doc again today:

  1. Use the request chassis routing-engine master switch command to make the Routing Engine RE-S-X6-64G (RE1) the master Routing Engine. All FPCs reboot after this step.

So this means the RE failover will have production impact and should be prepared for accordingly.

Mentioned in SAL (#wikimedia-operations) [2019-08-08T14:52:00Z] <XioNoX> continue cr1-codfw:re1 replacement - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-08T15:11:03Z] <XioNoX> commit synchronize on cr1-codfw - T226422

Papaul added a comment. (Thu, Aug 8, 4:13 PM)

Return information for bad SCB

ayounsi updated the task description. (Thu, Aug 15, 8:58 PM)