update RE-S-X6-64G-S in cr[12]-codfw
Closed, Resolved · Public · 0 Estimated Story Points

Description

This task tracks the installation/swap of four new RE-S-X6-64G-S modules in cr[12]-codfw (two per router).

This work will require coordination with the netops team for scheduling and execution.

  • Receive in and update procurement task T218794
  • Schedule work with @ayounsi and update this checklist with other steps.

Window steps:

  • Downtime alerting (Icinga/Librenms)
  • Ensure VRRP master is on the other node
  • Tune OSPF cost to drain transport links terminating on that device
  • Drain local BGP peers (graceful shutdown + deactivate)
  • Fail LVS over to the ones connected to the other router
  • Ensure RE is connected to serial console
  • From the doc, follow "Removing the Routing Engine" and "Installing the Routing Engine RE-S-X6-64G" for the backup routing engine
  • Warn on IRC that it's going to be bumpy
  • From the doc, follow "Verifying and Configuring the Upgraded Routing Engine as the Master". All FPCs reboot after this step.
  • Verify device is healthy (logs, OSPF/BGP sessions, alarms, alerting)
  • Do the 5 steps above for the other RE (now backup)
  • Rollback BGP/OSPF/LVS/VRRP changes
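The ordering in the window steps above (drain, do the hardware work, then undo the drain) can be sketched in Python as follows. This is only an illustration of the pattern, not tooling used for this task: `run_window`, the step names, and the `log` list are hypothetical, and a real version would run router and LVS commands where this one only records log lines. It assumes rollbacks should happen in the reverse order of the drain steps, and that they should still run if the maintenance step fails.

```python
from contextlib import ExitStack


def run_window(drain_steps, maintenance, log):
    """Apply drain steps in order, run maintenance, then undo the drain.

    drain_steps: list of step names (hypothetical placeholders).
    maintenance: callable doing the actual work (here: a stand-in).
    log: list that receives one line per action.
    """
    with ExitStack() as undo:
        for name in drain_steps:
            log.append(f"apply: {name}")
            # ExitStack runs callbacks last-in, first-out, so the most
            # recent drain step is rolled back first.
            undo.callback(log.append, f"rollback: {name}")
        maintenance()
    # Leaving the "with" block has already run every rollback,
    # including when maintenance() raised an exception.


DRAIN_STEPS = [
    "VRRP master on the other node",
    "OSPF cost raised on transport links",
    "local BGP peers drained",
    "LVS failed over to the other router",
]

if __name__ == "__main__":
    actions = []
    run_window(DRAIN_STEPS, lambda: actions.append("swap routing engine"), actions)
    print("\n".join(actions))
```

Registering each rollback at the moment its step is applied means a window that aborts halfway only undoes the steps that were actually taken.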

Event Timeline

RobH triaged this task as Medium priority. Jun 24 2019, 3:59 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper.
RobH added a parent task: Unknown Object (Task). Jun 24 2019, 3:59 PM

@ayounsi please pick a day and time sometime next week (Mon - Wed) and let me know.

Thanks

Scheduled for the 31st at 15:00UTC (1h total).

Mentioned in SAL (#wikimedia-operations) [2019-07-31T15:03:51Z] <XioNoX> power down re1:cr1-codfw (backup) - T226422

The new backup routing engine is not coming online.
Rolling back to the old one is not working either.
Opened JTAC Service Request ID: 2019-0731-0446.

First JTAC suggestion is to re-seat the SCB. We didn't do that today as the doc wasn't clear if it could be done with the router online.
JTAC is looking into the logs.

Also noticed the following while looking at the doc again today:

  1. Use the request chassis routing-engine master switch command to make the Routing Engine RE-S-X6-64G (RE1) the master Routing Engine. All FPCs reboot after this step.

So this means the RE failover will have production impact, and should be prepared for accordingly.

Mentioned in SAL (#wikimedia-operations) [2019-08-08T14:52:00Z] <XioNoX> continue cr1-codfw:re1 replacement - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-08T15:11:03Z] <XioNoX> commit synchronize on cr1-codfw - T226422

Return information for bad SCB

Change 531502 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for routers work

https://gerrit.wikimedia.org/r/531502

Change 531513 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Varnish: redirect eqsin/ulsfo text to eqiad

https://gerrit.wikimedia.org/r/531513

Change 531502 merged by Ayounsi:
[operations/dns@master] Depool codfw and eqsin for codfw routers work

https://gerrit.wikimedia.org/r/531502

Mentioned in SAL (#wikimedia-operations) [2019-08-21T16:46:26Z] <XioNoX> apply BGP graceful shutdown to cr1-codfw transits - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T16:51:12Z] <XioNoX> increase OSPF cost on ulsfo-codfw link - T226422

Change 531513 merged by Ayounsi:
[operations/puppet@production] Varnish: redirect eqsin/ulsfo text to eqiad

https://gerrit.wikimedia.org/r/531513

Mentioned in SAL (#wikimedia-operations) [2019-08-21T16:56:22Z] <XioNoX> Varnish: redirect eqsin/ulsfo text to eqiad - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:08:47Z] <XioNoX> disable BGP from cr1-codfw to lvs2001/2/3 - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:17:11Z] <XioNoX> failover master RE to RE1 on cr1-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:25:50Z] <XioNoX> shutdown RE0 on cr1-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:33:33Z] <XioNoX> failover master RE to RE0 on cr1-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:43:02Z] <XioNoX> restart both REs on cr1-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:53:23Z] <XioNoX> rollback: disable BGP from cr1-codfw to lvs2001/2/3 - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:55:03Z] <XioNoX> Rollback: increase OSPF cost on ulsfo-codfw link - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:56:04Z] <XioNoX> rollback: apply BGP graceful shutdown to cr1-codfw transits - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T18:04:32Z] <XioNoX> increase OSPF cost on cr2-codfw links - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T18:12:25Z] <XioNoX> deactivate transit links on cr2-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T18:17:53Z] <XioNoX> move VRRP master from cr2-codfw to cr1-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T18:19:26Z] <XioNoX> shutdown re1:cr2-codfw (backup) - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T18:32:23Z] <XioNoX> failover master RE to RE1 on cr2-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T18:37:03Z] <XioNoX> shutdown re0:cr2-codfw (backup) - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T19:14:13Z] <XioNoX> failover master RE to RE0 on cr2-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T19:16:25Z] <XioNoX> restart both REs on cr2-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T19:24:53Z] <XioNoX> rollback: move VRRP master from cr2-codfw to cr1-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T19:25:40Z] <XioNoX> rollback deactivate transit links on cr2-codfw - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T19:26:14Z] <XioNoX> rollback: increase OSPF cost on cr2-codfw links - T226422

Mentioned in SAL (#wikimedia-operations) [2019-08-21T19:31:41Z] <XioNoX> Rollback: Varnish: redirect eqsin/ulsfo text to eqiad - T226422

DONE!
Everything is healthy, very little alert noise, no service impact.
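For a post-window check, the SAL mentions above can be paired up to confirm that each drain change has a matching rollback entry. The Python sketch below is one possible way to do that and is not part of the original work: `pair_rollbacks` is a hypothetical helper, and it assumes a rollback entry repeats the original message with a "rollback" prefix (with or without a colon, any capitalisation), which is how the entries in this task are written. The sample lines are copied from the timeline above.

```python
import re
from datetime import datetime

# Matches the "[timestamp] <nick> message - Txxxxxx" tail of a SAL mention.
SAL_ENTRY = re.compile(r"\[(?P<ts>[^\]]+)\] <[^>]+> (?P<msg>.*?) - T\d+\s*$")
# Rollback entries in this log start with "rollback", colon optional.
ROLLBACK_PREFIX = re.compile(r"^rollback:?\s+", re.IGNORECASE)


def pair_rollbacks(lines):
    """Return (paired, still_applied, orphan_rollbacks).

    paired maps each change message to the time it stayed applied.
    still_applied lists messages with no later rollback entry.
    orphan_rollbacks lists rollbacks with no earlier matching change.
    """
    applied, paired, orphans = {}, {}, []
    for line in lines:
        m = SAL_ENTRY.search(line)
        if m is None:
            continue  # not a SAL mention
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
        msg = m["msg"]
        if ROLLBACK_PREFIX.match(msg):
            change = ROLLBACK_PREFIX.sub("", msg, count=1)
            if change in applied:
                paired[change] = ts - applied.pop(change)
            else:
                orphans.append(change)
        else:
            applied[msg] = ts
    return paired, sorted(applied), orphans


SAMPLE = """\
Mentioned in SAL (#wikimedia-operations) [2019-08-21T16:46:26Z] <XioNoX> apply BGP graceful shutdown to cr1-codfw transits - T226422
Mentioned in SAL (#wikimedia-operations) [2019-08-21T16:51:12Z] <XioNoX> increase OSPF cost on ulsfo-codfw link - T226422
Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:17:11Z] <XioNoX> failover master RE to RE1 on cr1-codfw - T226422
Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:55:03Z] <XioNoX> Rollback: increase OSPF cost on ulsfo-codfw link - T226422
Mentioned in SAL (#wikimedia-operations) [2019-08-21T17:56:04Z] <XioNoX> rollback: apply BGP graceful shutdown to cr1-codfw transits - T226422
""".splitlines()

if __name__ == "__main__":
    paired, still_applied, orphans = pair_rollbacks(SAMPLE)
    for change, duration in paired.items():
        print(f"{change}: applied for {duration}")
    print("no rollback entry:", still_applied)
```

Entries such as RE failovers and shutdowns end up in the "no rollback entry" list by design, since they are one-way steps rather than drain changes, so that list needs a human read rather than being treated as an error.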