
(Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad
Closed, Resolved · Public · 0 Story Points

Description

This task tracks the installation (swap) of four new RE-S-X6-64G-S modules, two in each of cr1-eqiad and cr2-eqiad.

This work will require coordination with the netops team for scheduling and execution.

  • Receive in and update procurement task T223318
  • Schedule work with @ayounsi and update this checklist with other steps

Window steps:

  • Downtime alerting (Icinga/Librenms)
  • Ensure VRRP master is on the other node
  • Tune OSPF cost to drain transport links terminating on that device (cr1 only has backup links)
  • Drain local BGP peers (graceful shutdown + deactivate) (including frack) (in this case, CF tunnels)
  • Fail LVS over to the ones connected to the other router
  • Ensure RE is connected to serial console
  • From the doc, do "Removing the Routing Engine" and "Installing the Routing Engine RE-S-X6-64G" for the backup routing engine
  • Warn on IRC that it's going to be bumpy
  • From the doc, do "Verifying and Configuring the Upgraded Routing Engine as the Master". All FPCs reboot after this step.
  • Verify device is healthy (logs, OSPF/BGP sessions, alarms, alerting)
  • Do the 5 steps above for the other RE (now backup)
  • Rollback BGP/OSPF/LVS/VRRP changes
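The OSPF cost step in the list above relies on OSPF preferring the path with the lowest total cost: adding a large penalty to every link that ends on the router under maintenance makes other paths cheaper, so transit traffic moves away before the routing engine is touched. The following is a minimal sketch of that effect, not the procedure actually used. The topology, node names ("remote", "core") and link costs are hypothetical; only the +1000 penalty mirrors the value mentioned in the log for cr2-eqiad.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: cost}}; returns (total_cost, path) or None."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, link_cost in graph[node].items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


def drain(graph, router, penalty=1000):
    """Return a copy of graph with `penalty` added to every link to or from `router`."""
    return {
        a: {b: (c + penalty if router in (a, b) else c) for b, c in neighbors.items()}
        for a, neighbors in graph.items()
    }


# Hypothetical topology and costs, for illustration only.
topology = {
    "remote": {"cr1": 20, "cr2": 10},
    "cr1": {"remote": 20, "cr2": 5, "core": 5},
    "cr2": {"remote": 10, "cr1": 5, "core": 5},
    "core": {"cr1": 5, "cr2": 5},
}

before = shortest_path(topology, "remote", "core")
after = shortest_path(drain(topology, "cr2"), "remote", "core")
print(before)  # path goes through cr2
print(after)   # path goes through cr1 once cr2 is drained
```

Reverting the change is the same operation with the penalty removed, which is why the rollback step only needs the original costs to be restored.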

Event Timeline

RobH triaged this task as Normal priority. Jun 24 2019, 4:00 PM
RobH created this task.
Restricted Application added a project: Operations. Jun 24 2019, 4:00 PM
RobH added a parent task: Unknown Object (Task). Jun 24 2019, 4:00 PM
ayounsi updated the task description. Jun 25 2019, 8:14 AM
RobH moved this task from Backlog to Hardware deployments on the netops board. Jun 28 2019, 4:48 PM
RobH moved this task from Backlog to Racking Tasks on the ops-eqiad board.
wiki_willy renamed this task from update RE-S-X6-64G-S in cr[12]-eqiad to (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad. Jul 2 2019, 10:36 PM
RobH moved this task from Racking Tasks to Blocked on the ops-eqiad board. Jul 24 2019, 1:53 PM

@ayounsi:

This now has a need by date of September 30th (I assume you and @wiki_willy came up with that as he added it?)

This is basically blocked on netops telling DC-Ops when you want to schedule this work. Please advise.

ayounsi updated the task description. Aug 21 2019, 7:54 PM

Scheduled for Thursday Sept 5th, 8am PDT, 11am local time, 15:00 UTC. Duration: 3h.

Postponed to Thursday Sept 12th, 8am PDT, 11am local time, 15:00 UTC. Duration: 3h.
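As a quick check of the time conversions in the scheduling notes above, here is a small sketch using fixed UTC offsets. It assumes daylight time is in effect in September (UTC-7 on the US Pacific coast) and that "local time" means US Eastern daylight time (UTC-4); the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assuming daylight time is in effect in September.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7), "PDT")
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4), "EDT")

# Start of the postponed window, 15:00 UTC, and its planned 3h length.
window_start = datetime(2019, 9, 12, 15, 0, tzinfo=timezone.utc)
window_end = window_start + timedelta(hours=3)

pacific = window_start.astimezone(PACIFIC_DAYLIGHT).strftime("%H:%M %Z")
eastern = window_start.astimezone(EASTERN_DAYLIGHT).strftime("%H:%M %Z")

print(pacific, "/", eastern, "/", window_start.strftime("%H:%M UTC"))
print("planned end:", window_end.strftime("%H:%M UTC"))
```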

Mentioned in SAL (#wikimedia-operations) [2019-09-12T14:29:28Z] <XioNoX> ensure cr1-eqiad is vrrp backup for all groups - T226424

ayounsi updated the task description. Sep 12 2019, 2:30 PM

Change 536209 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Temporarily connect all eqiad pybal to cr2

https://gerrit.wikimedia.org/r/536209

Change 536209 merged by BBlack:
[operations/puppet@production] Temporarily connect all eqiad pybal to cr2

https://gerrit.wikimedia.org/r/536209

Mentioned in SAL (#wikimedia-operations) [2019-09-12T14:50:45Z] <bblack> restart pybal on lvs1016 to move BGP conn to cr2-eqiad - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536209 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T14:53:20Z] <bblack> restart pybal on lvs1013 to move BGP conn to cr2-eqiad - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536209 - T226424

ayounsi updated the task description. Sep 12 2019, 3:10 PM

Mentioned in SAL (#wikimedia-operations) [2019-09-12T15:13:37Z] <XioNoX> shutdown re1.cr1-eqiad - T226424

RobH removed a subscriber: RobH. Sep 12 2019, 3:19 PM

Mentioned in SAL (#wikimedia-operations) [2019-09-12T15:22:59Z] <XioNoX> failover master RE from RE0 to RE1 on cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T15:31:46Z] <XioNoX> shutdown re0.cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T15:39:31Z] <XioNoX> deactivate transit4/6 on cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T15:45:15Z] <XioNoX> failover master RE from RE1 to RE0 on cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:04:18Z] <XioNoX> reboot cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:15:59Z] <XioNoX> activate transit4/6 on cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:16:36Z] <XioNoX> activate CF tunnel on cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:19:53Z] <XioNoX> rollback force VRRP backup on cr1-eqiad - T226424

Change 536211 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Temporarily connect all eqiad pybal to cr1

https://gerrit.wikimedia.org/r/536211

Change 536211 merged by BBlack:
[operations/puppet@production] Temporarily connect all eqiad pybal to cr1

https://gerrit.wikimedia.org/r/536211

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:34:30Z] <bblack> lvs1016: restart pybal to move bgp session to cr1 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:35:42Z] <bblack> lvs1015: restart pybal to move bgp session to cr1 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:36:19Z] <bblack> lvs1014: restart pybal to move bgp session to cr1 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:36:37Z] <bblack> lvs1013: restart pybal to move bgp session to cr1 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:42:13Z] <XioNoX> switch VRRP master to cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:42:33Z] <XioNoX> er, switch VRRP master to cr1-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T16:49:03Z] <XioNoX> Deactivate IX/transit/private-peer v4/v6 BGP on cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:00:08Z] <XioNoX> +1000 metric to all transport to/from cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:04:11Z] <XioNoX> power off re1.cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:24:59Z] <XioNoX> failover cr2-eqiad master RE from RE0 to RE1 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:31:30Z] <XioNoX> power off re0.cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:40:30Z] <XioNoX> failover cr2-eqiad master RE from RE1 to RE0 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:43:15Z] <XioNoX> reboot cr2-eqiad - T226424

Change 536303 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "Temporarily connect all eqiad pybal to cr1"

https://gerrit.wikimedia.org/r/536303

Change 536304 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "Temporarily connect all eqiad pybal to cr2"

https://gerrit.wikimedia.org/r/536304

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:53:30Z] <XioNoX> re-enabled external BGP on cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:54:03Z] <XioNoX> revert OSPF priority change on cr2-eqiad - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T17:58:18Z] <XioNoX> revert VRRP priority change cr2-eqiad - T226424

Change 536303 merged by BBlack:
[operations/puppet@production] Revert "Temporarily connect all eqiad pybal to cr1"

https://gerrit.wikimedia.org/r/536303

Change 536304 merged by BBlack:
[operations/puppet@production] Revert "Temporarily connect all eqiad pybal to cr2"

https://gerrit.wikimedia.org/r/536304

Mentioned in SAL (#wikimedia-operations) [2019-09-12T18:03:53Z] <bblack> lvs1014: restart pybal to return BGP session to cr2 - T226424

Mentioned in SAL (#wikimedia-operations) [2019-09-12T18:04:02Z] <bblack> lvs1015: restart pybal to return BGP session to cr2 - T226424

ayounsi closed this task as Resolved. Sep 12 2019, 6:14 PM

Alright, everything here is done, and it went quite smoothly.
Some notes:

  • k8s1005 and k8s1006 only had v4/v6 sessions to cr1 and not to cr2, which caused this page:

PROBLEM - LVS HTTP IPv4 #page on sessionstore.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds

This was fixed quickly, and the service is not in prod yet.

  • VRRP failover triggered the following for eqiad/codfw/ulsfo. Not ideal, but not critical either:

PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo

  • The text LVS and its backup LVS (1016) are both on cr1; T180069 needs to be in prod urgently, as this is a SPOF
  • CF tunnel failover worked as expected
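
Two of the notes above describe the same kind of gap: hosts with BGP sessions to only one of the two routers, and a primary/backup pair behind the same router. Both can be checked before a maintenance window. The following is a rough sketch of such a check, not an existing tool. The function names, the "text-primary" host name and the second example service are hypothetical; only k8s1005, k8s1006, lvs1016 and the cr1/cr2 pair come from this task.

```python
from collections import defaultdict


def hosts_missing_router(sessions, routers):
    """sessions: iterable of (host, router). Return {host: routers it has no session to}."""
    seen = defaultdict(set)
    for host, router in sessions:
        seen[host].add(router)
    required = set(routers)
    return {h: sorted(required - r) for h, r in seen.items() if required - r}


def shared_router_services(pairs, uplink):
    """pairs: {service: (primary, backup)}; uplink: {host: router}.
    Return the services whose primary and backup sit behind the same router."""
    return [svc for svc, (primary, backup) in pairs.items()
            if uplink[primary] == uplink[backup]]


# Example data: partly from the notes above, partly hypothetical placeholders.
sessions = [
    ("k8s1005", "cr1"),
    ("k8s1006", "cr1"),
    ("example-host", "cr1"),  # hypothetical host with sessions to both routers
    ("example-host", "cr2"),
]
pairs = {
    "text": ("text-primary", "lvs1016"),          # "text-primary" is a placeholder name
    "example": ("example-lvs-a", "example-lvs-b"),  # hypothetical, correctly split
}
uplink = {
    "text-primary": "cr1",
    "lvs1016": "cr1",
    "example-lvs-a": "cr1",
    "example-lvs-b": "cr2",
}

missing = hosts_missing_router(sessions, ["cr1", "cr2"])
spof = shared_router_services(pairs, uplink)
print(missing)
print(spof)
```

Running a check like this against the real session and uplink data before draining a router would flag both issues ahead of the window rather than through a page.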