
a1-eqiad pdu refresh (Thursday 9/12 @11am UTC)
Open, NormalPublic

Description

This task tracks the replacement of ps1-eqiad and ps2-eqiad with new PDUs in rack A1-eqiad.

Each server, switch, and router will need potential downtime scheduled, since this is a live power change of the PDU towers.

The network racks each have two separate existing PDU towers, which will be replaced with two new towers, so this swap is easier than the majority of the row A/B PDU swaps (which have combined old PDU towers).

  • Schedule downtime for the entire list of switches, routers, and servers (see the downtime sketch after this list).
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers has had its power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the remaining old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers has had its power restored from the second new PDU tower.
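
For the downtime step, a minimal sketch using Icinga's standard external command file is below. The host list, file path, duration, and author are illustrative assumptions, not the actual runbook; local tooling may wrap this differently.

  # Hypothetical sketch: schedule 2h of fixed Icinga downtime for each
  # affected host via the external command file (path varies per setup).
  NOW=$(date +%s); END=$((NOW + 7200))
  for HOST in cr1-eqiad asw2-a1-eqiad labsdb1009 kafka-jumbo1001 \
              dns1001 wdqs1006 db1126 analytics1058; do
    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;robh;a1-eqiad PDU swap T226782\n' \
      "$NOW" "$HOST" "$NOW" "$END" > /var/lib/icinga/rw/icinga.cmd
  done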

List of routers, switches, and servers

| device | role | SRE team | coordination |
| cr1-eqiad | primary router | | @ayounsi |
| asw2-a1-eqiad | asw | | @ayounsi |
| labsdb1009 | labsdb | DBA | @Marostegui to depool this host and stop MySQL |
| kafka-jumbo1001 | kafka | Analytics | Analytics has to monitor during the window |
| dns1001 | rec dns | Traffic | Traffic has to depool and monitor during the window |
| db1069 | db (scheduled for decommission, T227166) | DBA | |
| wdqs1006 | wdqs | Discovery | @Gehel good to go |
| db1126 | db | DBA | @Marostegui to depool this host beforehand |
| analytics1058 | hadoop worker | Analytics | Hadoop workers need nothing done for this work; if they lose power, just power back on and let Analytics know |
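
For the two DB hosts above, a hedged sketch of the depool steps with dbctl follows; the exact dbctl invocation and the service name (mariadb vs. mysql) are assumptions based on common usage, not the DBA runbook.

  # Hypothetical sketch: depool a DB host, then stop the database on it.
  dbctl instance db1126 depool
  dbctl config commit -m "Depool db1126 for a1-eqiad PDU swap T226782"
  # On labsdb1009, also stop the database before the power work:
  sudo systemctl stop mariadb   # service name assumed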

Event Timeline

Good to go from the DB side

Marostegui updated the task description. (Show Details)Jul 23 2019, 10:24 AM

It seems only one interface is master on cr1; the following change is needed to fail it over:

[edit interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2]
+        priority 70;
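
Assuming cr2 keeps the default VRRP priority of 100, dropping cr1 to 70 hands mastership of this VIP to cr2. A minimal sketch of applying the diff in the Junos CLI (the commit-confirmed safety net is an assumption, not part of the original change):

  # On cr1-eqiad:
  configure
  set interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2 priority 70
  commit confirmed 5      # auto-rollback in 5 minutes if we lose access
  run show vrrp summary   # verify cr2 is now master before confirming
  commit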

cr1 going down would be noticeable, but OSPF is quick to fail over (plus cr2 is already the preferred path to codfw and esams). For BGP we could pre-emptively disable the external peers, but this would have some user-facing impact (though less than the device going down).
This would be a good use of BGP graceful shutdown (T211728); a sketch follows below.
Since a power loss is quite unlikely, if I understand correctly, I'd suggest not making any routing changes. If anything goes bad, we can fully depool cr2 when we tackle A8.
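
For reference, a hedged Junos sketch of what the graceful-shutdown approach (T211728) could look like: tag routes exported to external peers with the RFC 8326 GRACEFUL_SHUTDOWN community so neighbors deprefer them before any cutover. Policy and term names here are made up.

  # Hypothetical graceful-shutdown export policy (names illustrative):
  set policy-options community GSHUT members 65535:0
  set policy-options policy-statement gshut-out term all then community add GSHUT
  # Apply gshut-out as an export policy on the external peer group, and
  # have the receiving side match GSHUT to set local-preference 0.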

RobH added a subscriber: elukey.Wed, Jul 24, 3:19 PM

All the analytics nodes are Hadoop workers, so it's not a big deal if they lose power.

The above was on another task, but it referenced the same role as analytics1058.

+1 for analytics1058; kafka-jumbo1001 is also OK, just please ping me or ottomata when starting so we can monitor.
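
For the monitoring itself, one hedged option during the window is to watch for under-replicated partitions on the jumbo cluster; the ZooKeeper endpoint below is a placeholder, and newer Kafka versions would use --bootstrap-server instead.

  # Hypothetical check: any under-replicated partitions on kafka jumbo?
  kafka-topics.sh --describe --under-replicated-partitions \
    --zookeeper zk-placeholder.eqiad.wmnet:2181/kafka/jumbo-eqiad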

RobH updated the task description. (Show Details)Wed, Jul 24, 3:21 PM
RobH updated the task description. (Show Details)Wed, Jul 24, 3:49 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:56:21Z] <bblack> depooling recdns on dns1001 via confctl - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:58:03Z] <XioNoX> failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:59:06Z] <bblack> lvs1014 - puppet disable, remove dns1001 from resolv.conf, restart pybal - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:59:58Z] <bblack> dns1001 - puppet disable, stop recursor service to kill anycast advert - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T16:10:52Z] <bblack> dns1001 - restart recursor and re-enable puppet - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T16:12:35Z] <bblack> re-pooling recdns on dns1001 via confctl - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T17:14:04Z] <XioNoX> rollback failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782
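
The SAL entries above trace the recdns depool/repool sequence on dns1001. A hedged sketch of the confctl side (selector fields are approximate, based on conftool's usual select/set form):

  # Hypothetical: depool recdns on dns1001 before the window...
  confctl select 'name=dns1001.eqiad.wmnet,service=recdns' set/pooled=no
  # ...and repool it afterwards.
  confctl select 'name=dns1001.eqiad.wmnet,service=recdns' set/pooled=yes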

RobH moved this task from Backlog to High Priority Task on the ops-eqiad board.Wed, Jul 24, 7:18 PM
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board.Fri, Jul 26, 1:37 PM
Marostegui updated the task description. (Show Details)Mon, Aug 5, 4:12 PM
Marostegui updated the task description. (Show Details)Tue, Aug 6, 8:06 AM
RobH removed RobH as the assignee of this task.Wed, Aug 14, 4:52 PM
wiki_willy renamed this task from a1-eqiad pdu refresh to a1-eqiad pdu refresh (Thursday 9/12 @11am UTC).Thu, Aug 15, 5:30 PM
CDanis triaged this task as Normal priority.Fri, Aug 16, 1:02 PM
Gehel updated the task description. (Show Details)Mon, Aug 19, 4:12 PM
Gehel added a subscriber: Gehel.