
a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1-eqiad and ps2-eqiad with new PDUs in rack A1-eqiad.

Each server, switch, and router will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

The network racks have two existing individual PDU towers, which will be replaced with two new PDU towers, so this swap is easier than most of the row A/B PDU swaps (which have combined old PDU towers).

  • - Schedule downtime for the entire list of switches, routers, and servers (a sketch follows this list).
  • - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Once the new PDU tower is confirmed online, move on to the next steps.
  • - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Connect via serial / confirm the serial connection works.
  • - Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • - Update the PDU model in Puppet per T233129.
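
For the downtime step, a minimal sketch of bulk-scheduling Icinga downtimes is below. The icinga-downtime helper, its flags, and the two-hour duration are assumptions for illustration, not taken from this task; db1069 is omitted since it is already powered off.

  # Run on the Icinga host; script name and flags are assumed, adjust to the actual tooling
  for host in cr1-eqiad asw2-a1-eqiad labsdb1009 kafka-jumbo1001 dns1001 wdqs1006 db1126 analytics1058; do
    sudo icinga-downtime -h "$host" -d 7200 -r "a1-eqiad PDU refresh T226782"
  done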

List of routers, switches, and servers

| device | role | SRE team | coordination |
| --- | --- | --- | --- |
| cr1-eqiad | primary router | | @ayounsi |
| asw2-a1-eqiad | asw | | @ayounsi |
| labsdb1009 | labsdb | DBA | @Marostegui to depool this host and stop MySQL |
| kafka-jumbo1001 | kafka | Analytics | Analytics has to monitor during the window |
| dns1001 | rec dns | Traffic | Traffic has to depool and monitor during the window |
| db1069 | db, scheduled for decommission T227166 | DBA | The host is powered OFF, waiting on on-site decommission steps only. DO NOT POWER BACK ON |
| wdqs1006 | wdqs | Discovery-ARCHIVED | @Gehel: good to go |
| db1126 | db | DBA | @Marostegui to depool this host beforehand |
| analytics1058 | hadoop worker | Analytics | Hadoop workers need nothing done for this work; if they lose power, just power back on and let Analytics know |

Event Timeline

Seems like only 1 interface is master on cr1; the following is needed to fail it over:

[edit interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2]
+        priority 70;
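
For reference, that change could be applied from the Junos CLI roughly as sketched below (a sketch only; it assumes cr2-eqiad keeps the default VRRP priority of 100, so dropping cr1 to 70 makes cr2 the master for this group):

  configure
  set interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2 priority 70
  commit comment "fail over VRRP master of ae2.1202 inet6 to cr2-eqiad"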

cr1 going down would be noticeable, but OSPF is quick to fail over (plus cr2 is the preferred path to codfw and esams). For BGP we can pre-emptively disable the external peers, but this will have some user-facing impact (even though less than the device going down).
This would be a good use of BGP graceful shutdown (T211728).
As a power loss is quite unlikely (if I understand correctly), I'd suggest not doing any routing changes. If anything goes bad we can fully depool cr2 when we tackle A8.
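
For context on the graceful shutdown idea: BGP graceful shutdown (RFC 8326) tags routes advertised to peers with the well-known GRACEFUL_SHUTDOWN community 65535:0 ahead of the maintenance, so peers deprefer those paths and traffic drains gracefully. A hypothetical Junos sketch follows; the policy and peer-group names are illustrative, not the actual cr1-eqiad configuration, and in practice the policy would be chained before the existing export policies:

  set policy-options community GRACEFUL-SHUTDOWN members 65535:0
  set policy-options policy-statement gshut-out term tag then community add GRACEFUL-SHUTDOWN
  set policy-options policy-statement gshut-out term tag then accept
  set protocols bgp group EXTERNAL-PEERS export gshut-out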

All the analytics nodes are hadoop workers, not a big deal if they lose power.

The above was on another task, but it referenced the same role as analytics1058.

+1 for analytics1058, kafka-jumbo1001 is also ok, just please ping me or ottomata when starting so we can monitor.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:56:21Z] <bblack> depooling recdns on dns1001 via confctl - T226782
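
For reference, depooling and repooling recursive DNS via confctl generally takes the form below; the exact conftool selector for dns1001's recdns service is an assumption here, not taken from this task:

  sudo confctl select 'name=dns1001.wikimedia.org' set/pooled=no
  # ...maintenance window...
  sudo confctl select 'name=dns1001.wikimedia.org' set/pooled=yes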

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:58:03Z] <XioNoX> failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:59:06Z] <bblack> lvs1014 - puppet disable, remove dns1001 from resolv.conf, restart pybal - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:59:58Z] <bblack> dns1001 - puppet disable, stop recursor service to kill anycast advert - T226782
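
A rough reconstruction of those two log entries as shell commands (service names, file paths, and the sed expression are assumptions based on the log lines, not verified against the hosts):

  # On lvs1014: keep Puppet from reverting the change, drop dns1001 from resolv.conf, restart pybal
  sudo puppet agent --disable "PDU maintenance T226782"
  sudo sed -i '/dns1001/d' /etc/resolv.conf
  sudo systemctl restart pybal

  # On dns1001: stop the recursor so its anycast route advertisement is withdrawn
  sudo puppet agent --disable "PDU maintenance T226782"
  sudo systemctl stop pdns-recursor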

Mentioned in SAL (#wikimedia-operations) [2019-07-24T16:10:52Z] <bblack> dns1001 - restart recursor and re-enable puppet - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T16:12:35Z] <bblack> re-pooling recdns on dns1001 via confctl - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T17:14:04Z] <XioNoX> rollback failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782

RobH removed RobH as the assignee of this task. Aug 14 2019, 4:52 PM
wiki_willy renamed this task from a1-eqiad pdu refresh to a1-eqiad pdu refresh (Thursday 9/12 @11am UTC). Aug 15 2019, 5:30 PM
CDanis triaged this task as Medium priority. Aug 16 2019, 1:02 PM

As this rack has one of our two most important routers, I'd like to be around for the maintenance.
11am UTC is 4am Pacific. It would be ideal if it could be pushed to at least 8am Pacific (15:00 UTC).
Otherwise please make sure Mark or Faidon can be there.

Per the SRE meeting, we'll be rescheduling the PDU upgrades for this rack to a later date (TBA), due to the ongoing work related to the recent outages.

wiki_willy renamed this task from a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) to a1-eqiad pdu refresh (Date TBD). Sep 9 2019, 4:44 PM
wiki_willy renamed this task from a1-eqiad pdu refresh (Date TBD) to a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC. Sep 30 2019, 5:13 PM
wiki_willy renamed this task from a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC to a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC).

The new date for upgrading the remaining PDU on network rack A1 is Tuesday, 10/15 at 11am UTC. Thanks, Willy

@wiki_willy labsdb1009 has a broken PSU (T233273). I think the new one will arrive before this maintenance, although it has not been ordered yet (T233277#5532768), but if it doesn't arrive, we'll need to power off this host entirely and not only stop MySQL.
Could you follow up with Pam to see if we can order the PSU so it arrives on time for this maintenance?

Thanks!

@Marostegui - sure, will do. This week is the approval & ordering phase of the procurement cycle, so it shouldn't be an issue getting the PO submitted for labsdb1009. Thanks, Willy

I believe this wasn't ordered last week, no?
So I don't think it will arrive before the 15th, which means we'd need to power this host down for this maintenance, I guess :(

@Marostegui - it was ordered last Friday morning. We haven't received the tracking number from the vendor yet, but will update that in T233277 once provided. There's still a chance it arrives before the 15th, but we should have an ETA soon. Thanks, Willy

Ah - thanks, as T233277 wasn't updated I thought it wasn't ordered. Let's see what the ETA is. Thanks for the update

Change 543023 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Depool labsdb1009

https://gerrit.wikimedia.org/r/543023

Change 543023 merged by Marostegui:
[operations/puppet@production] mariadb: Depool labsdb1009

https://gerrit.wikimedia.org/r/543023

Mentioned in SAL (#wikimedia-operations) [2019-10-15T05:38:25Z] <marostegui> Depool labsdb1009 for PDU maintenance T226782

Mentioned in SAL (#wikimedia-operations) [2019-10-15T07:10:43Z] <XioNoX> failover VRRP from cr1-eqiad to cr2-eqiad in prevision of the PDU work of - T226782

Mentioned in SAL (#wikimedia-operations) [2019-10-15T07:13:40Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1126 for PDU maintenance T226782', diff saved to https://phabricator.wikimedia.org/P9345 and previous config saved to /var/cache/conftool/dbconfig/20191015-071338-marostegui.json
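
For reference, the dbctl depool logged above roughly corresponds to the commands below; the exact subcommand forms are an assumption, while the commit message matches the SAL entry:

  sudo dbctl instance db1126 depool
  sudo dbctl config commit -m "Depool db1126 for PDU maintenance T226782"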

Mentioned in SAL (#wikimedia-operations) [2019-10-15T08:07:13Z] <marostegui> Stop MySQL on db1126 and labsdb1009 for PDU maintenance - T226782

db1126 and labsdb1009 are ok to proceed.
Note: db1069 has its power OFF as it is pending on-site decommissioning steps. DO NOT power it back on

The PDU swap is over. Nothing lost power while swapping the PDUs. Everything is cabled and the towers are linked together. Netbox is updated. The PDU configuration still needs to be completed.

Mentioned in SAL (#wikimedia-operations) [2019-10-16T13:46:25Z] <XioNoX> rollback failover VRRP from cr1-eqiad to cr2-eqiad - T226782

My understanding of this task state is as follows:

  • @Jclark-ctr had to emergency-swap out ps1-a1-eqiad due to a failure
  • He left the old ps2-a1-eqiad in place, unlinked from ps1 (as it's an old model), so power was not interrupted
  • We need to schedule time to swap ps2.

Is this correct? (Directed at @Jclark-ctr & @Cmjohnson)

@RobH - ps2 was swapped last Tuesday on 10/15

Ok, just logged in and confirmed that ps1 sees ps2. The rest was already configured from our deployment of ps1, except that the model hadn't been updated.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/545337

Now it has; resolving this task.