
a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1-eqiad and ps2-eqiad with new PDUs in rack A1-eqiad.

Each server, switch, and router will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

The network racks have two existing individual PDU towers, which will be replaced with two new PDU towers, so this swap is easier than most of the row A/B PDU swaps (which have combined old PDU towers).

  • - Schedule downtime for the entire list of switches, routers, and servers (a sketch follows this list).
  • - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Once the new PDU tower is confirmed online, move on to the next steps.
  • - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Connect via serial / confirm the serial connection works.
  • - Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • - Update the PDU model in Puppet per T233129.
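
For the downtime step, a minimal sketch of bulk-scheduling Icinga downtimes is below. The icinga-downtime helper, its flags, and the two-hour duration are assumptions for illustration, not taken from this task; db1069 is omitted since it is already powered off.

  # Run on the Icinga host; script name and flags are assumed, adjust to the actual tooling
  for host in cr1-eqiad asw2-a1-eqiad labsdb1009 kafka-jumbo1001 dns1001 wdqs1006 db1126 analytics1058; do
    sudo icinga-downtime -h "$host" -d 7200 -r "a1-eqiad PDU refresh T226782"
  done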

List of routers, switches, and servers

| device | role | SRE team | coordination |
| --- | --- | --- | --- |
| cr1-eqiad | primary router | | @ayounsi |
| asw2-a1-eqiad | asw | | @ayounsi |
| labsdb1009 | labsdb | DBA | @Marostegui to depool this host and stop MySQL |
| kafka-jumbo1001 | kafka | Analytics | Analytics has to monitor during the window |
| dns1001 | rec dns | Traffic | Traffic has to depool and monitor during the window |
| db1069 | db, scheduled for decommission T227166 | DBA | The host is powered OFF, waiting on on-site decommission steps only. DO NOT POWER BACK ON |
| wdqs1006 | wdqs | Discovery-ARCHIVED | @Gehel: good to go |
| db1126 | db | DBA | @Marostegui to depool this host beforehand |
| analytics1058 | hadoop worker | Analytics | Hadoop workers need nothing done for this work; if they lose power, just power back on and let Analytics know |

Event Timeline

Seems like only 1 interface is master on cr1; the following is needed to fail it over:

[edit interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2]
+        priority 70;
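
For reference, that change could be applied from the Junos CLI roughly as sketched below (a sketch only; it assumes cr2-eqiad keeps the default VRRP priority of 100, so dropping cr1 to 70 makes cr2 the master for this group):

  configure
  set interfaces ae2 unit 1202 family inet6 address 2620:0:861:202:fe00::1/64 vrrp-inet6-group 2 priority 70
  commit comment "fail over VRRP master of ae2.1202 inet6 to cr2-eqiad"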

cr1 going down would be noticeable, but OSPF is quick to fail over (plus cr2 is the preferred path to codfw and esams). For BGP we can pre-emptively disable the external peers, but this will have some user-facing impact (even though less than the device going down).
This would be a good use of BGP graceful shutdown (T211728).
As a power loss is quite unlikely (if I understand correctly), I'd suggest not doing any routing changes. If anything goes bad we can fully depool cr2 when we tackle A8.
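
For context on the graceful shutdown idea: BGP graceful shutdown (RFC 8326) tags routes advertised to peers with the well-known GRACEFUL_SHUTDOWN community 65535:0 ahead of the maintenance, so peers deprefer those paths and traffic drains gracefully. A hypothetical Junos sketch follows; the policy and peer-group names are illustrative, not the actual cr1-eqiad configuration, and in practice the policy would be chained before the existing export policies:

  set policy-options community GRACEFUL-SHUTDOWN members 65535:0
  set policy-options policy-statement gshut-out term tag then community add GRACEFUL-SHUTDOWN
  set policy-options policy-statement gshut-out term tag then accept
  set protocols bgp group EXTERNAL-PEERS export gshut-out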

All the analytics nodes are hadoop workers, not a big deal if they lose power.

The above was on another task, but it referenced the same role as analytics1058.

+1 for analytics1058, kafka-jumbo1001 is also ok, just please ping me or ottomata when starting so we can monitor.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:56:21Z] <bblack> depooling recdns on dns1001 via confctl - T226782
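
For reference, depooling and repooling recursive DNS via confctl generally takes the form below; the exact conftool selector for dns1001's recdns service is an assumption here, not taken from this task:

  sudo confctl select 'name=dns1001.wikimedia.org' set/pooled=no
  # ...maintenance window...
  sudo confctl select 'name=dns1001.wikimedia.org' set/pooled=yes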

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:58:03Z] <XioNoX> failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:59:06Z] <bblack> lvs1014 - puppet disable, remove dns1001 from resolv.conf, restart pybal - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:59:58Z] <bblack> dns1001 - puppet disable, stop recursor service to kill anycast advert - T226782
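
A rough reconstruction of those two log entries as shell commands (service names, file paths, and the sed expression are assumptions based on the log lines, not verified against the hosts):

  # On lvs1014: keep Puppet from reverting the change, drop dns1001 from resolv.conf, restart pybal
  sudo puppet agent --disable "PDU maintenance T226782"
  sudo sed -i '/dns1001/d' /etc/resolv.conf
  sudo systemctl restart pybal

  # On dns1001: stop the recursor so its anycast route advertisement is withdrawn
  sudo puppet agent --disable "PDU maintenance T226782"
  sudo systemctl stop pdns-recursor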

Mentioned in SAL (#wikimedia-operations) [2019-07-24T16:10:52Z] <bblack> dns1001 - restart recursor and re-enable puppet - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T16:12:35Z] <bblack> re-pooling recdns on dns1001 via confctl - T226782

Mentioned in SAL (#wikimedia-operations) [2019-07-24T17:14:04Z] <XioNoX> rollback failover master VIP of ae2.1202 inet6 away from cr1-eqiad - T226782

RobH removed RobH as the assignee of this task. Aug 14 2019, 4:52 PM
wiki_willy renamed this task from a1-eqiad pdu refresh to a1-eqiad pdu refresh (Thursday 9/12 @11am UTC). Aug 15 2019, 5:30 PM
CDanis triaged this task as Medium priority. Aug 16 2019, 1:02 PM

As this rack has one of our two most important routers, I'd like to be around for the maintenance.
11am UTC is 4am Pacific. It would be ideal if it could be pushed to at least 8am Pacific (15:00 UTC).
Otherwise please make sure Mark or Faidon can be there.

Per the SRE meeting, we'll be rescheduling the PDU upgrades for this rack to a later date (TBA), due to the ongoing work related to the recent outages.

wiki_willy renamed this task from a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) to a1-eqiad pdu refresh (Date TBD). Sep 9 2019, 4:44 PM
wiki_willy renamed this task from a1-eqiad pdu refresh (Date TBD) to a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC. Sep 30 2019, 5:13 PM
wiki_willy renamed this task from a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC to a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC).

The new date for upgrading the remaining PDU on network rack A1 is Tuesday, 10/15 at 11am UTC. Thanks, Willy

@wiki_willy labsdb1009 has a broken PSU (T233273). I think the new one will arrive before this maintenance, although it has not been ordered yet (T233277#5532768), but if it doesn't arrive, we'll need to power off this host entirely and not only stop MySQL.
Could you follow up with Pam to see if we can order the PSU so it arrives on time for this maintenance?

Thanks!

@Marostegui - sure, will do. This week is the approval & ordering phase of the procurement cycle, so it shouldn't be an issue getting the PO submitted for labsdb1009. Thanks, Willy

I believe this wasn't ordered last week, no?
So I don't think it will arrive before the 15th, which means we'd need to power this host down for this maintenance, I guess :(

@Marostegui - it was ordered last Friday morning. We haven't received the tracking number from the vendor yet, but will update that in T233277 once provided. There's still a chance it arrives before the 15th, but we should have an ETA soon. Thanks, Willy

Ah - thanks, as T233277 wasn't updated I thought it wasn't ordered. Let's see what the ETA is. Thanks for the update

Change 543023 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Depool labsdb1009

https://gerrit.wikimedia.org/r/543023

Change 543023 merged by Marostegui:
[operations/puppet@production] mariadb: Depool labsdb1009

https://gerrit.wikimedia.org/r/543023

Mentioned in SAL (#wikimedia-operations) [2019-10-15T05:38:25Z] <marostegui> Depool labsdb1009 for PDU maintenance T226782

Mentioned in SAL (#wikimedia-operations) [2019-10-15T07:10:43Z] <XioNoX> failover VRRP from cr1-eqiad to cr2-eqiad in prevision of the PDU work of - T226782

Mentioned in SAL (#wikimedia-operations) [2019-10-15T07:13:40Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1126 for PDU maintenance T226782', diff saved to https://phabricator.wikimedia.org/P9345 and previous config saved to /var/cache/conftool/dbconfig/20191015-071338-marostegui.json
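
For reference, the dbctl depool logged above roughly corresponds to the commands below; the exact subcommand forms are an assumption, while the commit message matches the SAL entry:

  sudo dbctl instance db1126 depool
  sudo dbctl config commit -m "Depool db1126 for PDU maintenance T226782"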

Mentioned in SAL (#wikimedia-operations) [2019-10-15T08:07:13Z] <marostegui> Stop MySQL on db1126 and labsdb1009 for PDU maintenance - T226782

db1126 and labsdb1009 are ok to proceed.
Note: db1069 has its power OFF as it is pending on-site decommissioning steps. DO NOT power it back on

The PDU swap is over. Nothing lost power while swapping the PDUs. Everything is cabled and the towers are linked together. Netbox is updated. The PDU configuration still needs to be completed.

Mentioned in SAL (#wikimedia-operations) [2019-10-16T13:46:25Z] <XioNoX> rollback failover VRRP from cr1-eqiad to cr2-eqiad - T226782

My understanding of this task state is as follows:

  • @Jclark-ctr had to emergency-swap out ps1-a1-eqiad due to a failure
  • He left the old ps2-a1-eqiad in place, unlinked from ps1 (as it's an old model), so power was not interrupted
  • We need to schedule time to swap ps2.

Is this correct? (Directed at @Jclark-ctr & @Cmjohnson)

@RobH - ps2 was swapped last Tuesday on 10/15

Ok, just logged in and confirmed that ps1 sees ps2. The rest was already configured from our deployment of ps1, except that the model hadn't been updated.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/545337

Now it has; resolving this task.