a4-eqiad pdu refresh
Closed, Resolved, Public

Assigned To

None

Authored By

	RobH
	Jul 2 2019, 8:00 PM

Description

This task will track the migration of the ps1 and ps2 to be replaced with new PDUs in rack A4-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

- Schedule downtime for the entire list of switches and servers.
- Wire up one of the two new towers, energize it, and relocate power to it from the matching side of the existing/old PDU tower (that side is now de-energized).
- Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
- Once the new PDU tower is confirmed online, move on to the next steps.
- Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
- Set up all remote configuration options for the new PDU (network, snmp, login, etc.).
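The ordering of the steps above matters: the second tower is only touched after every device is confirmed powered from the first. A minimal Python sketch of that gating follows. The names `run_refresh` and `has_power` are hypothetical and not part of any tooling mentioned in this task, and the power check is supplied by the caller.

```python
from typing import Callable, List, Sequence


def run_refresh(
    devices: Sequence[str],
    has_power: Callable[[str], bool],
    towers: Sequence[str] = ("A", "B"),
) -> List[str]:
    """Return a log of the swap steps, stopping if any device lacks power.

    `has_power` is a caller-supplied check (hypothetical); in practice this
    would be a person or a monitoring query confirming the device is fed.
    """
    log: List[str] = []
    for tower in towers:
        log.append(f"wire up and energize new tower {tower}")
        log.append(f"relocate power from old PDU to new tower {tower}")
        # Do not continue to the next tower until every device is confirmed.
        missing = [d for d in devices if not has_power(d)]
        if missing:
            log.append("STOP: no power on " + ", ".join(missing))
            return log
        log.append(f"tower {tower} confirmed online")
    log.append("set up remote configuration (network, snmp, login)")
    return log


if __name__ == "__main__":
    for step in run_refresh(["asw2-a4-eqiad", "rdb1003"], lambda d: True):
        print(step)
```

The early return is the point of the sketch: a device that did not come back on the first tower should be dealt with before the remaining old feed is removed.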

List of routers, switches, and servers

device	role	SRE team coordination	notes
asw2-a4-eqiad	asw	@ayounsi
rdb1003
ms-be1046	ms-be	@fgiunchedi
stat1004	analytics	Analytics
kafka1001	kafka	@herron
restbase1007		@fgiunchedi
wdqs1003	wdqs
labstore1006 (and two arrays)	labstore	cloud-services-team
ganeti1005	ganeti node	@akosiaris	host will need to be emptied in advance
contint1001		#rel-eng
oresrdb1002	ores	@akosiaris	Fine to reboot at anytime. Caution: Not the case with oresrdb1001
netmon1002
lvs1003	lvs	Traffic
lvs1002	lvs	Traffic
lvs1001	lvs	Traffic
cp1076	cp	Traffic
rhenium
db1111	db	DBA
conf1004	zookeeper/etcd	serviceops Analytics
ms-fe1006	ms-fe	@fgiunchedi
cp1075	cp	Traffic
labservices1002	labservices	cloud-services-team
an-worker1080	analytics	Analytics
maps1001
oxygen
analytics1070	analytics	Analytics
snapshot1005
kubestage1001	kubernetes staging	serviceops
logstash1004
scb1001
aqs1004	analytics	Analytics
druid1001	analytics	Analytics
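Several rows in the list above have no entry in the coordination column. As a rough illustration, the tab-separated list can be parsed to surface those rows before downtime is scheduled. This is only a sketch: `parse_devices` and `without_contact` are hypothetical helper names, and `SAMPLE` copies just a handful of rows from the list.

```python
import csv
import io
from typing import Dict, List

# A few rows copied from the list above, tab-separated as in the task.
SAMPLE = (
    "device\trole\tSRE team coordination\tnotes\n"
    "asw2-a4-eqiad\tasw\t@ayounsi\n"
    "rdb1003\n"
    "ms-be1046\tms-be\t@fgiunchedi\n"
    "restbase1007\t\t@fgiunchedi\n"
    "ganeti1005\tganeti node\t@akosiaris\thost will need to be emptied in advance\n"
    "netmon1002\n"
)


def parse_devices(text: str) -> List[Dict[str, str]]:
    """Parse the tab-separated device list; missing cells become ''."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{key: (value or "").strip() for key, value in row.items()} for row in reader]


def without_contact(rows: List[Dict[str, str]]) -> List[str]:
    """Return device names whose coordination column is empty."""
    return [r["device"] for r in rows if not r["SRE team coordination"]]


if __name__ == "__main__":
    print(without_contact(parse_devices(SAMPLE)))
```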

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Cmjohnson	T226778 Install new PDUs in rows A/B (Top level tracking task)
		Resolved		None	T227140 a4-eqiad pdu refresh

Event Timeline

RobH created this task. Jul 2 2019, 8:00 PM

RobH mentioned this in T226778: Install new PDUs in rows A/B (Top level tracking task).

RobH updated the task description. Jul 2 2019, 9:11 PM

RobH added subscribers: ayounsi, akosiaris, fgiunchedi.

elukey updated the task description. Jul 16 2019, 10:01 AM

elukey updated the task description.

elukey added a subscriber: herron.

I replaced the Analytics tag for kafka1001 with @herron since the kafka main cluster is now handled by infrastructure foundations.

I also added some Analytics tags, and added @akosiaris for conf1004 since it runs both zookeeper (hadoop + all kafkas) and etcd (conftool, etc.).

stat1004 will need a heads-up email to the people using it (researchers, analysts, etc.), just as an FYI. I will take care of it.

cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "pool"
lvs100[123] - Not in use and should be decommed, but this ticket made me realize we haven't made an lvs1001-6 decom ticket yet (will do shortly!)

In T226778#5354000, @RobH wrote:

Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in any rack in which we are doing PDU swaps is to downtime the host in icinga, and then power it off. Perform the PDU swaps, and once fully done, power back up the host and it will run puppet and re-pool itself.

ms-fe hosts, by contrast, only need to be depooled from lvs; power can stay on, as these hosts are stateless anyway.

elukey updated the task description. Jul 22 2019, 4:39 PM

Just depooled aqs1004

Mentioned in SAL (#wikimedia-operations) [2019-07-22T17:22:55Z] <herron> depooling kafka1001 for PDU work T227140

Mentioned in SAL (#wikimedia-operations) [2019-07-22T17:35:45Z] <elukey> depool scb1001 for PDU work T227140

RobH updated the task description. Jul 22 2019, 6:44 PM

ms-be1046 rebooted and back online

ms-fe1006 repooled

cp1075 repooled

Mentioned in SAL (#wikimedia-operations) [2019-07-22T18:59:13Z] <herron> repooling kafka1001 T227140

@elukey repooled aqs1004 and scb1001

the upgrade of a4-eqiad pdus is done.

RobH updated the task description. Jul 22 2019, 7:02 PM

Emptying ganeti1005 will require some 15-30 minutes of advance notice. If neither I nor @MoritzMuehlenhoff is around, it can be emptied per https://wikitech.wikimedia.org/wiki/Ganeti#Node_operations using:

sudo gnt-node migrate -f ganeti1005

conf1004 should be fully powered down, the desired actions performed, and then powered on again (it will repool itself automatically).

Sigh, this was already done. I just hope the info added will be useful at some point in the future as a guide
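For future PDU swaps, the per-host handling notes scattered through the comments in this task could be collected into a simple lookup. The sketch below is illustrative only: `prework_for` is a hypothetical helper, the wording of each note paraphrases the comments above, and matching ms-be/ms-fe by hostname prefix is an assumption about naming rather than an existing tool.

```python
import re

# Notes stated for specific hosts in this task.
HOST_NOTES = {
    "cp1076": "depool ahead of the work with 'depool'; repool afterwards with 'pool'",
    "ganeti1005": "empty the node first: sudo gnt-node migrate -f ganeti1005 "
                  "(allow 15-30 minutes of notice)",
    "conf1004": "power down fully before the work; it repools itself after power-on",
}

# Notes stated for a whole class of hosts, keyed by hostname prefix.
PREFIX_NOTES = {
    "ms-be": "downtime in icinga, then power off; power on once the swap is fully done",
    "ms-fe": "depool from lvs; power can stay on",
}

DEFAULT_NOTE = "no special handling recorded in this task"


def prework_for(host: str) -> str:
    """Return the handling note for `host`, falling back to a default."""
    if host in HOST_NOTES:
        return HOST_NOTES[host]
    # Strip the trailing number, e.g. 'ms-be1046' -> 'ms-be'.
    match = re.match(r"^(.*?)\d+$", host)
    prefix = match.group(1) if match else host
    return PREFIX_NOTES.get(prefix, DEFAULT_NOTE)


if __name__ == "__main__":
    for name in ("ms-be1046", "ms-fe1006", "ganeti1005", "rhenium"):
        print(f"{name}: {prework_for(name)}")
```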

fgiunchedi mentioned this in T227138: a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC). Jul 23 2019, 9:12 AM

Please note that netmon and kubestage both powered off yesterday (irc update about this) so we didn't have a flawless migration.

RobH removed RobH as the assignee of this task. Aug 28 2019, 6:40 PM
