a4-eqiad pdu refresh
Closed, Resolved (Public)

Description

This task will track the replacement of ps1 and ps2 with new PDUs in rack A4-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
  • Set up all remote configuration options for the new PDU (network, SNMP, login, etc.).
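
The ordering in the steps above (finish and confirm one tower before touching the other) can be sketched as a small tracker. This is only an illustration: the class, its method names, and the choice of tower "A" as the first tower are assumptions, not something the task specifies.

```python
from dataclasses import dataclass, field


@dataclass
class PduSwap:
    """Tracks a two-tower PDU swap. All names here are hypothetical."""

    devices: list[str]
    confirmed: dict[str, set[str]] = field(
        default_factory=lambda: {"A": set(), "B": set()}
    )

    def confirm(self, tower: str, device: str) -> None:
        # Record that a device is drawing power from the given new tower.
        if device not in self.devices:
            raise ValueError(f"unknown device: {device}")
        self.confirmed[tower].add(device)

    def missing(self, tower: str) -> list[str]:
        # Devices not yet confirmed on this tower, in list order.
        return [d for d in self.devices if d not in self.confirmed[tower]]

    def may_start_second_tower(self) -> bool:
        # Assumption: tower A goes first. The second tower is only started
        # once every device is confirmed on the first.
        return not self.missing("A")


swap = PduSwap(devices=["asw2-a4-eqiad", "ms-be1046", "netmon1002"])
swap.confirm("A", "asw2-a4-eqiad")
swap.confirm("A", "ms-be1046")
print(swap.missing("A"))               # ['netmon1002']
print(swap.may_start_second_tower())   # False
```

Listing the unconfirmed devices explicitly makes it harder to move on while a host is still without power.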

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-a4-eqiad | asw | @ayounsi | |
| rdb1003 | | | |
| ms-be1046 | ms-be | @fgiunchedi | |
| stat1004 | analytics | Analytics | |
| kafka1001 | kafka | @herron | |
| restbase1007 | | @fgiunchedi | |
| wdqs1003 | wdqs | | |
| labstore1006 (and two arrays) | labstore | cloud-services-team | |
| ganeti1005 | ganeti node | @akosiaris | host will need to be emptied in advance |
| contint1001 | | #rel-eng | |
| oresrdb1002 | ores | @akosiaris | Fine to reboot at anytime. Caution: Not the case with oresrdb1001 |
| netmon1002 | | | |
| lvs1003 | lvs | Traffic | |
| lvs1002 | lvs | Traffic | |
| lvs1001 | lvs | Traffic | |
| cp1076 | cp | Traffic | |
| rhenium | | | |
| db1111 | db | DBA | |
| conf1004 | zookeeper/etcd | serviceops Analytics | |
| ms-fe1006 | ms-fe | @fgiunchedi | |
| cp1075 | cp | Traffic | |
| labservices1002 | labservices | cloud-services-team | |
| an-worker1080 | analytics | Analytics | |
| maps1001 | | | |
| oxygen | | | |
| analytics1070 | analytics | Analytics | |
| snapshot1005 | | | |
| kubestage1001 | kubernetes staging | serviceops | |
| logstash1004 | | | |
| scb1001 | | | |
| aqs1004 | analytics | Analytics | |
| druid1001 | analytics | Analytics | |
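
When sending heads-up notices, it can help to group the devices in the list above by their coordination contact. The sketch below does that for a handful of rows copied from the list; the function name and the "unassigned" label for devices with no named contact are assumptions made for illustration.

```python
from collections import defaultdict

# (device, coordination contact) pairs copied from a few rows of the list.
# An empty string means the task names no contact for that device.
ROWS = [
    ("asw2-a4-eqiad", "@ayounsi"),
    ("ms-be1046", "@fgiunchedi"),
    ("ms-fe1006", "@fgiunchedi"),
    ("cp1075", "Traffic"),
    ("cp1076", "Traffic"),
    ("netmon1002", ""),
]


def group_by_contact(rows):
    """Return {contact: [devices]}; devices with no contact go under 'unassigned'."""
    groups = defaultdict(list)
    for device, contact in rows:
        groups[contact or "unassigned"].append(device)
    return dict(groups)


for contact, devices in group_by_contact(ROWS).items():
    print(f"{contact}: {', '.join(devices)}")
```

Devices that end up under "unassigned" are the ones for which nobody has been asked about downtime yet.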

Event Timeline

RobH updated the task description. Jul 2 2019, 9:11 PM
RobH added subscribers: ayounsi, akosiaris, fgiunchedi.
elukey updated the task description. Jul 16 2019, 10:01 AM
elukey updated the task description.
elukey updated the task description.
elukey added a subscriber: herron.
elukey added a subscriber: elukey. Jul 16 2019, 10:04 AM

I replaced the Analytics tag for kafka1001 with @herron since the kafka main cluster is now handled by infrastructure foundations.

I also added some Analytics tags, and added @akosiaris for conf1004 since it runs both zookeeper (hadoop + all kafkas) and etcd (conftool, etc.).

stat1004 will need a heads-up email to the people using it (researchers, analysts, etc.) as an FYI; I will take care of it.

BBlack added a subscriber: BBlack. Edited Jul 22 2019, 2:46 PM

cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "pool"
lvs100[123] - Not in use and should be decommed, but this ticket made me realize we haven't made an lvs1001-6 decom ticket yet (will do shortly!)
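
The depool-before, repool-after pattern for cp1076 can be sketched as a context manager. Only the command names `depool` and `pool` come from the comment above; the wrapper, its names, and the decision to leave the host depooled when the work fails are assumptions for illustration.

```python
import subprocess
from contextlib import contextmanager


def run_local(command):
    # Runs a command on the local host; raises CalledProcessError on failure.
    subprocess.run([command], check=True)


@contextmanager
def depooled(run=run_local):
    """Depool before the work; repool only if the work raised no error."""
    run("depool")
    yield
    # Not reached if the body raised: the host then stays depooled on purpose.
    run("pool")


# Demonstration with a recording runner instead of the real commands.
calls = []
with depooled(run=calls.append):
    calls.append("pdu work")
print(calls)  # ['depool', 'pdu work', 'pool']
```

Passing the runner as a parameter keeps the sketch testable on a machine where the `depool` and `pool` commands do not exist.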

RobH added a comment. Jul 22 2019, 2:47 PM

Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in any rack in which we are doing PDU swaps is to downtime the host in icinga, and then power it off. Perform the PDU swaps, and once fully done, power back up the host and it will run puppet and re-pool itself.

ms-fe hosts, by contrast, are OK to depool from LVS; power can stay on, as these hosts are stateless anyway.
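
The different handling of ms-be and ms-fe hosts can be written down as a small lookup. The step wording paraphrases the comments above; the prefix matching and the function name are assumptions, and hosts of other classes simply have no entry.

```python
# Planned steps per host class, paraphrased from the comments above.
# Matching on the hostname prefix is an illustrative assumption.
POLICY = {
    "ms-be": ["downtime in icinga", "power off", "swap PDUs", "power on"],
    "ms-fe": ["depool from lvs", "swap PDUs", "repool"],
}


def steps_for(host):
    """Return the planned steps for a host, or None if no policy is known."""
    for prefix, steps in POLICY.items():
        if host.startswith(prefix):
            return steps
    return None


print(steps_for("ms-be1046"))
print(steps_for("ms-fe1006"))
print(steps_for("netmon1002"))  # None
```

Returning `None` for unknown hosts, rather than a default plan, forces an explicit decision for every device that has no documented procedure.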

elukey updated the task description. Jul 22 2019, 4:39 PM

Just depooled aqs1004

Mentioned in SAL (#wikimedia-operations) [2019-07-22T17:22:55Z] <herron> depooling kafka1001 for PDU work T227140

Mentioned in SAL (#wikimedia-operations) [2019-07-22T17:35:45Z] <elukey> depool scb1001 for PDU work T227140

RobH updated the task description. Jul 22 2019, 6:44 PM
RobH added a comment. Jul 22 2019, 6:57 PM

ms-be1046 rebooted and back online

ms-fe1006 repooled

cp1075 repooled

Mentioned in SAL (#wikimedia-operations) [2019-07-22T18:59:13Z] <herron> repooling kafka1001 T227140

RobH closed this task as Resolved. Jul 22 2019, 7:02 PM

@elukey repooled aqs1004 and scb1001

the upgrade of a4-eqiad pdus is done.

RobH updated the task description. Jul 22 2019, 7:02 PM
akosiaris updated the task description. Jul 23 2019, 6:51 AM
akosiaris added a subscriber: MoritzMuehlenhoff.

sudo gnt-node migrate -f ganeti1005

  • conf1004 should be fully powered down, the desired actions performed, and then powered on again (it will repool itself automatically)

Sigh, this was already done. I just hope the info added will be useful at some point in the future as a guide

RobH added a comment. Jul 23 2019, 2:20 PM

Please note that netmon and kubestage both powered off yesterday (irc update about this) so we didn't have a flawless migration.

RobH removed RobH as the assignee of this task. Aug 28 2019, 6:40 PM