This task tracks the replacement of ps1 and ps2 in rack B7-eqiad with new PDUs.
Each server and switch will need potential downtime scheduled, since this will be a live power change of the PDU towers.
This rack has a single tower for the old PDU (with an A side and a B side), while the new PDUs have independent A and B towers.
- [ ] Schedule downtime for the entire list of switches and servers.
- [ ] Before work starts, silence all Icinga alerts until 8 PM GMT the same day.
- [ ] Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- [ ] Confirm the entire list of switches, routers, and servers has had power restored from the new PDU tower.
- [ ] Once the new PDU tower is confirmed online, move on to the next steps.
- [ ] Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- [ ] Confirm the entire list of switches, routers, and servers has had power restored from the new PDU tower.
- [ ] Connect via serial / confirm the serial connection works.
- [ ] Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
- [ ] Update the PDU model in Puppet per T233129.
- [ ] Clear the Icinga errors for the missing ps2 input by connecting (or checking) the RJ11 cable between ps1 and ps2 in B7-eqiad. Once it is connected, the Icinga errors for the tower B infeed will clear.
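The downtime-scheduling step above can be sketched against Icinga's external command interface. This is a dry-run sketch only: the command-file path, the `dcops` author string, and the abridged host list are assumptions, and Wikimedia production has its own tooling for mass-downtiming hosts.

```shell
# Dry-run sketch: emit one Icinga SCHEDULE_HOST_DOWNTIME command per device.
# In production these lines would be written to Icinga's external command
# file (path varies; /var/lib/icinga/rw/icinga.cmd is an assumption).
HOSTS="asw-b7-eqiad wtp1033 wtp1032 wtp1031 kafka-main1002"  # ...abridged

START=$(date +%s)
END=$(date -u -d '20:00' +%s)   # until 8 PM GMT the same day
DURATION=$((END - START))

for host in $HOSTS; do
  # [ts] SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
  printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;dcops;B7-eqiad PDU swap\n' \
    "$START" "$host" "$START" "$END" "$DURATION"
done
```

Piping the emitted lines into the command file (instead of stdout) is what actually schedules the downtime; service-level downtimes use `SCHEDULE_SVC_DOWNTIME` with the same fields plus a service description.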
List of routers, switches, and servers
| device | role | SRE team coordination | recommended action during maintenance |
| --- | --- | --- | --- |
| asw-b7-eqiad | asw | @ayounsi | ensure this doesn't go offline, as it will take the entire rack's network offline |
| wtp1033 | | | |
| wtp1032 | | | |
| wtp1031 | | | |
| kafka-main1002 | | @herron | To avoid alert noise from adjacent kafka-main hosts, schedule Icinga downtime for the "Kafka Broker Under Replicated Partitions" service on kafka-main100[123] as well. Gracefully shut down the server before maintenance, and ensure it is powered up when completed. |
| dbprov1002 | db provisioning/backup generation host | DBA | nothing to do, but @jcrespo will keep an eye on it |
| cloudvirtan1005 | | | |
| cloudvirtan1004 | | | |
| an-worker1087 | | @Nuria | |
| an-worker1086 | | @Nuria | |
| cp1082 | cp system | Traffic | T227542#5355289 |
| cp1081 | cp system | Traffic | T227542#5355289 |
| ms-be1041 | ms-be system | filippo | gracefully shut down the host just before rack maintenance, and power it back online post-maintenance |
| cloudvirt1022 | cloudvirt host | cloud-services-team | @JHedden: no running VMs, can happen anytime |
| analytics1073 | | Analytics | fine to do any time |
| lvs1014 | lvs system | @BBlack | T227542#5355289 |
| cloudvirt1020 | cloudvirt host | cloud-services-team | @JHedden: has running VMs, please handle with care |
| druid1005 | | Analytics | fine to do any time |
| ores1003 | | | |
| cloudnet1003 | | cloud-services-team | @JHedden: active, but it has a redundant peer |
| restbase-dev1005 | | | |
| cloudcontrol1004 | | cloud-services-team | @JHedden: active, but it has a redundant peer |
| cloudvirt1017 | cloudvirt | cloud-services-team | @JHedden: has a large number of running VMs, please handle with care |
| mw1318 | mw server | @Joe | |
| mw1317 | mw server | @Joe | |
| mw1316 | mw server | @Joe | |
| mw1315 | mw server | @Joe | |
| mw1314 | mw server | @Joe | |
| mw1313 | mw server | @Joe | |
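The two "confirm power restored" checks can be partially automated with a reachability sweep over the devices in the table above. The `CHECK` probe and the abridged host list are placeholders (substitute ssh, SNMP, or mgmt-interface probes as appropriate), and a ping only proves the host answers on the network: boxes with redundant PSUs can lose one feed and still respond, so also verify both infeeds on the PDU itself.

```shell
# Sketch: post-cutover reachability sweep over the rack's device list.
# CHECK and the abridged host list are assumptions for illustration.
HOSTS="asw-b7-eqiad wtp1033 kafka-main1002 cp1082 mw1313"  # ...abridged
CHECK=${CHECK:-"ping -c 1 -W 2"}

sweep() {
  failed=""
  for host in $HOSTS; do
    if $CHECK "$host" >/dev/null 2>&1; then
      echo "OK   $host"
    else
      echo "DOWN $host"
      failed="$failed $host"
    fi
  done
  # Summarize anything that did not answer, for follow-up at the rack.
  [ -z "$failed" ] || echo "check power for:$failed"
}

sweep
```

Run the sweep once after each tower cutover; anything reported `DOWN` gets a hands-on power check before moving to the next step.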