This task will track the work required to prepare/stage and then swap out the failed PDU tower in a2-eqiad. Details are as follows:
* ps1-a2-eqiad is a dual-input PDU (towers A and B combined in a single chassis) with 24 ports per tower.
* ps1-a2-eqiad has had failures occur on its phases, either due to the PDU failing, or due to phase imbalance that cannot be corrected due to the limited number of power plugs per tower (only 24).
** Chris will swap out the existing/failing ps1-a2-eqiad and put in a spare dual-wide PDU with 42 ports per tower. This isn't as ideal as a brand-new PDU (via T210776), but the brand-new PDU has a 30-day lead time.
* All systems in a2-eqiad have to be reviewed, as downtime could result.
** All precautions will be taken to try to migrate PDUs without downtime, but nothing is a certainty when dealing with the power feeds into our rack.
[] - List all systems in a2-eqiad, check with service owners, and schedule a downtime date before Chris leaves for All Hands.
[] - @cmjohnson stages the new PDU adjacent to or in the rack, unplugs the failed side of the existing PDU, and plugs in one side of the replacement PDU.
[] - @cmjohnson migrates the now de-energized side of the old PDU's plugs to the replacement PDU, returning redundant power to all devices.
[] - @cmjohnson de-energizes the remaining side of the old PDU, energizes the replacement PDU fully, and migrates all remaining power to the new PDU.
== Scheduling ==
This is slated to take place during @cmjohnson's morning hours on either Thursday, January 17th or Tuesday, January 22nd. Having it in the EST AM allows for the largest work-hours overlap with the most affected services/groups.
== Servers & Devices in A2-eqiad ==
https://netbox.wikimedia.org/dcim/racks/2/
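The rack's contents can also be pulled programmatically from Netbox's REST API rather than the web UI. A minimal sketch, assuming the standard Netbox `/api/dcim/devices/` list endpoint and its `{"count": N, "results": [...]}` response shape; the sample payload below is illustrative, and authentication (an API token header) is omitted:

```python
# Sketch: list devices in a rack via the Netbox REST API.
# rack_id=2 comes from the Netbox URL above.

NETBOX = "https://netbox.wikimedia.org"

def rack_devices_url(rack_id: int) -> str:
    """Build the DCIM devices endpoint filtered to one rack."""
    return f"{NETBOX}/api/dcim/devices/?rack_id={rack_id}"

def device_names(payload: dict) -> list:
    """Pull sorted device names out of a Netbox list response."""
    return sorted(d["name"] for d in payload["results"])

# Stubbed response so the parsing can be exercised offline:
sample = {"count": 2, "results": [{"name": "kafka1012"}, {"name": "asw2-a2-eqiad"}]}
print(rack_devices_url(2))   # https://netbox.wikimedia.org/api/dcim/devices/?rack_id=2
print(device_names(sample))  # ['asw2-a2-eqiad', 'kafka1012']
```

Cross-checking the API result against the lists below would catch any host racked after this task was filed.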
Network Devices:
asw2-a2-eqiad
asw-a2-eqiad
msw-a2-eqiad
Servers:
== analytics ==
conf1001 - not used anymore, in decom phase
kafka1012 - please cross-cable one of the three kafka machines in this rack, doesn't matter which of the three
kafka1013 - please cross-cable one of the three kafka machines in this rack, doesn't matter which of the three
kafka1023 - please cross-cable one of the three kafka machines in this rack, doesn't matter which of the three
kafka-jumbo1002 - needs some heads-up (about 10 minutes) to gracefully stop kafka on it
an-worker1078 - can go down anytime, but a little heads-up would be good for a graceful shutdown
an-worker1079 - can go down anytime, but a little heads-up would be good for a graceful shutdown
db1107 - shared with Data Persistence - this needs time: 1) stop eventlogging, 2) stop replication from db1108, 3) stop mysql gracefully
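The three-step db1107 sequence above could be driven by a small wrapper like the sketch below. The unit and service names are assumptions (guesses, not the actual production unit names), and the wrapper is dry-run by default, so it only prints the plan:

```python
# Hedged sketch of the db1107 shutdown order; command/unit names are guesses.
import subprocess

STEPS = [
    ("stop eventlogging", ["systemctl", "stop", "eventlogging-consumer"]),  # unit name is a guess
    ("stop replication from db1108", ["mysql", "-e", "STOP SLAVE;"]),
    ("stop mysql gracefully", ["systemctl", "stop", "mariadb"]),
]

def plan(steps=STEPS, dry_run=True):
    """Print each step in order; only execute when dry_run is False."""
    for desc, cmd in steps:
        print(f"db1107: {desc}: {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)

plan()  # dry run: prints the three steps without touching the host
```

Keeping the steps as ordered data makes it easy to review the sequence with Data Persistence before anything is actually stopped.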
== other ==
cloudelastic1001
db1074
db1075 - this is a master, cannot lose power
db1079
db1080
db1081
db1082
es1011
es1012
ms-be1019
ms-be1044
ms-be1045
tungsten - role(xhgui::app) - #performance-team