This task tracks the work required to prepare/stage, and then swap out, the failed PDU tower in a2-eqiad. Details are as follows:
* ps1-a2-eqiad is a dual input (tower A and B combined in a single PDU chassis) with 24 ports per tower.
* ps1-a2-eqiad has had failures occur on its phases, either due to the PDU failing, or due to phase imbalance that cannot be corrected due to the limited number of power plugs per tower (only 24).
** Chris will swap out the existing/failing ps1-a2-eqiad and install a spare dual-wide PDU with 42 ports per tower. This isn't as ideal as a brand-new PDU (via T210776), but a new PDU has a 30-day lead time.
* All systems in a2-eqiad have to be reviewed, as downtime could result.
** All precautions will be taken to try to migrate PDUs without downtime, but nothing is a certainty when dealing with the power feeds into our rack.
[x] - List all systems in a2-eqiad, check with service owners, and schedule a downtime date before Chris leaves for all hands.
=== Maintenance Window Checklist ===
[] - @cmjohnson stages the new PDU adjacent to or in the rack, unplugs the failed side of the existing PDU, and plugs in one side of the replacement PDU
[] - @cmjohnson migrates the now de-energized side of the old PDU's plugs to the replacement PDU, returning redundant power to all devices
[] - @cmjohnson de-energizes the remaining side of the old PDU, energizes the replacement PDU fully, and migrates all remaining power to the new PDU
== Maintenance Window Scheduling ==
Primary Date: Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT)
Backup Date: Tuesday, 2019-01-22 @ 07:00 EST (12:00 GMT)
Estimated Duration: up to 2 hours

This is slated for @cmjohnson's morning hours on either Thursday, January 17th or Tuesday, January 22nd. Having it in the EST morning allows the largest work-hours overlap with the greatest number of affected services/groups.
== Servers & Devices in A2-eqiad ==
https://netbox.wikimedia.org/dcim/racks/2/
Network Devices:
asw2-a2-eqiad
asw-a2-eqiad
msw-a2-eqiad
Servers:
=== analytics ===
conf1001 - not used anymore, in decom phase
kafka1012 - please cross-cable one of the three kafka machines in this rack; it doesn't matter which of the 3
kafka1013 - please cross-cable one of the three kafka machines in this rack; it doesn't matter which of the 3
kafka1023 - please cross-cable one of the three kafka machines in this rack; it doesn't matter which of the 3
kafka-jumbo1002 - needs some heads-up (~10 minutes) to gracefully stop kafka on it
an-worker1078 - can go down anytime, but a little heads-up would be good to gracefully shut it down
an-worker1079 - can go down anytime, but a little heads-up would be good to gracefully shut it down
db1107 - shared with Data Persistence - this needs lead time: 1) stop eventlogging, 2) stop replication from db1108, 3) stop mysql gracefully
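The db1107 shutdown order above can be sketched as a dry run. The unit names (`eventlogging`, `mysql`) are assumptions and should be verified on the host before the window:

```shell
#!/bin/sh
# Dry-run sketch of the db1107 shutdown order; service/unit names are assumptions.
run() { echo "+ $*"; }  # print each command instead of executing; drop the echo to run for real

run systemctl stop eventlogging   # 1) stop eventlogging writes
run mysql -e 'STOP SLAVE;'        # 2) stop replication from db1108
run systemctl stop mysql          # 3) stop mysql gracefully
```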
=== dba team systems ===
db1074 - replication slave, #dba team will stop mysql before work and restart after work ends
db1075 - this is a master, cannot lose power
db1079 - replication slave, #dba team will stop mysql before work and restart after work ends
db1080 - replication slave, #dba team will stop mysql before work and restart after work ends
db1081 - replication slave, #dba team will stop mysql before work and restart after work ends
db1082 - replication slave, #dba team will stop mysql before work and restart after work ends
es1011 - replication slave, #dba team will stop mysql before work and restart after work ends
es1012 - replication slave, #dba team will stop mysql before work and restart after work ends
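The per-replica stop step for the #dba hosts above can be sketched as a loop that prints (rather than executes) the commands; the `ssh`/`mysql` invocations are assumptions, and db1075 is deliberately excluded because it is a master:

```shell
#!/bin/sh
# Print the per-replica stop commands for the maintenance window (dry run).
# db1075 is excluded on purpose: it is a master and cannot lose power.
for host in db1074 db1079 db1080 db1081 db1082 es1011 es1012; do
  echo "ssh $host -- 'mysql -e \"STOP SLAVE;\" && systemctl stop mysql'"
done
```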
=== other ===
cloudelastic1001 - not yet in use, can leave in place during pdu swap (no extra precautions needed)
ms-be1019
ms-be1044
ms-be1045
tungsten - role(xhgui::app) - #performance-team - @gilles expects this can go offline without any major issues, but we should check with @krinkle as well.