This task will track the work required to prepare/stage and then swap out the failed PDU tower in a2-eqiad. Details are as follows:
- ps1-a2-eqiad is a dual input (tower A and B combined in a single PDU chassis) with 24 ports per tower.
- ps1-a2-eqiad has had failures on its phases, caused either by the PDU itself failing or by phase imbalance that cannot be corrected given the limited number of power plugs per tower (only 24).
- Chris will swap out the existing/failing ps1-a2-eqiad and put in a spare dual-wide PDU with 42 ports per tower. This isn't as ideal as a brand-new PDU (via T210776), but a new PDU has a 30-day lead time.
- All systems in a2-eqiad have to be reviewed, as downtime could result.
- All precautions will be taken to try to migrate PDUs without downtime, but nothing is a certainty when dealing with the power feeds into our rack.
- - List all systems in a2-eqiad, check with service owners, and schedule a downtime date before Chris leaves for All Hands.
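The listing/downtime step could be sketched roughly as below. The host list is copied from the inventory in this task; `schedule_downtime` is a hypothetical placeholder for however downtime is actually set in Icinga, so treat this as a dry-run sketch rather than real tooling.

```shell
# Sketch only: host list copied from this task; schedule_downtime is a
# hypothetical stand-in for however downtime is really set in Icinga.
schedule_downtime() {
    echo "DOWNTIME $1 2019-01-17T12:00Z for 2h"   # dry run: print, don't act
}

hosts="conf1001 kafka1012 kafka1013 kafka1023 kafka-jumbo1002 \
an-worker1078 an-worker1079 db1107 db1074 db1075 db1079 db1080 \
db1081 db1082 es1011 es1012 cloudelastic1001 ms-be1019 ms-be1044 \
ms-be1045 tungsten"

for h in $hosts; do
    schedule_downtime "$h"
done
```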
Maintenance Window Checklist
- - @Cmjohnson stages the new PDU adjacent to or in the rack, unplugs the failed side of the existing PDU, and plugs in one side of the replacement PDU
- - @Cmjohnson migrates the plugs from the now de-energized side of the old PDU to the replacement PDU, returning redundant power to all devices
- - @Cmjohnson de-energizes the remaining side of the old PDU, fully energizes the replacement PDU, and migrates all remaining power plugs to the new PDU
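Before dropping either feed, it may be worth confirming that each dual-PSU host actually reports both supplies healthy, e.g. by parsing `ipmitool sdr type "Power Supply"` output. A minimal sketch, where the sample output is illustrative and not captured from these hosts:

```shell
# Count supplies reporting "Presence detected" before pulling a feed.
# The sample output below is illustrative only, not from a2-eqiad hosts.
sample='PS Redundancy    | 77h | ok  |  7.1 | Fully Redundant
Status           | 85h | ok  | 10.1 | Presence detected
Status           | 86h | ok  | 10.2 | Presence detected'

healthy_psus() {
    printf '%s\n' "$1" | grep -c 'Presence detected'
}

if [ "$(healthy_psus "$sample")" -ge 2 ]; then
    echo "both PSUs present; safe to drop one feed"
else
    echo "WARNING: fewer than two PSUs detected"
fi
```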
Maintenance Window Scheduling
Primary Date: Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT)
Backup Date: Tuesday, 2019-01-22 @ 07:00 EST (12:00 GMT)
Estimated Duration: Up to 2 hours
Servers & Devices in A2-eqiad
https://netbox.wikimedia.org/dcim/racks/2/
Network Devices:
The primary access switch for this row needs to be cross-cabled, just in case.
asw2-a2-eqiad
asw-a2-eqiad
msw-a2-eqiad
Servers:
analytics
conf1001 - not used anymore, in decom phase
kafka1012 - please cross-cable one of the three Kafka machines in this rack; doesn't matter which of the three
kafka1013 - please cross-cable one of the three Kafka machines in this rack; doesn't matter which of the three
kafka1023 - please cross-cable one of the three Kafka machines in this rack; doesn't matter which of the three
kafka-jumbo1002 - needs some heads-up (about 10 minutes) to gracefully stop Kafka on it
an-worker1078 - can go down anytime, but a little heads-up would be good so it can be gracefully shut down
an-worker1079 - can go down anytime, but a little heads-up would be good so it can be gracefully shut down
db1107 - shared with Data Persistence - this needs lead time: 1) stop eventlogging, 2) stop replication from db1108, 3) stop mysql gracefully
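The db1107 sequence above could be sketched as a dry run. The `eventlogging` unit name, and the assumption that `STOP SLAVE` is issued on db1108, are guesses to confirm with Data Persistence before the window:

```shell
# Dry-run sketch of the db1107 shutdown order from this task. The
# "eventlogging" unit name and running STOP SLAVE on db1108 are assumptions.
run() { echo "WOULD RUN: $*"; }   # swap for real execution on the day

run systemctl stop eventlogging          # 1) stop eventlogging
run ssh db1108 "mysql -e 'STOP SLAVE'"   # 2) stop replication from db1108
run ssh db1107 "systemctl stop mysql"    # 3) stop mysql gracefully
```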
dba team systems
db1074 - replication slave, DBA team will stop mysql before work and restart after work ends
db1075 - no longer a master (T213858); replication slave, DBA team will stop mysql before work and restart after work ends
db1079 - replication slave, DBA team will stop mysql before work and restart after work ends
db1080 - replication slave, DBA team will stop mysql before work and restart after work ends
db1081 - replication slave, DBA team will stop mysql before work and restart after work ends
db1082 - replication slave, DBA team will stop mysql before work and restart after work ends
es1011 - replication slave, DBA team will stop mysql before work and restart after work ends
es1012 - replication slave, DBA team will stop mysql before work and restart after work ends
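The replica pattern above is uniform, so it could be sketched as one dry-run loop. The host list is from this task; the exact stop/start commands the DBA team uses are an assumption:

```shell
# Dry-run sketch for the replication slaves listed above; the exact
# commands the DBA team runs are an assumption, not confirmed.
run() { echo "WOULD RUN: $*"; }   # swap for real execution on the day

replicas="db1074 db1075 db1079 db1080 db1081 db1082 es1011 es1012"

for h in $replicas; do
    run ssh "$h" "mysql -e 'STOP SLAVE'; systemctl stop mysql"
done
# after the maintenance window, reverse the order: start mysql, then START SLAVE
```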
other
cloudelastic1001 - not yet in use; can be left in place during the PDU swap (no extra precautions needed)
ms-be1019 - can go down anytime, please issue a poweroff
ms-be1044 - can go down anytime, please issue a poweroff
ms-be1045 - can go down anytime, please issue a poweroff
tungsten - role(xhgui::app) - Performance-Team - @Gilles/@Krinkle confirm this can stay cabled normally; downtime wouldn't be problematic as long as it's not longer than the window.