This task will track the swapping of the PDU tower(s) in rack A3-eqiad.
The current PDU tower is malfunctioning: a short has caused issues on both side B (wholly offline) and parts of side A (one circuit group has defunct outlets).
Chris has onsite spares (2 dual-wide PDU towers with 48 ports per side) to test out and use for replacement in this rack.
Maintenance Window Scheduling
Primary Date: Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT)
Backup Date: Tuesday, 2019-01-22 @ 07:00 EST (12:00 GMT)
Estimated Duration: Up to 2 hours
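As a cross-check of the window times above, here is a minimal sketch that converts both start times to UTC and adds the 2-hour upper bound. It assumes EST is a fixed UTC-5 offset (valid for January dates); the helper name window_utc is illustrative and not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is a fixed UTC-5 offset (no daylight saving in January).
EST = timezone(timedelta(hours=-5), "EST")
MAX_DURATION = timedelta(hours=2)  # "up to 2 hours" from the estimate above


def window_utc(local_start):
    """Return (start, latest_end) of a maintenance window in UTC."""
    start = local_start.astimezone(timezone.utc)
    return start, start + MAX_DURATION


primary = window_utc(datetime(2019, 1, 17, 7, 0, tzinfo=EST))
backup = window_utc(datetime(2019, 1, 22, 7, 0, tzinfo=EST))

for label, (start, end) in (("Primary", primary), ("Backup", backup)):
    print(f"{label}: {start:%Y-%m-%d %H:%M} to {end:%H:%M} UTC")
```

Both windows therefore run from 12:00 to at most 14:00 UTC on their respective dates.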
Maintenance Window Checklist
The following steps must be completed for this swap:
- All servers must be taken offline and powered down for the duration of the migration.
- The old PDU must be removed from the rack, the new PDU installed, and all power migrated over to it.
The circuit breaker for side B of A3-eqiad may also have tripped during the failure, in which case Equinix technicians may need to flip the breaker in the EQ circuit breaker box.
Servers & Devices in A3-eqiad
The following items are in A3-eqiad: https://netbox.wikimedia.org/dcim/racks/3/
Servers (grouped by service owner when possible):
cp1008 - canary host, has no production traffic, can be cleanly shut down and powered back on after the maintenance window.
db1103 - off
db1127 - server not even installed
dbproxy1001 - off
dbproxy1002 - off
dbproxy1003 - off
dbstore1003 - off
pc1004 - Not reachable via ssh, not in use, should be decommissioned (T210969) T213859#4883727.
relforge1001 - clean shutdown in advance of work and power back up afterwards
ganeti1007: The directions for https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node can be used for this work.
graphite1003 (just a spare, powered down)
kubernetes1001 (worker can be drained/powered down prior to maintenance)
prometheus1003 (powered down)
radium - (in decom, powered down)
rdb1005 - @jijiki will be around during the maint window for this system
@RobH synced with @Eevans about the restbase systems. restbase1016 is already offline; the other restbase systems can be logged into via SSH and cleanly shut down just before the maintenance, then powered back up normally after the window.
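The host list above can be summarized as a small pre-maintenance triage. This is a sketch only: the status labels are an illustrative reading of the notes in this task, not an existing inventory format, and the restbase hosts other than restbase1016 are left out because they are not named here. Treating pc1004 as needing no shutdown step is an assumption based on it being unreachable and not in use.

```python
from collections import defaultdict

# Status labels are hypothetical; each one paraphrases the note for that host.
HOST_STATUS = {
    "cp1008": "clean_shutdown",
    "db1103": "already_off",
    "db1127": "not_installed",
    "dbproxy1001": "already_off",
    "dbproxy1002": "already_off",
    "dbproxy1003": "already_off",
    "dbstore1003": "already_off",
    "pc1004": "unreachable",
    "relforge1001": "clean_shutdown",
    "ganeti1007": "special_procedure",      # Ganeti node maintenance directions
    "graphite1003": "already_off",
    "kubernetes1001": "special_procedure",  # drain the worker first
    "prometheus1003": "already_off",
    "radium": "already_off",
    "rdb1005": "owner_present",             # @jijiki on hand during the window
    "restbase1016": "already_off",
}

# Assumption: these statuses need no shutdown step before the swap.
NO_ACTION = {"already_off", "not_installed", "unreachable"}


def group_by_status(status_by_host):
    """Group hostnames by their status label."""
    groups = defaultdict(list)
    for host, status in status_by_host.items():
        groups[status].append(host)
    return dict(groups)


groups = group_by_status(HOST_STATUS)
needs_action = sorted(
    host
    for status, hosts in groups.items()
    if status not in NO_ACTION
    for host in hosts
)
print("Hosts needing a step before the window:", ", ".join(needs_action))
```

Keeping the notes as plain data makes it easy to recheck the list on the day of the window and to add the remaining restbase hosts once they are confirmed.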