This task will track the swapping of the PDU tower(s) in rack A3-eqiad.
The current PDU tower is malfunctioning: a short has caused issues on both side B (wholly offline) and part of side A (one circuit group has defunct outlets).
Chris has on-site spares (2 dual-wide PDU towers with 48 ports per side) to test and use as the replacement in this rack.
== Maintenance Window Scheduling ==
Primary Date: Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT)
Backup Date: Tuesday, 2019-01-22 @ 07:00 EST (12:00 GMT)
Estimated Duration: Up to 2 hours
=== Maintenance Window Checklist ===
The following steps must be completed for this swap:
[] - all servers in the rack must be taken offline and powered down for the duration of the migration (see the shutdown sketch below)
[] - the old PDU must be removed from the rack, the new PDU installed, and all power migrated over to it
Side B of A3-eqiad may also have had its circuit breaker tripped during the failure; if so, Equinix technicians may need to reset the breaker in the Equinix circuit breaker box.
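Below is a minimal sketch of the clean-shutdown step, assuming plain SSH access and sudo on each host; the host list is abbreviated, and hosts needing special handling (db1103, dbproxy1003, ganeti1007, the elastic hosts) should be dealt with per the per-service notes further down before being powered off.

```
# Hedged sketch: cleanly power down each remaining host over SSH just before
# the window. Host list abbreviated; extend with everything in the rack.
for host in analytics1052 analytics1053 restbase1010 restbase1011; do
  ssh "${host}.eqiad.wmnet" 'sudo poweroff'
done
```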
== Servers & Devices in A3-eqiad ==
The following servers and devices are in A3-eqiad: https://netbox.wikimedia.org/dcim/racks/3/
Servers (grouped by service owner when possible):
Analytics:
analytics1052
analytics1053
analytics1054
analytics1055
analytics1056
analytics1057
analytics1059
analytics1060
Cloud:
cloudservices1004
Traffic:
cp1008 - canary host with no production traffic; can be cleanly shut down and powered back on after the maintenance window.
DBA:
db1103 - needs to be depooled and MySQL stopped before the maintenance (see the sketch after this list)
db1127 - not in use
dbproxy1001 - standby host
dbproxy1002 - standby host
dbproxy1003 - needs to be failed over to dbproxy1008
dbstore1003 - not in use
pc1004 - not in use, should be decommissioned (T210969)
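A rough sketch of the db1103 step noted above, assuming the host has already been depooled via the usual MediaWiki configuration change and that the service unit is named `mariadb` (verify on the host); this is illustrative, not the exact runbook.

```
# Hedged sketch for db1103: once depooled, stop replication and MariaDB
# cleanly, then power the host off. Service and role names are assumptions.
ssh db1103.eqiad.wmnet '
  sudo mysql -e "STOP SLAVE;"    # halt replication (assumes a replica role)
  sudo systemctl stop mariadb    # clean MariaDB shutdown
  sudo poweroff
'
```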
Discovery:
>>! In T213859#4882650, @Gehel wrote:
> For `elastic103[0-5]`, we should be fine just shutting them down. The theory is that we should be able to lose a full row and not worry too much about it.
>
> That being said, 6 servers is a sizable portion of the cluster, I'd like to be around when that happens so that I can keep an eye on things.
>
> Note: the Icinga "ElasticSearch health check for shards" is going to raise an alert if not silenced (not paging). I don't think any other alert should be raised, but we'll see.
elastic1030
elastic1031
elastic1032
elastic1033
elastic1034
elastic1035
relforge1001 - clean shutdown in advance of the work; power back up afterwards
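Per @Gehel's note above, a small sketch for the elastic hosts: after downtiming the "ElasticSearch health check for shards" alert in Icinga, cluster status can be watched from any elastic node that stays up. Port 9200 is the Elasticsearch default and an assumption here.

```
# Hedged sketch: watch cluster health before, during and after the window.
# Expect "yellow" while elastic103[0-5] are down, "green" once they rejoin.
curl -s http://localhost:9200/_cluster/health?pretty
```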
Misc:
ganeti1007 - the directions at https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node can be used for this work (see the sketch after this list)
graphite1003
kubernetes1001
prometheus1003
radium - tor relay
rdb1005
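A sketch of the ganeti1007 drain referenced above, meant to follow the linked wikitech procedure rather than replace it; exact commands and flags should be checked against that page, and they are run from the Ganeti cluster master.

```
# Hedged sketch: live-migrate primary instances off ganeti1007, confirm the
# node is empty, then power it off. Run from the Ganeti cluster master.
sudo gnt-node migrate -f ganeti1007.eqiad.wmnet
sudo gnt-instance list -o name,pnode | grep ganeti1007   # expect no output
ssh ganeti1007.eqiad.wmnet 'sudo poweroff'
```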
Services:
@robh synced with @Eevans about these. restbase1016 is already offline. The other restbase systems can be logged into via SSH and cleanly shut down just before the maintenance, then powered back up normally after the window (see the sketch after this list).
restbase1010
restbase1011
restbase1016
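A minimal sketch of the post-window power-up for the restbase hosts (and anything else powered off above), assuming IPMI over LAN is reachable on the usual .mgmt interfaces; the management hostnames and user are illustrative, and IPMI_PASSWORD must be exported for -E.

```
# Hedged sketch: power hosts back on out-of-band once the new PDU is live.
for host in restbase1010 restbase1011 restbase1016; do
  ipmitool -I lanplus -H "${host}.mgmt.eqiad.wmnet" -U root -E chassis power on
done
```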