This task tracks the replacement of ps1 and ps2 with new PDUs in rack [[ https://netbox.wikimedia.org/dcim/racks/16/ | B8-eqiad ]].
Each server & switch will need potential downtime scheduled, since this is a live power change of the PDU towers.
This rack has a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.
[] - Schedule downtime for the entire list of switches and servers.
[] - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
[] - Confirm the entire list of switches, routers, and servers has had power restored from the new PDU tower.
[] - Once the new PDU tower is confirmed online, move on to the next steps.
[] - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
[] - Confirm the entire list of switches, routers, and servers has had power restored from the new PDU tower.
[] - Connect via serial / confirm the serial connection works.
[] - Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
[] - Update the PDU model in Puppet per T233129.
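The two "confirm power restored" steps above can be spot-checked with a short script. This is only a hedged sketch: the host list is abridged, and using SSH (port 22) reachability as a proxy for "powered on and booted" is an assumption, not part of the procedure.

```python
import socket

# Abridged host list for illustration; the full list is in the table below.
HOSTS = ["ganeti1018", "gerrit1001", "cloudvirt1030", "db1132"]

def ssh_reachable(host: str, timeout: float = 3.0) -> bool:
    """Treat an open SSH port (22) as a proxy for the host being back up."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

def partition(hosts, is_up=ssh_reachable):
    """Split hosts into (up, down); the predicate is injectable for testing."""
    up = [h for h in hosts if is_up(h)]
    down = [h for h in hosts if h not in up]
    return up, down
```

Run it after each tower swap; anything in `down` needs hands-on attention before moving on to the next step.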
== List of routers, switches, and servers ==
| device | role | SRE team coordination | recommended action during maintenance |
| --- | --- | --- | --- |
| asw-b8-eqiad | asw | @ayounsi | ensure this doesn't go offline, as it will take the entire rack's network offline |
| ganeti1018 | ganeti host | #serviceops | needs to be emptied of VMs beforehand |
| gerrit1001 | spare | | fine to do at any time |
| cloudvirt1030 | hypervisor | #cloud-services-team | lots of VMs, please handle with care |
| db1132 | m2 master | #dba | this host is the m2 master, which holds some internal services; ensure it doesn't go offline, though if it does there is an automatic failover via proxies |
| pc1008 | parsercache host | #dba | #dba to depool it |
| restbase1024 | restbase | #serviceops, #services | fine to do at any time |
| an-master1002 | | #analytics | fine to do at any time |
| dbproxy1015 | db proxy | #dba | not in use |
| graphite1004 | | @fgiunchedi | no action needed; if power is lost and can't be restored quickly we'll switch to codfw |
| rdb1009 | redis master | #serviceops | this will need coordination? |
| notebook1003 | | | |
| db1119 | db host | #dba | #dba to depool it |
| db1113 | db host | #dba | #dba to depool it |
| cloudservices1003 | DNS | #cloud-services-team | fine to do at any time |
| mwmaint1002 | | | this is the primary MW maintenance system in eqiad; perhaps we should halt deployments during this time? |
| labpuppetmaster1001 | spare | #cloud-services-team | good to go; host is being decommissioned |
| ores1004 | ORES | #serviceops | fine to do at any time |
| wtp1036 | parsoid | #serviceops | fine to do at any time |
| wtp1035 | parsoid | #serviceops | fine to do at any time |
| wtp1034 | parsoid | #serviceops | fine to do at any time |
| dumpsdata1001 | dumps data server | @arielglenn | please coordinate |
| analytics1063 | | #analytics | fine to do at any time |
| analytics1062 | | #analytics | fine to do at any time |
| analytics1061 | | #analytics | fine to do at any time |
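For scheduling the coordination pings, the table above can be folded into a per-team checklist. A minimal sketch, with the device/team pairs copied (abridged) from the table:

```python
from collections import defaultdict

# (device, coordinating team/person) pairs from the table above; abridged.
DEVICES = [
    ("asw-b8-eqiad", "@ayounsi"),
    ("ganeti1018", "#serviceops"),
    ("cloudvirt1030", "#cloud-services-team"),
    ("db1132", "#dba"),
    ("pc1008", "#dba"),
    ("db1119", "#dba"),
]

def by_team(devices):
    """Map each coordinating team/person to the devices they need to sign off on."""
    teams = defaultdict(list)
    for device, team in devices:
        teams[team].append(device)
    return dict(teams)
```

Printing the result gives one line per team to paste into the coordination pings.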