
b1-eqiad pdu refresh (Thursday 10/10 @11am UTC)
Open, Normal, Public

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B1-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
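
The confirmation steps above can be sketched as a small check that walks the two new towers in order and stops at the first one where any device is not yet confirmed. This is only an illustration: the tower labels `A`/`B`, the function names, and the `has_feed` probe are assumptions, not existing tooling.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed labels for the two new, independent PDU towers.
TOWERS = ("A", "B")


def devices_missing_power(
    devices: Sequence[str],
    tower: str,
    has_feed: Callable[[str, str], bool],
) -> List[str]:
    """Return the devices whose feed from `tower` is not confirmed.

    `has_feed(device, tower)` is a hypothetical probe supplied by the
    caller (for example a manual check or a monitoring lookup).
    """
    return [d for d in devices if not has_feed(d, tower)]


def run_refresh(
    devices: Sequence[str],
    has_feed: Callable[[str, str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Walk both towers in order.

    Returns (tower, missing_devices) for the first tower that still has
    unconfirmed devices, or (None, []) when both towers are confirmed.
    """
    for tower in TOWERS:
        missing = devices_missing_power(devices, tower, has_feed)
        if missing:
            # Do not move on to the next tower until this one is confirmed.
            return tower, missing
    return None, []
```

Stopping at the first tower that has unconfirmed devices mirrors the "once the new PDU tower is confirmed online, move on" rule in the list.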

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-b1-eqiad | asw | @ayounsi | |
| es1014 | es | DBA | @Marostegui to depool it before the maintenance |
| es1013 | es | DBA | @Marostegui to depool it before the maintenance |
| ms-be1022 | ms-be | @fgiunchedi | poweroff / poweron |
| db1084 | db | DBA | @Marostegui to depool it before the maintenance |
| db1083 | db | DBA | @Marostegui to depool it before the maintenance |
| kafka-jumbo1003 | | Analytics | fine to do any time |
| db1077 | db | DBA | test host, nothing to be done |
| db1076 | db | DBA | @Marostegui to depool it before the maintenance |
| db1112 | db | DBA | @Marostegui to depool it before the maintenance |
| logstash1011 | | @fgiunchedi | ok with power loss, nice to have: disable es replication |
| cloudvirt1026 | cloudvirt | cloud-services-team | |
| cloudvirt1025 | cloudvirt | cloud-services-team | |
| dbstore1004 | dbstore | Analytics | |
| cloudvirt1023 | cloudvirt | cloud-services-team | |
| an-coord1001 | | Analytics | fine to do any time but please ping Analytics first |
| dbproxy1014 | dbproxy | DBA | nothing to be done, not active |
| authdna1001 | authdns | Traffic | |
| db1124 | db | DBA | sanitarium host, nothing to be done |
| snapshot1008 | | @ArielGlenn | |
| wdqs1007 | wdqs | Discovery-Search | @Gehel good to go |
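
As a rough aid for preparing the maintenance, the notes in the list above can be filtered to see which hosts need action beforehand. The sketch below copies the device names and notes from the list; the `ROWS` structure and the `hosts_needing` helper are illustrative assumptions, not part of any existing tool.

```python
from typing import List, Sequence, Tuple

# (device, note) pairs copied from the device list; rows without notes are omitted.
ROWS: List[Tuple[str, str]] = [
    ("es1014", "@Marostegui to depool it before the maintenance"),
    ("es1013", "@Marostegui to depool it before the maintenance"),
    ("ms-be1022", "poweroff / poweron"),
    ("db1084", "@Marostegui to depool it before the maintenance"),
    ("db1083", "@Marostegui to depool it before the maintenance"),
    ("kafka-jumbo1003", "fine to do any time"),
    ("db1077", "test host, nothing to be done"),
    ("db1076", "@Marostegui to depool it before the maintenance"),
    ("db1112", "@Marostegui to depool it before the maintenance"),
    ("logstash1011", "ok with power loss, nice to have: disable es replication"),
    ("an-coord1001", "fine to do any time but please ping Analytics first"),
    ("dbproxy1014", "nothing to be done, not active"),
    ("db1124", "sanitarium host, nothing to be done"),
    ("wdqs1007", "@Gehel good to go"),
]


def hosts_needing(keyword: str, rows: Sequence[Tuple[str, str]] = ROWS) -> List[str]:
    """Return devices whose note contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [device for device, note in rows if needle in note.lower()]
```

For example, `hosts_needing("depool")` lists the database hosts that must be depooled before the power change.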

Event Timeline

RobH updated the task description. Jul 9 2019, 12:22 AM
RobH updated the task description.
RobH added subscribers: ayounsi, ArielGlenn, fgiunchedi.
ArielGlenn added a subscriber: hoo. Jul 9 2019, 4:16 AM

Adding @hoo because wikidata entity dumps will be impacted.

elukey updated the task description. Jul 16 2019, 2:20 PM
elukey added a subscriber: elukey.

A heads-up would be good so that I can gracefully stop the daemons on an-coord1001. kafka-jumbo1003 is fine as long as it doesn't risk losing power together with other kafka-jumbo nodes (2 down are tolerable, more probably not).
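
The kafka-jumbo constraint above can be written as a simple guard. This is a minimal sketch: the limit of 2 comes from the comment, while the function name and the other node names in the usage note are hypothetical.

```python
from typing import Iterable


def safe_to_cut_power(host: str, already_down: Iterable[str], max_down: int = 2) -> bool:
    """Return True if taking `host` down keeps the number of down nodes within `max_down`.

    `already_down` lists cluster nodes that are currently down or about to
    lose power. The default of 2 reflects "2 down are tolerable".
    """
    down = set(already_down) | {host}
    return len(down) <= max_down
```

With one other node already down, `safe_to_cut_power("kafka-jumbo1003", ["kafka-jumbo1001"])` is `True`; with two others down (node names here are illustrative), it is `False`.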

Marostegui updated the task description. Jul 23 2019, 7:09 AM
Marostegui added a subscriber: Marostegui.

From the DB side, this can be done anytime.

Marostegui updated the task description. Jul 23 2019, 9:17 AM
fgiunchedi updated the task description. Jul 23 2019, 9:32 AM
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM
RobH removed RobH as the assignee of this task. Aug 14 2019, 4:51 PM
wiki_willy renamed this task from b1-eqiad pdu refresh to b1-eqiad pdu refresh (Thursday 10/10 @11am UTC). Aug 15 2019, 5:33 PM
CDanis triaged this task as Normal priority. Aug 16 2019, 1:01 PM
Marostegui updated the task description. Aug 19 2019, 10:30 AM
Gehel updated the task description. Aug 19 2019, 4:15 PM
Gehel added a subscriber: Gehel.
elukey updated the task description. Tue, Sep 17, 5:58 AM