
a3-eqiad pdu refresh
Closed, ResolvedPublic

Description

This task tracks the replacement of the ps1 and ps2 PDUs with new PDUs in rack A3-eqiad.

Downtime Window: 2019-07-23 @ 14:05 GMT. Expected window of 1.5 hours maximum. (first PDU swap took less than an hour.)

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and B side), while the new PDUs have independent A and B towers.

  • schedule downtime for the entire list of switches and servers
  • carefully unmount the existing PDU, KEEPING SYSTEMS PLUGGED IN AND POWERED ON UNLESS STATED OTHERWISE
  • set the old PDU aside in the rack, still energized, and remove the old mounting brackets
  • install the new mounting brackets and mount BOTH new PDU towers
  • wire up the inner of the two towers, energize it, and move power connections over to it from the existing/old PDU tower; using the tower closest to the servers first makes re-wiring power easier
  • confirm the entire list of switches, routers, and servers has had its power restored from the new PDU tower
  • once the new PDU tower is confirmed online, move on to the next steps
  • wire up the remaining tower, energize it, and move the remaining power connections over to it from the existing/old PDU tower (now de-energized)
  • confirm the entire list of switches, routers, and servers has had its power restored from the new PDU towers
  • issue with elastic1031; @Cmjohnson is making a follow-up task
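
The downtime step above could be scripted roughly as follows. This is a hedged sketch: `downtime_host` is a hypothetical stand-in for whatever monitoring tooling is actually used (e.g. an Icinga downtime call), and here it only records each request so the loop is self-contained; the duration matches the announced 1.5 hour window.

```shell
# Hypothetical stand-in for the real downtime tooling; it just records
# the request so this sketch runs without a monitoring server.
downtime_host() {
    echo "downtime: $1 for $2"
}

# A few of the hosts from the rack list below, as an illustration.
hosts="asw2-a3-eqiad analytics1060 elastic1031 restbase1016 ganeti1007"
for h in $hosts; do
    downtime_host "$h" "90m"
done
```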

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-a3-eqiad | asw | @ayounsi | |
| analytics1060 | analytics | Analytics | |
| analytics1059 | analytics | Analytics | |
| analytics1057 | analytics | Analytics | |
| analytics1056 | analytics | Analytics | |
| analytics1055 | analytics | Analytics | |
| analytics1054 | analytics | Analytics | |
| analytics1052 | analytics | Analytics | |
| elastic1031 | elastic | Discovery-Search | |
| elastic1030 | elastic | Discovery-Search | |
| logstash1010 | | observability | ok with power loss, nice to have: disable es replication |
| cloudservices1004 | | cloud-services-team | |
| restbase1016 | | @fgiunchedi | ok with power loss |
| kubernetes1001 | kubernetes | serviceops | |
| rdb1005 | misc redis | serviceops | ok with power loss |
| restbase1019 | | @fgiunchedi | ok with power loss |
| restbase1011 | | @fgiunchedi | ok with power loss |
| restbase1010 | | @fgiunchedi | ok with power loss |
| graphite1003 | | | awaiting decom |
| relforge1001 | | | |
| db1103 | db | DBA | |
| dbproxy1003 | dbproxy | DBA | |
| elastic1035 | elastic | Discovery-Search | |
| elastic1034 | elastic | Discovery-Search | |
| elastic1033 | elastic | Discovery-Search | |
| elastic1032 | elastic | Discovery-Search | |
| cp1008 | cp | Traffic | |
| dbstore1003 | dbstore | Analytics | |
| prometheus1003 | | observability | ok with power loss |
| ganeti1007 | ganeti host | @akosiaris | host will need to be emptied in advance |
| dbproxy1001 | dbproxy | DBA | |
| dbproxy1002 | dbproxy | DBA | |
| db1127 | db | DBA | |
| radium | | | |

Event Timeline

RobH created this task.Jul 2 2019, 7:59 PM
RobH updated the task description. (Show Details)Jul 2 2019, 8:37 PM
RobH added subscribers: ayounsi, akosiaris, fgiunchedi.
RobH updated the task description. (Show Details)Jul 9 2019, 12:16 AM
elukey added a subscriber: elukey.Jul 16 2019, 10:00 AM

All the analytics nodes are Hadoop workers, so it is not a big deal if they lose power.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T04:43:46Z] <marostegui> Failover m1 from dbproxy1001 to dbproxy1006 T227139

@RobH I have failed over dbproxy1001 to dbproxy1006 so this rack is good to go from the DB point of view.

akosiaris updated the task description. (Show Details)Jul 23 2019, 6:44 AM
akosiaris added a subscriber: MoritzMuehlenhoff. (Edited) Jul 23 2019, 6:49 AM

sudo gnt-node migrate -f ganeti1007
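
For context, `gnt-node migrate -f` live-migrates every instance whose primary node is the given host over to its secondary, emptying the node ahead of the power work. A hedged sketch of the drain plus a follow-up check (the `gnt-*` lines are shown as comments since they need a live Ganeti cluster; the `pinst_cnt` field name is assumed from stock Ganeti):

```shell
NODE=ganeti1007
# sudo gnt-node migrate -f "$NODE"              # live-migrate all primary instances away
# sudo gnt-node list -o name,pinst_cnt "$NODE"  # pinst_cnt should read 0 once the node is empty
echo "drain sketch for $NODE"
```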

fgiunchedi updated the task description. (Show Details)Jul 23 2019, 9:14 AM
Marostegui updated the task description. (Show Details)Jul 23 2019, 9:16 AM

restbase / logstash / graphite / prometheus hosts should be fine in the event of power loss; if feeling nice, restbase and prometheus should be depooled. For the logstash host we could disable es replication beforehand and re-enable it afterwards, to avoid shuffling data around on power loss.
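
The "disable es replication" idea could be sketched against the stock Elasticsearch cluster-settings API; the endpoint and exact procedure here are assumptions for illustration, not the commands actually run. Restricting shard allocation to primaries before the window avoids a full shard reshuffle if the node reboots, and setting it back to "all" afterwards restores normal replication:

```shell
ES=http://localhost:9200   # placeholder endpoint, not the real logstash ES cluster address
disable_payload='{"transient":{"cluster.routing.allocation.enable":"primaries"}}'
enable_payload='{"transient":{"cluster.routing.allocation.enable":"all"}}'

# Before the PDU work (shown as comments; needs a live cluster):
#   curl -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d "$disable_payload"
# ... power work ...
#   curl -XPUT "$ES/_cluster/settings" -H 'Content-Type: application/json' -d "$enable_payload"
echo "$disable_payload"
```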

fgiunchedi updated the task description. (Show Details)Jul 23 2019, 9:27 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-23T12:01:07Z] <akosiaris> empty ganeti1007 from running instances. T227139

Mentioned in SAL (#wikimedia-operations) [2019-07-23T12:02:11Z] <akosiaris> drain kubernetes1001. T227139
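
Draining kubernetes1001 would look roughly like the stock upstream kubectl workflow; this is a sketch under that assumption, with the `kubectl` calls shown as comments since they need a live cluster:

```shell
NODE=kubernetes1001
# kubectl drain "$NODE" --ignore-daemonsets   # cordon the node and evict its pods
# ... PDU work ...
# kubectl uncordon "$NODE"                    # let the scheduler place pods on it again
echo "drain/uncordon sketch for $NODE"
```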

RobH added a comment.Tue, Jul 23, 12:04 PM

FYI: I pinged both Alex and Filippo to drain the respective servers they mention above in anticipation of swapping the PDUs in this rack at 10:00 Eastern time.

A3 was originally a DB rack and has an older PDU model with fewer plugs than the other remaining PDUs (with the exception of networking racks) in rows A/B. It is also fairly sparsely populated at this time, so it is ideal to swap.

RobH updated the task description. (Show Details)Tue, Jul 23, 12:05 PM
RobH triaged this task as High priority.Tue, Jul 23, 12:08 PM
RobH updated the task description. (Show Details)

restbase / logstash / graphite / prometheus hosts should be fine in the event of power loss,

This is graphite1003, the old server pending decommission; the currently active one in eqiad (graphite1004) is in a different rack.

jijiki updated the task description. (Show Details)Tue, Jul 23, 1:09 PM
jijiki updated the task description. (Show Details)
jijiki added a subscriber: jijiki.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T13:45:39Z] <godog> depool restbase1016 restbase1019 restbase1011 restbase1010 prometheus1003 ahead of PDU work - T227139

Mentioned in SAL (#wikimedia-operations) [2019-07-23T14:14:28Z] <robh> a3-eqiad pdu swap taking place now via T227139

RobH added a comment.Tue, Jul 23, 3:02 PM

All of the power has been migrated, and we are now setting up the networking for the new PDUs.

RobH closed this task as Resolved.Tue, Jul 23, 3:16 PM
RobH updated the task description. (Show Details)

All done. Elastic1031 has a PSU issue, and we lost power to dbproxy1003 (it was not in service) during this migration.

Mentioned in SAL (#wikimedia-operations) [2019-07-23T16:22:11Z] <godog> pool prometheus1003 - T227139

Mentioned in SAL (#wikimedia-operations) [2019-07-25T09:21:25Z] <marostegui> Failover m1 from dbproxy1006 to dbproxy1001 - T227139