
a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC)
Open, Normal, Public

Description

This task will track the replacement of ps1 and ps2 with new PDUs in rack A6-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • - Schedule downtime for the entire list of switches and servers.
  • - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (which is then de-energized).
  • - Confirm the entire list of switches, routers, and servers has had power restored from the new PDU tower.
  • - Once the new PDU tower is confirmed online, move on to the next steps.
  • - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (which is then de-energized).
  • - Confirm the entire list of switches, routers, and servers has had power restored from the new PDU tower.
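The first step (mass-scheduling monitoring downtime for every device in the rack) lends itself to scripting. A minimal sketch, assuming an `icinga-downtime` helper with `-h`/`-d`/`-r` flags — the tool name and flags are assumptions, so substitute your site's actual downtime mechanism:

```shell
#!/bin/sh
# Sketch: emit one downtime command per device in rack A6.
# HOSTS is a subset of the task's device list; extend as needed.
HOSTS="asw2-a6-eqiad pc1007 wtp1027 mc1019 db1066"
DURATION=7200                       # seconds; should cover the maintenance window
REASON="a6-eqiad PDU refresh"

for host in $HOSTS; do
    # Print rather than execute, so the generated list can be reviewed first.
    echo "icinga-downtime -h ${host} -d ${DURATION} -r '${REASON}'"
done
```

Printing the commands instead of running them lets the operator eyeball the full list before committing to a two-hour downtime on a core switch.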

List of routers, switches, and servers

device | role | SRE team | coordination | notes
--- | --- | --- | --- | ---
asw2-a6-eqiad | asw | | @ayounsi |
pc1007 | parsercache | DBA | | can be failed over easily; @Marostegui to depool this host
wtp1027 | parsoid | serviceops | | fine to do at any time
wtp1026 | parsoid | serviceops | | fine to do at any time
wtp1025 | parsoid | serviceops | | fine to do at any time
an-master1001 | | Analytics | |
dbproxy1013 | dbproxy | DBA | | not active
elastic1045 | cirrus-search | Discovery-Search | @Gehel | good to go
elastic1044 | cirrus-search | Discovery-Search | @Gehel | good to go
elastic1048 | cirrus-search | Discovery-Search | @Gehel | good to go
mc1023 | mc | serviceops | @elukey | will need to be depooled in advance
mc1022 | mc | serviceops | @elukey | will need to be depooled in advance
mc1021 | mc | serviceops | @elukey | will need to be depooled in advance
mc1020 | mc | serviceops | @elukey | will need to be depooled in advance
mc1019 | mc | serviceops | @elukey | will need to be depooled in advance
aqs1007 | | Analytics | |
weblog1001 | | | |
restbase1021 | restbase | | @jijiki | ok with power loss
labsdb1012 | labsdb | DBA | |
db1066 | db | DBA | | primary s2 db master, cannot lose power
db1116 | db | DBA | | backup source, nothing to be done
db1115 | db | DBA | | tendril host, nothing to be done
labmon1002 | labmon | | |
druid1004 | | Analytics | |
wdqs1004 | wdqs | Discovery-Search | @Gehel | good to go
ores1001 | ores | | @akosiaris | fine to do at any time
restbase-dev1004 | | | | can be done at any time
cloudcontrol1003 | | | |
mw1312 | mw | serviceops | | fine to do at any time outside of deployment windows
mw1311 | mw | serviceops | | fine to do at any time outside of deployment windows
mw1310 | mw | serviceops | | fine to do at any time outside of deployment windows
mw1309 | mw | serviceops | | fine to do at any time outside of deployment windows
mw1308 | mw | serviceops | | fine to do at any time outside of deployment windows
mw1307 | mw | serviceops | | fine to do at any time outside of deployment windows
ganeti1006 | ganeti node | | @akosiaris | will need to be emptied in advance
db1096 | db | DBA | | @Marostegui to depool this host

Event Timeline

RobH updated the task description. (Jul 3 2019, 9:52 PM)
RobH added subscribers: ayounsi, akosiaris, fgiunchedi.
elukey updated the task description. (Jul 16 2019, 2:12 PM)
elukey added a subscriber: elukey.

Analytics side: if possible, I'd need some heads-up so I can force a failover of an-master1001.

Memcached side: we have 5 mc10XX shards in the same rack; losing all of them could be a big problem with the current configuration of mcrouter. Explicitly adding @Joe and @jijiki to work out how to handle this.
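For context on the concern above: one generic mitigation pattern in mcrouter is a `FailoverRoute` that falls back to a spare ("gutter") pool when the primary pool errors out. A minimal config sketch of that idea — this is not WMF's actual configuration, and the gutter host `mc-gutter1001` is hypothetical:

```json
{
  "pools": {
    "rack-a6": {
      "servers": [
        "mc1019:11211", "mc1020:11211", "mc1021:11211",
        "mc1022:11211", "mc1023:11211"
      ]
    },
    "gutter": {
      "servers": ["mc-gutter1001:11211"]
    }
  },
  "route": {
    "type": "FailoverRoute",
    "children": ["PoolRoute|rack-a6", "PoolRoute|gutter"]
  }
}
```

Whether a fallback pool like this exists, and whether it could absorb the load of five shards going dark at once, is exactly the question being put to @Joe and @jijiki.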

akosiaris updated the task description. (Jul 23 2019, 7:00 AM)
akosiaris added a subscriber: MoritzMuehlenhoff.

# Empty ganeti1006 by migrating its instances to their secondary nodes:
sudo gnt-node migrate -f ganeti1006

This rack contains an active primary db master (db1066); it would need to be failed over if we are not confident the host will keep power.

Marostegui updated the task description. (Jul 23 2019, 7:06 AM)
Joe updated the task description. (Jul 23 2019, 7:09 AM)
fgiunchedi updated the task description. (Jul 23 2019, 9:29 AM)
RobH moved this task from Backlog to High Priority Task on the ops-eqiad board. (Jul 24 2019, 2:17 PM)
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. (Jul 26 2019, 1:37 PM)
RobH removed RobH as the assignee of this task. (Aug 14 2019, 4:53 PM)
wiki_willy renamed this task from "a6-eqiad pdu refresh" to "a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC)". (Aug 15 2019, 5:31 PM)
CDanis triaged this task as Normal priority. (Aug 16 2019, 1:02 PM)
Marostegui updated the task description. (Mon, Aug 19, 10:36 AM)
fgiunchedi updated the task description. (Mon, Aug 19, 10:43 AM)
Gehel updated the task description. (Mon, Aug 19, 4:17 PM)
Gehel added a subscriber: Gehel.

@Marostegui - I would say just go for it and fail it over in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to take the extra precautionary measures. Thanks, Willy

I will get them scheduled, planned, etc. Thanks