a8-eqiad pdu refresh (Thursday 10/17 @11am UTC)
Closed, Resolved, Public

Description

This task tracks the replacement of ps1-eqiad and ps2-eqiad with new PDUs in rack A8-eqiad.

Each server, switch, and router will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

The network racks have two separate existing PDU towers, which will be replaced with two new PDU towers, so this swap is easier than the majority of the row A/B PDU swaps (with their combined old PDU towers).

  • - Schedule downtime for the entire list of switches, routers, and servers.
  • - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm that the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Once the new PDU tower is confirmed online, move on to the next steps.
  • - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm that the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Connect via serial and confirm the serial connection works.
  • - Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • - Update the PDU model in puppet per T233129: https://gerrit.wikimedia.org/r/c/operations/puppet/+/543913

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| cr2-eqiad | router | @ayounsi | |
| asw2-a8-eqiad | asw | @ayounsi | |
| helium | backup server | @akosiaris | can be done at any point in time |
| helium-array | backup server | @akosiaris | can be done at any point in time |
| bohrium | | | |
| db1129 | db | DBA team | @Marostegui to depool this host before the maintenance |
| torrelay1001 | tor relay | | |
| db1117 | db | DBA team | this is a passive slave on misc clusters, nothing to be done |
| labstore1003 (and its 3 arrays) | labstore | cloud-services-team | This is decommissioned, can be done anytime |

Event Timeline

RobH updated the task description. Jul 2 2019, 7:05 PM
RobH added a subscriber: ayounsi.
akosiaris updated the task description. Jul 23 2019, 7:03 AM
akosiaris added a subscriber: akosiaris.

From the DB side, this rack is good to go

RobH moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 24 2019, 2:17 PM
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM
Marostegui updated the task description. Aug 6 2019, 8:07 AM
RobH removed RobH as the assignee of this task. Aug 14 2019, 4:52 PM
wiki_willy renamed this task from a8-eqiad pdu refresh to a8-eqiad pdu refresh (Thursday 9/19 @11am UTC). Aug 15 2019, 5:32 PM
CDanis triaged this task as Normal priority. Aug 16 2019, 1:02 PM

As this rack has one of our two most important routers, I'd like to be around for the maintenance.
11am UTC is 4am Pacific. It would be ideal if it could be pushed to at least 8am Pacific (15:00 UTC).
Otherwise, please make sure Mark or Faidon can be there.
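
The time-zone arithmetic in the comment above can be checked with a short sketch. It assumes the maintenance date is the one in the task title at the time (Thursday 9/19, taken here as 2019-09-19) and that "Pacific" means the `America/Los_Angeles` zone; the helper name `utc_to_pacific` is hypothetical and only for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Assumption: "Pacific" refers to the America/Los_Angeles time zone.
PACIFIC = ZoneInfo("America/Los_Angeles")


def utc_to_pacific(year, month, day, hour):
    """Convert a whole-hour UTC time to Pacific local time (hypothetical helper)."""
    utc_time = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return utc_time.astimezone(PACIFIC)


# Originally scheduled start: 11:00 UTC on the assumed date 2019-09-19.
proposed = utc_to_pacific(2019, 9, 19, 11)
# Requested later start: 15:00 UTC on the same date.
requested = utc_to_pacific(2019, 9, 19, 15)

print(proposed.strftime("%H:%M"))   # 04:00
print(requested.strftime("%H:%M"))  # 08:00
```

On that date Pacific time is on daylight saving (UTC-7), which is why 11:00 UTC lands at 4am and 15:00 UTC at 8am.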

wiki_willy renamed this task from a8-eqiad pdu refresh (Thursday 9/19 @11am UTC) to a8-eqiad pdu refresh (Date TBA). Sep 16 2019, 5:03 PM
wiki_willy assigned this task to Cmjohnson.

Originally scheduled for Thursday 9/19, but will reschedule for a later date, since this is a network rack.

Bstorm updated the task description. Sep 16 2019, 5:05 PM
wiki_willy renamed this task from a8-eqiad pdu refresh (Date TBA) to a8-eqiad pdu refresh (Thursday 10/17 @11am UTC). Sep 30 2019, 5:18 PM

New target date for upgrading the PDUs on this network rack is Thursday 10/17 @11am UTC. @ayounsi will be in Europe this week to oversee, in case any potential issues occur. Thanks, Willy

RobH updated the task description. Oct 11 2019, 8:41 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-17T09:26:32Z] <marostegui> Stop MySQL on db1117 this will generate some haproxy alerts - T227133

Mentioned in SAL (#wikimedia-operations) [2019-10-17T09:37:54Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1129 for PDU work, give some traffic to db1090:3312 meanwhile T227133', diff saved to https://phabricator.wikimedia.org/P9374 and previous config saved to /var/cache/conftool/dbconfig/20191017-093753-marostegui.json

db1129 and db1117 are good to go.

Mentioned in SAL (#wikimedia-operations) [2019-10-17T11:11:06Z] <XioNoX> failover vrrp from cr2-eqiad to cr1-eqiad - T227133

PDU swap completed, all hosts online, Netbox updated.

Mentioned in SAL (#wikimedia-operations) [2019-10-17T13:06:36Z] <XioNoX> rollback failover vrrp from cr2-eqiad to cr1-eqiad - T227133

Cmjohnson updated the task description. Thu, Oct 17, 1:21 PM
RobH claimed this task. Thu, Oct 17, 6:07 PM
RobH updated the task description.
RobH updated the task description. Thu, Oct 17, 6:27 PM
RobH closed this task as Resolved. Thu, Oct 24, 6:43 PM
RobH removed RobH as the assignee of this task.