a7-eqiad pdu refresh
Open, Normal, Public

Description

This task will track the replacement of ps1 and ps2 with new PDUs in rack A7-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
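
The two-pass sequence above can be sketched as a small simulation. This is only an illustrative model, not tooling used for this task: the `Device` class, the `migrate_side` and `all_powered` helpers, and the host names are all hypothetical. It assumes each device has at most one cord per side, and it flags any device that has no other energized feed while its cord is being moved.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical model of a racked device with one power cord per side."""
    name: str
    # Maps side ("A"/"B") -> PDU the cord is plugged into ("old"/"new").
    # A missing side means the device has no cord on that side.
    cords: dict = field(default_factory=lambda: {"A": "old", "B": "old"})


def migrate_side(devices, side, energized):
    """Move every cord on `side` from the old PDU to the new tower.

    `energized` maps (pdu, side) -> bool and is updated in place.
    Returns the names of devices that had no other energized feed
    while their cord on `side` was being moved.
    """
    energized[("old", side)] = False  # old side is de-energized
    energized[("new", side)] = True   # new tower is wired up and live
    at_risk = []
    for dev in devices:
        if side not in dev.cords:
            continue
        others = [(pdu, s) for s, pdu in dev.cords.items() if s != side]
        if not any(energized.get(key, False) for key in others):
            at_risk.append(dev.name)
        dev.cords[side] = "new"
    return at_risk


def all_powered(devices, energized):
    """Confirmation step: every device has at least one energized feed."""
    return all(
        any(energized.get((pdu, s), False) for s, pdu in dev.cords.items())
        for dev in devices
    )


# Hypothetical hosts: one fed from both sides, one with a single cord.
devices = [Device("dual-feed-host"), Device("single-feed-host", cords={"A": "old"})]
energized = {("old", "A"): True, ("old", "B"): True}

for side in ("A", "B"):
    risky = migrate_side(devices, side, energized)
    print(f"side {side}: devices without a redundant feed during the move: {risky}")
    # Do not start the next tower until this check passes.
    assert all_powered(devices, energized)
```

In this model, a device fed from both sides keeps running on the untouched side during each pass, which is why the confirmation step sits between the two towers.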

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-a7-eqiad | asw | @ayounsi | |
| sodium | | | |
| ms-fe1005 | ms-fe | @fgiunchedi | needs depool |
| an-worker1082 | | Analytics | |
| an-worker1081 | | Analytics | |
| cp1077 | cp | Traffic | |
| cp1078 | cp | Traffic | |
| ms-be1040 | ms-be | @fgiunchedi | poweroff / poweron |
| kafka-main1001 | | @herron | |
| dbprov1001 | db provisioning | DBA | |
| lvs1013 | lvs | Traffic | |
| mw1283 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1282 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1281 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1280 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1279 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1278 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1277 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1276 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1275 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1274 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1273 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1272 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1271 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1270 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1269 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1268 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1267 | mw | serviceops | can be done at any point in time out of deployment window |
| ms-be1030 | ms-be | @fgiunchedi | poweroff / poweron |
| ms-be1029 | ms-be | @fgiunchedi | poweroff / poweron |
| ms-be1028 | ms-be | @fgiunchedi | poweroff / poweron |

Event Timeline

RobH updated the task description. Jul 3 2019, 10:06 PM
RobH added subscribers: ayounsi, fgiunchedi.
elukey updated the task description. Jul 16 2019, 2:22 PM
elukey added a subscriber: herron.
BBlack added a subscriber: BBlack. Edited Jul 22 2019, 8:05 PM

The Traffic nodes cp1077 + cp1078 can be depooled the usual way, but lvs1013 needs some special care. Someone from Traffic should handle and monitor that just in case (basically we need to manually disable puppet and stop pybal a few minutes in advance of the work, verify traffic moving correctly to lvs1016, and then put everything back to normal afterwards).
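
The ordering in the comment above (disable puppet, stop pybal, verify traffic on the backup, then restore afterwards) can be sketched as a small state machine that refuses to mark the host safe for power work until failover has been confirmed. This is an illustrative model only: the `Step` enum and `LvsMaintenance` class are hypothetical names, no real commands are run, and the traffic check is a boolean supplied by whoever is monitoring.

```python
from enum import Enum, auto


class Step(Enum):
    NORMAL = auto()
    PUPPET_DISABLED = auto()
    PYBAL_STOPPED = auto()
    FAILOVER_VERIFIED = auto()


# Forward order of the manual steps described in the comment.
ORDER = [Step.NORMAL, Step.PUPPET_DISABLED, Step.PYBAL_STOPPED, Step.FAILOVER_VERIFIED]


class LvsMaintenance:
    """Hypothetical tracker for the manual steps on one load balancer."""

    def __init__(self, primary: str, backup: str) -> None:
        self.primary = primary
        self.backup = backup
        self.step = Step.NORMAL

    def advance(self, traffic_on_backup: bool = False) -> Step:
        """Move to the next step; verification needs an explicit confirmation."""
        idx = ORDER.index(self.step)
        if idx == len(ORDER) - 1:
            raise RuntimeError("already verified; call restore() after the work")
        nxt = ORDER[idx + 1]
        if nxt is Step.FAILOVER_VERIFIED and not traffic_on_backup:
            raise RuntimeError(f"traffic not yet confirmed on {self.backup}")
        self.step = nxt
        return nxt

    def safe_for_power_work(self) -> bool:
        return self.step is Step.FAILOVER_VERIFIED

    def restore(self) -> None:
        """Put everything back to normal after the power work."""
        self.step = Step.NORMAL


maint = LvsMaintenance("lvs1013", "lvs1016")
maint.advance()                         # puppet disabled
maint.advance()                         # pybal stopped
maint.advance(traffic_on_backup=True)   # traffic confirmed on the backup
print(maint.safe_for_power_work())
maint.restore()
```

The point of the explicit confirmation flag is that the power work should be gated on an observed traffic shift, not on having merely stopped the service.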

BBlack updated the task description. Jul 22 2019, 8:09 PM

(task desc edited for correct cp nodes: this rack has 77/78, not 76/77)

elukey updated the task description. Jul 23 2019, 6:43 AM
elukey added a subscriber: elukey.

OK for the Analytics nodes: they are Hadoop workers that can go down without horrible consequences.

akosiaris updated the task description. Jul 23 2019, 7:02 AM

From the DB side this rack is good to go

Joe updated the task description. Jul 23 2019, 7:12 AM
fgiunchedi updated the task description. Jul 23 2019, 9:30 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:04:19Z] <bblack> depool cp1077 + cp1088 - T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:05:34Z] <bblack> lvs1013 - disable puppet and stop pybal - T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:13:17Z] <robh> started depooling servers in a7-eqiad for pdu work via T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:53:18Z] <robh> mw1271 had power loss event due to pdu swap via T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T19:10:36Z] <bblack> repool cp1077 + cp1078 - T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T19:11:27Z] <bblack> repool lvs1013 - T227143

RobH moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 24 2019, 2:17 PM
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM
RobH removed RobH as the assignee of this task. Aug 14 2019, 4:53 PM
CDanis triaged this task as Normal priority. Aug 16 2019, 1:02 PM

This task seems to be done.

@Agusbou2015: Why? Please always elaborate why when adding comments.

While fixing the phase check for the new PDUs today, I noticed that tower B for ps1-a7-eqiad shows unknown while tower A is fine: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-a7-eqiad. Perhaps missing config? All other PDUs were reported OK once monitoring was fixed, so I'm assuming there's something different here on the PDU side.