a7-eqiad pdu refresh
Closed, ResolvedPublic

Description

This task will track the replacement of ps1 and ps2 with new PDUs in rack A7-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize, and relocate power to it from existing/old pdu tower (now de-energized).
  • Confirm entire list of switches, routers, and servers have had their power restored from the new pdu tower.
  • Once new PDU tower is confirmed online, move on to next steps.
  • Wire up remaining tower, energize, and relocate power to it from existing/old pdu tower (now de-energized).
  • Confirm entire list of switches, routers, and servers have had their power restored from the new pdu tower.
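
The confirmation gate between the two tower swaps can be sketched as a simple check. This is only an illustration under assumptions: the function names are hypothetical, the device list is a subset of the table below, and how a device is actually "confirmed" (for example by checking its power supply status by hand) is outside the sketch.

```python
from typing import Iterable, List

# Subset of the rack A7 device list from this task, for illustration only.
RACK_A7_SAMPLE = ["asw2-a7-eqiad", "sodium", "ms-fe1005", "cp1077", "cp1078", "lvs1013"]


def pending_confirmation(expected: Iterable[str], confirmed: Iterable[str]) -> List[str]:
    """Return devices not yet confirmed on the new tower, in list order."""
    confirmed_set = set(confirmed)
    return [name for name in expected if name not in confirmed_set]


def may_proceed(expected: Iterable[str], confirmed: Iterable[str]) -> bool:
    """True only when every expected device has been confirmed."""
    return not pending_confirmation(expected, confirmed)


if __name__ == "__main__":
    # Hypothetical state: two devices checked so far after the first tower swap.
    checked = ["asw2-a7-eqiad", "cp1077"]
    print("still to confirm:", pending_confirmation(RACK_A7_SAMPLE, checked))
    print("ok to start second tower:", may_proceed(RACK_A7_SAMPLE, checked))
```

The point of the gate is that the second tower is only touched once the pending list for the first tower is empty.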

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| asw2-a7-eqiad | asw | @ayounsi | |
| sodium | | | |
| ms-fe1005 | ms-fe | @fgiunchedi | needs depool |
| an-worker1082 | | Analytics | |
| an-worker1081 | | Analytics | |
| cp1077 | cp | Traffic | |
| cp1078 | cp | Traffic | |
| ms-be1040 | ms-be | @fgiunchedi | poweroff / poweron |
| kafka-main1001 | | @herron | |
| dbprov1001 | db provisioning | DBA | |
| lvs1013 | lvs | Traffic | |
| mw1283 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1282 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1281 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1280 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1279 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1278 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1277 | mw | serviceops | can be done at any point in time out of deployment windows |
| mw1276 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1275 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1274 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1273 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1272 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1271 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1270 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1269 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1268 | mw | serviceops | can be done at any point in time out of deployment window |
| mw1267 | mw | serviceops | can be done at any point in time out of deployment window |
| ms-be1030 | ms-be | @fgiunchedi | poweroff / poweron |
| ms-be1029 | ms-be | @fgiunchedi | poweroff / poweron |
| ms-be1028 | ms-be | @fgiunchedi | poweroff / poweron |

Event Timeline

elukey added a subscriber: herron.

The Traffic nodes cp1077 + cp1078 can be depooled the usual way, but lvs1013 needs some special care. Someone from Traffic should handle and monitor that just in case (basically we need to manually disable puppet and stop pybal a few minutes in advance of the work, verify traffic moving correctly to lvs1016, and then put everything back to normal afterwards).

(task desc edited for correct cp nodes: this rack has 77/78, not 76/77)

elukey added a subscriber: elukey.

OK for the Analytics nodes: they are Hadoop workers that can go down without horrible consequences.

From the DB side this rack is good to go

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:05:34Z] <bblack> lvs1013 - disable puppet and stop pybal - T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:13:17Z] <robh> started depooling servers in a7-eqiad for pdu work via T227143

Mentioned in SAL (#wikimedia-operations) [2019-07-23T18:53:18Z] <robh> mw1271 had power loss event due to pdu swap via T227143

RobH removed RobH as the assignee of this task. Aug 14 2019, 4:53 PM
CDanis triaged this task as Medium priority. Aug 16 2019, 1:02 PM

@Agusbou2015: Why? Please always elaborate why when adding comments.

While fixing the phase check for new PDUs today I noticed that tower B for ps1-a7-eqiad shows unknown while tower A is fine: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-a7-eqiad. Perhaps missing config? All other PDUs were reported OK once monitoring was fixed, so I'm assuming there's something different here on the PDU side.

(Attached screenshot: 2019-09-18-190711_1393x198_scrot.png, 60 KB)

@RobH - can you check if the configuration on this one is complete? It was one of the PDUs you and Chris upgraded, when you went out to eqiad. Thanks, Willy

Just logged in and checked it out; it's good and reporting into everything correctly as far as I can tell.