b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC)
Closed, Resolved (Public)

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B7-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Before work starts, silence all icinga alerts until 8PM GMT the same day.
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers has had power restored from the new PDU tower.
  • Connect via serial and confirm that the serial connection works.
  • Set up the PDU following the directions at https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • Update the PDU model in puppet per T233129.
  • Clear the icinga errors for the missing ps2 input by connecting (or checking) the RJ11 cable between ps1 and ps2 in b7-eqiad. Once it is connected, the icinga errors for the tower B infeed will clear.
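
As a rough illustration of the "silence until 8PM GMT same day" step, the length of the silence window can be worked out from the start time. This is only a sketch: the helper name `downtime_seconds` is hypothetical, GMT is treated as UTC, and the 12:00 UTC start on 2019-11-05 is taken from the task title and due date rather than from any tooling.

```python
from datetime import datetime, timezone


def downtime_seconds(start: datetime, end_hour_utc: int = 20) -> int:
    """Seconds from `start` until end_hour_utc:00 UTC on the same day."""
    end = start.replace(hour=end_hour_utc, minute=0, second=0, microsecond=0)
    if end <= start:
        raise ValueError("the silence window has already ended for this day")
    return int((end - start).total_seconds())


# Assumed start: Tuesday 11/5 at 12pm UTC (year 2019, per the due date).
start = datetime(2019, 11, 5, 12, 0, tzinfo=timezone.utc)
print(downtime_seconds(start))  # 28800 (8 hours)
```

The resulting value is the kind of duration a downtime scheduling tool would typically ask for; which tool is used to apply it is outside this sketch.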

List of routers, switches, and servers

| device | role | SRE team coordination | recommended action during maintenance |
| --- | --- | --- | --- |
| asw-b7-eqiad | asw | @ayounsi | ensure this doesn't go offline as it will take entire rack network offline |
| wtp1033 | | | |
| wtp1032 | | | |
| wtp1031 | | | |
| kafka-main1002 | | @herron | To avoid alert noise from adjacent kafka-main hosts, schedule icinga downtime for "Kafka Broker Under Replicated Partitions" service on kafka-main100[123] as well. Perform graceful shutdown of server before maintenance, and ensure powered up when completed. |
| dbprov1002 | db provisioning/backup generation host | DBA | Really nothing to do, but @jcrespo will keep an eye on it |
| cloudvirtan1005 | | | |
| cloudvirtan1004 | | | |
| an-worker1087 | | @Nuria | |
| an-worker1086 | | @Nuria | |
| cp1082 | cp system | Traffic | T227542#5355289 |
| cp1081 | cp system | Traffic | T227542#5355289 |
| ms-be1041 | ms-be system | fillipo | gracefully shut down the host just before rack maintenance, and power it back online post-maintenance. |
| cloudvirt1022 | cloudvirt host | cloud-services-team | @JHedden No running VMs, can happen anytime |
| analytics1073 | | Analytics | fine to do any time |
| lvs1014 | lvs system | @BBlack | T227542#5355289 |
| cloudvirt1020 | cloudvirt host | cloud-services-team | @JHedden has running VMs please handle with care |
| druid1005 | | Analytics | fine to do any time |
| ores1003 | | | |
| cloudnet1003 | | cloud-services-team | @JHedden is active but it has a redundant peer |
| restbase-dev1005 | | | |
| cloudcontrol1004 | | cloud-services-team | @JHedden is active but it has a redundant peer |
| cloudvirt1017 | cloudvirt | cloud-services-team | @JHedden has a large number of running VMs, please handle with care |
| mw1318 | mw server | @Joe | |
| mw1317 | mw server | @Joe | |
| mw1316 | mw server | @Joe | |
| mw1315 | mw server | @Joe | |
| mw1314 | mw server | @Joe | |
| mw1313 | mw server | @Joe | |
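
The kafka-main1002 entry above uses the shorthand kafka-main100[123] for three adjacent hosts. A minimal sketch of expanding that shorthand into individual hostnames follows; it assumes the brackets mean "one character from this set" (as in shell globbing), and `expand_hosts` is a hypothetical helper, not part of any existing tool.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand a single [abc]-style character set into one hostname per character."""
    match = re.fullmatch(r"(.*)\[(\w+)\](.*)", pattern)
    if match is None:
        # No bracket set present: the pattern is already a plain hostname.
        return [pattern]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{c}{suffix}" for c in chars]


print(expand_hosts("kafka-main100[123]"))
# ['kafka-main1001', 'kafka-main1002', 'kafka-main1003']
```

Only one bracket set per pattern is handled here, which is all the shorthand in this task needs.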

Details

Due Date
Nov 15 2019, 11:00 AM

Event Timeline

lvs1014 here will need special care: Traffic should stop puppet and pybal and monitor failover to lvs1016 ahead of the work, then revert afterwards. cp1081 and cp1082 can be depooled as normal.

wiki_willy renamed this task from b7-eqiad pdu refresh to b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC). Aug 15 2019, 5:38 PM
RobH triaged this task as High priority. Aug 28 2019, 6:31 PM
RobH updated the task description.
RobH removed RobH as the assignee of this task. Aug 28 2019, 6:39 PM
RobH updated the task description.
RobH set Due Date to Nov 15 2019, 12:00 AM.
RobH changed Due Date from Nov 15 2019, 12:00 AM to Nov 15 2019, 11:00 AM.
RobH added subscribers: ayounsi, Nuria, Joe.
elukey added a subscriber: herron.
wiki_willy renamed this task from b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC) to b7-eqiad pdu refresh (Tuesday 11/5 @10am UTC). Oct 30 2019, 12:26 AM
wiki_willy renamed this task from b7-eqiad pdu refresh (Tuesday 11/5 @10am UTC) to b7-eqiad pdu refresh (Tuesday 11/5 @12pm UTC). Nov 4 2019, 4:19 PM

I don't want to conflict-edit the task description, but as far as the MW* and WTP* servers are concerned, no action is needed.

Mentioned in SAL (#wikimedia-cloud) [2019-11-05T11:59:38Z] <arturo> icinga downtime for 1h cloudcontrol1004, cloudnet1003, cloudvirt1017/1020/1022 for PDU operations in the rack T227542

Jclark-ctr updated the task description.
# pmshell

 1: ps1-a1-eqiad    2: ps1-a2-eqiad    3: ps1-a3-eqiad    4: ps1-a4-eqiad   
 5: ps1-a5-eqiad    6: ps1-a6-eqiad    7: ps1-a7-eqiad    8: ps1-a8-eqiad   
 9: ps1-b1-eqiad   10: ps1-b2-eqiad   11: ps1-b3-eqiad   12: ps1-b4-eqiad   
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad   
17: asw-a1-eqiad   18: asw-a2-eqiad   19: asw-a3-eqiad   20: asw-a4-eqiad   
21: asw-a5-eqiad   22: asw-a6-eqiad   23: asw-a7-eqiad   24: asw-a8-eqiad   
25: asw-b1-eqiad   26: asw-b2-eqiad   27: asw-b3-eqiad   28: asw-b4-eqiad   
29: asw-b5-eqiad   30: asw-b6-eqiad   31: asw-b7-eqiad   32: asw-b8-eqiad   
33: re0.cr1-eqiad  34: re1.cr1-eqiad  35: re0.cr2-eqiad  36: re1.cr2-eqiad  
37: mr1-eqiad      40: msw1-eqiad     41: asw2-a5-eqiad  45: asw2-a3-eqiad  

Connect to port > 15  

Sentry Smart PDU Version 8.0n

Username:
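
For reference, the port chosen in the session above (15) can be looked up from the menu text by device name. The sketch below only parses menu output like the one pasted here; it does not talk to pmshell, and `parse_menu` is a hypothetical helper. The `MENU` string is a two-line excerpt of the listing above.

```python
import re
from typing import Dict

# Excerpt of the pmshell menu pasted above.
MENU = """\
 9: ps1-b1-eqiad   10: ps1-b2-eqiad   11: ps1-b3-eqiad   12: ps1-b4-eqiad
13: ps1-b5-eqiad   14: ps1-b6-eqiad   15: ps1-b7-eqiad   16: ps1-b8-eqiad
"""


def parse_menu(text: str) -> Dict[str, int]:
    """Map each device name in the menu to its port number."""
    return {name: int(port) for port, name in re.findall(r"(\d+):\s+(\S+)", text)}


ports = parse_menu(MENU)
print(ports["ps1-b7-eqiad"])  # 15
```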

Change 548769 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] ps1-b7-eqiad model setting

https://gerrit.wikimedia.org/r/548769

Change 548769 merged by RobH:
[operations/puppet@production] ps1-b7-eqiad model setting

https://gerrit.wikimedia.org/r/548769

RobH updated the task description.
  • Clear the icinga errors for the missing ps2 input by connecting (or checking) the RJ11 cable between ps1 and ps2 in b7-eqiad. Once it is connected, the icinga errors for the tower B infeed will clear.

@Jclark-ctr: Please see the update above and address, thanks!

Once the towers are linked, the errors on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ps1-b7-eqiad should clear up and go green for tower B.

Jclark-ctr updated the task description.

Confirmed the link; the errors have cleared from icinga.