b4-eqiad pdu refresh (Thursday 10/24 @11am UTC)
Closed, ResolvedPublic

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B4-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Connect via serial and confirm the serial connection works.
  • Set up the PDU following the directions at https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • Update the PDU model in puppet per T233129.
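
The two "confirm" steps above can be sketched as a small check that runs before each side is cut over and again after it is re-fed. This is only an illustration, not tooling used for this task: the `Device` class, the `loses_all_power` and `missing_feed` helpers, and the per-feed status values are all hypothetical, and which side atlas-eqiad's single infeed sits on is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    feed_a_ok: bool  # power present on the supply fed from the A side
    feed_b_ok: bool  # power present on the supply fed from the B side


def _attr(side):
    """Map a side letter to the matching Device field name."""
    if side not in ("A", "B"):
        raise ValueError(f"side must be 'A' or 'B', got {side!r}")
    return "feed_a_ok" if side == "A" else "feed_b_ok"


def loses_all_power(devices, side):
    """Devices that go dark if `side` is de-energized (their other feed is not up)."""
    other = "B" if side == "A" else "A"
    return [d.name for d in devices if not getattr(d, _attr(other))]


def missing_feed(devices, side):
    """Devices that do not (yet) show power on `side` after it was moved to the new tower."""
    return [d.name for d in devices if not getattr(d, _attr(side))]


# Hypothetical snapshot of three devices from the list below.
rack = [
    Device("asw2-b4-eqiad", True, True),
    Device("conf1005", True, True),
    Device("atlas-eqiad", True, False),  # single infeed; assumed to be on the A side
]

print(loses_all_power(rack, "A"))  # ['atlas-eqiad']
print(missing_feed(rack, "B"))     # ['atlas-eqiad']
```

An empty result from `missing_feed` for the side just moved would correspond to the "confirmed online" condition before moving on to the remaining tower.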

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| --- | --- | --- | --- |
| atlas-eqiad | RIPE atlas anchor | unknown | this has a single infeed power and will lose connection. is it best to disconnect before work and reconnect afterwards, rather than have it pop up and down multiple times during the work? |
| asw2-b4-eqiad | asw | @ayounsi | ensure asw doesn't lose all power or the entire rack goes offline from network |
| ruthenium | parsoid::testing | | |
| elastic1050 | elastic system | @Gehel | |
| prometheus1004 | prometheus | @fgiunchedi | |
| cloudvirt1007 | cloudvirt host | cloud-services-team | @JHedden 21 active VMs, please handle with care |
| cloudvirt1006 | cloudvirt host | cloud-services-team | @JHedden 17 active VMs, please handle with care |
| cloudvirt1005 | cloudvirt host | cloud-services-team | @JHedden 27 active VMs, please handle with care |
| cloudvirt1019 | cloudvirt host | cloud-services-team | @JHedden 2 active VMs, please handle with care |
| cloudvirt1004 | cloudvirt host | cloud-services-team | @JHedden 19 active VMs, please handle with care |
| cloudvirt1003 | cloudvirt host | cloud-services-team | @JHedden 17 active VMs, please handle with care |
| cloudvirt1021 | cloudvirt host | cloud-services-team | @JHedden 25 active VMs, please handle with care |
| cloudvirt1016 | cloudvirt host | cloud-services-team | @JHedden 58 active VMs, please handle with care |
| cloudnet1004 | cloudvirt host | cloud-services-team | @JHedden can happen anytime, has redundant peer |
| cloudvirt1013 | cloudvirt host | cloud-services-team | @JHedden 23 active VMs, please handle with care |
| conf1005 | zookeeper/etc discovery service | | |
| phab1001 | phabricator main system | | |
| iron | | | |
| kubestage1002 | | | |
| kafka1002 | | @herron | |
| an-worker1085 | hadoop | Analytics | fine to do any time |
| maps1002 | openstreetmaps slave server | | |
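
For a rough sense of how many instances are exposed during the power change, the active VM counts from the notes column can be tallied. This is a minimal sketch that just restates the numbers listed above; the `active_vms` dictionary is an illustrative name and is not pulled from any inventory system.

```python
# Active VM counts per hypervisor, copied from the notes column of the device list.
active_vms = {
    "cloudvirt1007": 21,
    "cloudvirt1006": 17,
    "cloudvirt1005": 27,
    "cloudvirt1019": 2,
    "cloudvirt1004": 19,
    "cloudvirt1003": 17,
    "cloudvirt1021": 25,
    "cloudvirt1016": 58,
    "cloudvirt1013": 23,
}

total = sum(active_vms.values())
busiest = max(active_vms, key=active_vms.get)
quietest = min(active_vms, key=active_vms.get)

print(f"{total} active VMs across {len(active_vms)} hypervisors")
print(f"most VMs: {busiest} ({active_vms[busiest]})")
print(f"fewest VMs: {quietest} ({active_vms[quietest]})")
```

cloudnet1004 is left out because its note gives no VM count.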

Event Timeline

wiki_willy renamed this task from b4-eqiad pdu refresh to b4-eqiad pdu refresh (Thursday 10/24 @11am UTC). Aug 15 2019, 5:36 PM
RobH removed RobH as the assignee of this task. Sep 6 2019, 3:35 PM
jbond triaged this task as Medium priority. Sep 9 2019, 9:15 AM
elukey added a subscriber: herron.

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T10:58:19Z] <arturo> icinga downtime for 1h (T227540) cloudvirt100[3-7], cloudvirt1019, cloudvirt1016, cloudvirt1021, cloudvirt1013, cloudnet1004

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:04:54Z] <arturo> stopped mariadb in clouddb1001 (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:09:09Z] <phamhi> stopped postgresl in clouddb1004 (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:10:13Z] <arturo> icinga downtime for 2h (T227540) toolschecker

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:13:53Z] <arturo> poweroff VM clouddb1004 (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:14:59Z] <arturo> poweroff VM clouddb1001, hypervisor will be powered off (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:15:40Z] <arturo> poweroff cloudvirt1019 during the PDU operations (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:58:02Z] <arturo> icinga downtime for 2h (T227540) cloudvirt1019

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T12:30:24Z] <arturo> starting cloudvirt1019, PDU operations ended (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T12:34:17Z] <arturo> start both clouddb1001 and clouddb1004 (T227540)

completed pdu refresh, Netbox update with new pdu and console

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:22:05Z] <robh> completing ps1-b6-eqiad setup, pdu will reboot twice, power output unaffected T227540

Change 545915 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting sentry4 for ps1-b4-eqiad

https://gerrit.wikimedia.org/r/545915

Change 545915 merged by RobH:
[operations/puppet@production] setting sentry4 for ps1-b4-eqiad

https://gerrit.wikimedia.org/r/545915

RobH removed a project: Patch-For-Review.
RobH updated the task description.

All changes merged; when puppet runs on icinga, it'll clear the alerts.

RobH removed RobH as the assignee of this task. Oct 24 2019, 6:42 PM