b4-eqiad pdu refresh (Thursday 10/24 @11am UTC)
Closed, ResolvedPublic

Assigned To

None

Authored By

	RobH
	Jul 8 2019, 10:46 PM

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack B4-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

- Schedule downtime for the entire list of switches, routers, and servers.
- Wire up one of the two towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
- Once the new PDU tower is confirmed online, move on to the next steps.
- Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
- Connect via serial and confirm the serial connection works.
- Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
- Update the PDU model in puppet per T233129.
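
The one-tower-at-a-time ordering above is what keeps dual-fed devices powered during the swap. The following is a minimal Python sketch of that reasoning, not part of the actual procedure. The `Device` class, the `devices_losing_power` helper, and the feed assignments are all hypothetical; the task only states that atlas-eqiad has a single infeed, and which side it is plugged into is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    feeds: frozenset  # power sides the device is plugged into, e.g. {"A", "B"}


def devices_losing_power(devices, side_being_moved):
    """Return names of devices left with no feed while one side is de-energized."""
    return [d.name for d in devices if not (d.feeds - {side_being_moved})]


# Hypothetical feed assignments for illustration only.
rack = [
    Device("asw2-b4-eqiad", frozenset({"A", "B"})),
    Device("atlas-eqiad", frozenset({"A"})),  # single infeed; side "A" is assumed
    Device("phab1001", frozenset({"A", "B"})),
]

for side in ("A", "B"):
    print(side, devices_losing_power(rack, side))
```

Under these assumptions, only the single-infeed device goes dark, and only while its own side is being moved.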

List of routers, switches, and servers

device	role	SRE team coordination	notes
atlas-eqiad	RIPE atlas anchor	unknown	This has a single power infeed and will lose connection. Is it best to disconnect it before the work and reconnect it afterwards, rather than have it go up and down multiple times during the work?
asw2-b4-eqiad	asw	@ayounsi	ensure asw doesn't lose all power or the entire rack goes offline from network
ruthenium	parsoid::testing
elastic1050	elastic system	@Gehel
prometheus1004	prometheus	@fgiunchedi
cloudvirt1007	cloudvirt host	cloud-services-team	@JHedden 21 active VMs, please handle with care
cloudvirt1006	cloudvirt host	cloud-services-team	@JHedden 17 active VMs, please handle with care
cloudvirt1005	cloudvirt host	cloud-services-team	@JHedden 27 active VMs, please handle with care
cloudvirt1019	cloudvirt host	cloud-services-team	@JHedden 2 active VMs, please handle with care
cloudvirt1004	cloudvirt host	cloud-services-team	@JHedden 19 active VMs, please handle with care
cloudvirt1003	cloudvirt host	cloud-services-team	@JHedden 17 active VMs, please handle with care
cloudvirt1021	cloudvirt host	cloud-services-team	@JHedden 25 active VMs, please handle with care
cloudvirt1016	cloudvirt host	cloud-services-team	@JHedden 58 active VMs, please handle with care
cloudnet1004	cloudvirt host	cloud-services-team	@JHedden can happen anytime, has redundant peer
cloudvirt1013	cloudvirt host	cloud-services-team	@JHedden 23 active VMs, please handle with care
conf1005	zookeeper/etc discovery service
phab1001	phabricator main system
iron
kubestage1002
kafka1002		@herron
an-worker1085	hadoop	Analytics	fine to do any time
maps1002	openstreetmaps slave server
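
Several rows in the list above have no entry in the coordination column. As a rough sketch, assuming the list is available as tab-separated text with the same four columns, the rows still missing a contact could be found as shown below. The `missing_contact` helper is hypothetical, and `TABLE` holds only a few rows copied from the list.

```python
import csv
import io

# A few rows copied from the device list; columns are tab-separated.
TABLE = (
    "device\trole\tSRE team coordination\tnotes\n"
    "ruthenium\tparsoid::testing\n"
    "elastic1050\telastic system\t@Gehel\n"
    "iron\n"
    "kafka1002\t\t@herron\n"
    "an-worker1085\thadoop\tAnalytics\tfine to do any time\n"
)


def missing_contact(text):
    """Return device names whose coordination column is empty or absent."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    missing = []
    for row in reader:
        if not row:
            continue
        row = row + [""] * (4 - len(row))  # pad short rows to four columns
        if not row[2].strip():
            missing.append(row[0])
    return missing


print(missing_contact(TABLE))
```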

Details

	Subject	Repo	Branch	Lines +/-
	setting sentry4 for ps1-b4-eqiad	operations/puppet	production	+4 -3


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Cmjohnson	T226778 Install new PDUs in rows A/B (Top level tracking task)
		Resolved		None	T227540 b4-eqiad pdu refresh (Thursday 10/24 @11am UTC)

Event Timeline

RobH created this task. Jul 8 2019, 10:46 PM

RobH mentioned this in T226778: Install new PDUs in rows A/B (Top level tracking task). Jul 8 2019, 10:50 PM

• Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 22 2019, 2:41 PM

RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM

wiki_willy renamed this task from b4-eqiad pdu refresh to b4-eqiad pdu refresh (Thursday 10/24 @11am UTC). Aug 15 2019, 5:36 PM

RobH updated the task description. Aug 29 2019, 4:24 PM

RobH added subscribers: ayounsi, Gehel, • Nuria.

RobH updated the task description. Aug 29 2019, 6:08 PM

colewhite updated the task description. Aug 30 2019, 8:58 PM

colewhite added a subscriber: fgiunchedi.

RobH removed RobH as the assignee of this task. Sep 6 2019, 3:35 PM

jbond triaged this task as Medium priority. Sep 9 2019, 9:15 AM

elukey updated the task description. Sep 17 2019, 6:02 AM

elukey added a subscriber: herron.

RobH updated the task description. Oct 11 2019, 8:41 PM

wiki_willy assigned this task to • Cmjohnson. Oct 21 2019, 4:26 PM

• JHedden updated the task description. Oct 21 2019, 7:09 PM

• JHedden subscribed.

• Phamhi subscribed. Oct 22 2019, 4:22 PM

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T10:58:19Z] <arturo> icinga downtime for 1h (T227540) cloudvirt100[3-7], cloudvirt1019, cloudvirt1016, cloudvirt1021, cloudvirt1013, cloudnet1004

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:04:54Z] <arturo> stopped mariadb in clouddb1001 (T227540)

Starting PDU replacement.

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:09:09Z] <phamhi> stopped postgresl in clouddb1004 (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:10:13Z] <arturo> icinga downtime for 2h (T227540) toolschecker

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:13:53Z] <arturo> poweroff VM clouddb1004 (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:14:59Z] <arturo> poweroff VM clouddb1001, hypervisor will be powered off (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:15:40Z] <arturo> poweroff cloudvirt1019 during the PDU operations (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T11:58:02Z] <arturo> icinga downtime for 2h (T227540) cloudvirt1019

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T12:30:24Z] <arturo> starting cloudvirt1019, PDU operations ended (T227540)

Mentioned in SAL (#wikimedia-cloud) [2019-10-24T12:34:17Z] <arturo> start both clouddb1001 and clouddb1004 (T227540)

• Cmjohnson updated the task description. Oct 24 2019, 12:47 PM

Completed the PDU refresh; Netbox updated with the new PDU and console.

RobH claimed this task. Oct 24 2019, 6:21 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:22:05Z] <robh> completing ps1-b6-eqiad setup, pdu will reboot twice, power output unaffected T227540

bd808 mentioned this in T236420: ToolsDB unstable following unplanned software upgrade. Oct 24 2019, 6:26 PM

Change 545915 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting sentry4 for ps1-b4-eqiad

https://gerrit.wikimedia.org/r/545915

gerritbot added a project: Patch-For-Review. Oct 24 2019, 6:36 PM

Change 545915 merged by RobH:
[operations/puppet@production] setting sentry4 for ps1-b4-eqiad

https://gerrit.wikimedia.org/r/545915