b1-eqiad pdu refresh (Thursday 10/10 @11am UTC)
Closed, Resolved (Public)

Assigned To

None

Authored By

	RobH
	Jul 8 2019, 10:42 PM

Description

This task will track the replacement of ps1 and ps2 with new PDUs in rack B1-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

- Schedule downtime for the entire list of switches, routers, and servers.
- Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- Confirm that every switch, router, and server on the list has had its power restored from the new PDU tower.
- Once the new PDU tower is confirmed online, move on to the next steps.
- Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
- Confirm that every switch, router, and server on the list has had its power restored from the new PDU tower.
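
The confirmation gate in the steps above could be sketched roughly as follows. This is only an illustration, not the tooling used for this maintenance: the function names, the tower labels, and the `has_power` callback (which would have to be backed by whatever per-host PSU check is actually available) are all assumptions.

```python
from typing import Callable, Iterable, List


def unconfirmed(devices: Iterable[str], tower: str,
                has_power: Callable[[str, str], bool]) -> List[str]:
    """Return devices not yet confirmed as powered from the given tower."""
    return [d for d in devices if not has_power(d, tower)]


def refresh_rack(devices: Iterable[str], towers: Iterable[str],
                 has_power: Callable[[str, str], bool]) -> List[str]:
    """Process towers one at a time; refuse to continue to the next tower
    until every device is confirmed on the tower just migrated.

    `has_power(device, tower)` is a hypothetical check supplied by the caller.
    """
    devices = list(devices)
    done: List[str] = []
    for tower in towers:
        missing = unconfirmed(devices, tower, has_power)
        if missing:
            # Stop here: the old tower must not be touched further
            # while some device has no confirmed feed from the new one.
            raise RuntimeError(
                f"{tower}: power not confirmed for {', '.join(missing)}"
            )
        done.append(tower)
    return done
```

The point of the sketch is the ordering: the second tower is only touched after the first one is fully confirmed, so no device should lose both feeds at once.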

List of routers, switches, and servers

device	role	SRE team coordination	notes
asw2-b1-eqiad	asw	@ayounsi
es1014	es	DBA	@Marostegui to depool it before the maintenance
es1013	es	DBA	@Marostegui to depool it before the maintenance
ms-be1022	ms-be	@fgiunchedi	poweroff / poweron
db1084	db	DBA	@Marostegui to depool it before the maintenance
db1083	db	DBA	@Marostegui to depool it before the maintenance
kafka-jumbo1003		Analytics	fine to do any time
db1077	db	DBA	test host, nothing to be done
db1076	db	DBA	@Marostegui to depool it before the maintenance
db1112	db	DBA	@Marostegui to depool it before the maintenance
logstash1011		@fgiunchedi	ok with power loss, nice to have: disable es replication
cloudvirt1026	cloudvirt	cloud-services-team	@aborrero: running 39 VMs, please handle with care
cloudvirt1025	cloudvirt	cloud-services-team	@aborrero: running 46 VMs, please handle with care
dbstore1004	dbstore	Analytics
cloudvirt1023	cloudvirt	cloud-services-team	@aborrero: OK, not running any VM.
an-coord1001		Analytics	fine to do any time but please ping Analytics first
dbproxy1014	dbproxy	DBA	nothing to be done, not active
authdns1001	authdns	Traffic
db1124	db	DBA	sanitarium host, nothing to be done
snapshot1008		@ArielGlenn
db1118	db	DBA	@Marostegui to depool it before the maintenance
wdqs1007	wdqs	Discovery-Search	@Gehel good to go

Details

	Subject	Repo	Branch	Lines +/-
	PDUs: add model sentry 4 to eqiad b1 and a2	operations/puppet	production	+8 -6
	db-eqiad.php: Depool es1013, es1014	operations/mediawiki-config	master	+2 -2


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Cmjohnson	T226778 Install new PDUs in rows A/B (Top level tracking task)
		Resolved		None	T227536 b1-eqiad pdu refresh (Thursday 10/10 @11am UTC)

Event Timeline

RobH created this task. Jul 8 2019, 10:42 PM

RobH mentioned this in T226778: Install new PDUs in rows A/B (Top level tracking task).

RobH updated the task description. Jul 9 2019, 12:22 AM

RobH updated the task description.

RobH added subscribers: ayounsi, ArielGlenn, fgiunchedi.

Adding @hoo because wikidata entity dumps will be impacted.

A heads-up would be good so that I can gracefully stop the daemons on an-coord1001. For kafka-jumbo1003 it is fine as long as it doesn't risk losing power together with other kafka-jumbo nodes (2 down are tolerable, more probably not).
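
As a rough illustration of the kafka-jumbo constraint above (at most 2 nodes down at once), a pre-check before cutting power might look like this. The function name, the `max_down` parameter, and the example host names other than kafka-jumbo1003 are assumptions made for this sketch, not existing tooling.

```python
from typing import Set


def safe_to_power_off(host: str, already_down: Set[str], max_down: int = 2) -> bool:
    """Return True if taking `host` down keeps the number of unavailable
    nodes within the tolerated limit (2, per the comment above)."""
    return len(already_down | {host}) <= max_down


# Illustrative only: the "already down" hosts here are made-up inputs.
print(safe_to_power_off("kafka-jumbo1003", set()))
print(safe_to_power_off("kafka-jumbo1003", {"kafka-jumbo1001", "kafka-jumbo1002"}))
```

With nothing else down the check passes; with two other nodes already down it fails, since a third outage would exceed the stated tolerance.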

Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 22 2019, 2:42 PM

From the DB side, this can be done at any time.

Marostegui updated the task description. Jul 23 2019, 9:17 AM

fgiunchedi updated the task description. Jul 23 2019, 9:32 AM

RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM

RobH removed RobH as the assignee of this task. Aug 14 2019, 4:51 PM

wiki_willy renamed this task from b1-eqiad pdu refresh to b1-eqiad pdu refresh (Thursday 10/10 @11am UTC). Aug 15 2019, 5:33 PM

CDanis triaged this task as Medium priority. Aug 16 2019, 1:01 PM

Marostegui updated the task description. Aug 19 2019, 10:30 AM

Gehel updated the task description. Aug 19 2019, 4:15 PM

Gehel subscribed.

elukey updated the task description. Sep 17 2019, 5:58 AM

wiki_willy assigned this task to Jclark-ctr. Oct 7 2019, 3:41 PM

Change 541707 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool es1013, es1014

https://gerrit.wikimedia.org/r/541707

gerritbot added a project: Patch-For-Review. Oct 9 2019, 5:41 AM

Change 541707 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool es1013, es1014

https://gerrit.wikimedia.org/r/541707

Mentioned in SAL (#wikimedia-operations) [2019-10-09T05:45:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1013, es1014 T227536 (duration: 01m 00s)

Maintenance_bot removed a project: Patch-For-Review. Oct 9 2019, 6:10 AM

Mentioned in SAL (#wikimedia-operations) [2019-10-10T05:47:40Z] <marostegui> Depool db1084 db1083 db1076 db1118 for PDU maintenance - T227536

Mentioned in SAL (#wikimedia-operations) [2019-10-10T05:51:55Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1112 for PDU maintenance T227536', diff saved to https://phabricator.wikimedia.org/P9294 and previous config saved to /var/cache/conftool/dbconfig/20191010-055153-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-10T07:55:12Z] <marostegui> Stop MySQL on es1014 es1013 db1084 db1083 db1077 db1076 db1112 db1124 db1118 for on-site PDU maintenance (this will generate lag on labsdb hosts) - T227536

All the databases, es1013, es1014 and dbproxy1014 are good to go.

dbstore1004 needs to be handled by @elukey, so please coordinate with him before working on that one.

aborrero updated the task description. Oct 10 2019, 10:45 AM

aborrero subscribed.

Starting b1-eqiad pdu refresh

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:35:56Z] <arturo> icinga downtime cloudvirt1026 for 2h (T227536)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:36:33Z] <arturo> icinga downtime cloudvirt1025 for 2h (T227536)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T11:37:16Z] <arturo> icinga downtime cloudvirt1023 for 2h (T227536)

Mentioned in SAL (#wikimedia-cloud) [2019-10-10T11:59:53Z] <arturo> network switch hardware is down affecting cloudvirt1025/1026 (T227536) VMs are supposed to be online but unreachable

The swap is finished; everything is back up with redundant power.
Updated Netbox for the old and new PDUs.

Jclark-ctr reassigned this task from Jclark-ctr to RobH. Oct 10 2019, 12:51 PM

Jclark-ctr updated the task description.

Jclark-ctr subscribed.

Change 542321 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] PDUs: add model sentry 4 to eqiad b1 and a2

https://gerrit.wikimedia.org/r/542321

gerritbot added a project: Patch-For-Review. Oct 11 2019, 7:45 AM

Change 542321 merged by Ayounsi:
[operations/puppet@production] PDUs: add model sentry 4 to eqiad b1 and a2

https://gerrit.wikimedia.org/r/542321

Maintenance_bot removed a project: Patch-For-Review. Oct 11 2019, 8:10 AM

RobH closed this task as Resolved. Oct 22 2019, 4:40 PM

RobH removed RobH as the assignee of this task.