a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC)
Closed, Resolved · Public

Description

This task tracks the replacement of ps1 and ps2 in rack A6-eqiad with new PDUs.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

This rack has a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • - Schedule downtime for the entire list of switches and servers (a downtime sketch follows this checklist).
  • - Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower (a reachability check over the device list is sketched after the table below).
  • - Once the new PDU tower is confirmed online, move on to the next steps.
  • - Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • - Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • - Confirm serial works to the new PDU (it does not as of 2019-10-22 @ 17:08 GMT).
  • - Set up the PDU following the directions on https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/ServerTech#Initial_Setup
  • - Update the PDU model in puppet per T233129.
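
For the downtime step above, here is a minimal sketch of scheduling fixed Icinga downtime for the affected hosts via Icinga's external command file. The command file path, the 4-hour window, and the (truncated) host list are assumptions; the actual tooling used in production may be a wrapper or cookbook instead.

    #!/bin/bash
    # Sketch only: schedule fixed Icinga downtime for rack A6 hosts ahead of the PDU swap.
    CMDFILE=/var/lib/icinga/rw/icinga.cmd   # assumed path to Icinga's external command file
    HOSTS="pc1007 wtp1025 wtp1026 wtp1027 mc1019 mc1020 mc1021 mc1022 mc1023"
    NOW=$(date +%s)
    END=$((NOW + 4 * 3600))
    for h in $HOSTS; do
      # Downtime the host itself and all of its services.
      printf '[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;0;sre;a6-eqiad PDU refresh T227142\n' \
        "$NOW" "$h" "$NOW" "$END" >> "$CMDFILE"
      printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;%s;%d;%d;1;0;0;sre;a6-eqiad PDU refresh T227142\n' \
        "$NOW" "$h" "$NOW" "$END" >> "$CMDFILE"
    done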

List of routers, switches, and servers

device | role | SRE team / coordination | notes
asw2-a6-eqiad | asw | @ayounsi |
pc1007 | parsercache | DBA | can be failed over easily; @Marostegui to depool this host
wtp1027 | parsoid | serviceops | fine to do at any time
wtp1026 | parsoid | serviceops | fine to do at any time
wtp1025 | parsoid | serviceops | fine to do at any time
an-master1001 | | Analytics | fine to do any time
dbproxy1013 | dbproxy | DBA | not active
elastic1045 | cirrus-search | Discovery-Search | @Gehel good to go
elastic1044 | cirrus-search | Discovery-Search | @Gehel good to go
elastic1048 | cirrus-search | Discovery-Search | @Gehel good to go
mc1023 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1022 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1021 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1020 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
mc1019 | mc | serviceops @elukey | fine to do at any time outside of deployment windows
aqs1007 | | Analytics | fine to do any time
weblog1001 | | | fine to do any time but it may disrupt some webrequest monitoring that we rely on, Cc: @godog
restbase1021 | restbase | @jijiki | ok with power loss
labsdb1012 | labsdb | Analytics | Analytics to confirm if MySQL can be stopped
db1066 | db | DBA | Host powered off, DO NOT POWER ON - pending on-site decommissioning steps T233071
db1116 | db | DBA | backup source, nothing to be done
db1115 | db | DBA | tendril host, nothing to be done
labmon1002 | labmon | cloud-services-team | can be done anytime
druid1004 | | Analytics | fine to do any time
wdqs1004 | wdqs | Discovery-Search | @Gehel good to go
ores1001 | ores | @akosiaris | fine to do at any time
restbase-dev1004 | | | can be done at any time
cloudcontrol1003 | openstack control node | cloud-services-team | can be done at any time
mw1312 | mw | serviceops | fine to do at any time outside of deployment windows
mw1311 | mw | serviceops | fine to do at any time outside of deployment windows
mw1310 | mw | serviceops | fine to do at any time outside of deployment windows
mw1309 | mw | serviceops | fine to do at any time outside of deployment windows
mw1308 | mw | serviceops | fine to do at any time outside of deployment windows
mw1307 | mw | serviceops | fine to do at any time outside of deployment windows
ganeti1006 | ganeti node | @akosiaris | will need to be emptied in advance
db1096 | db | DBA | @Marostegui to depool this host
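
After each tower move, the "confirm power restored" checklist steps can be rough-checked with a simple reachability loop over the servers listed above. This is a sketch only: the .eqiad.wmnet domain and ICMP reachability are assumptions, the switch and PDU need their own checks on the management network, and db1066 is left out because it is intentionally powered off.

    # Report which of the listed hosts answer ping again after a tower move.
    for h in pc1007 wtp1025 wtp1026 wtp1027 an-master1001 dbproxy1013 \
             elastic1044 elastic1045 elastic1048 mc1019 mc1020 mc1021 mc1022 mc1023 \
             aqs1007 weblog1001 restbase1021 labsdb1012 db1116 db1115 labmon1002 \
             druid1004 wdqs1004 ores1001 restbase-dev1004 cloudcontrol1003 \
             mw1307 mw1308 mw1309 mw1310 mw1311 mw1312 ganeti1006 db1096; do
      if ping -c1 -W2 "${h}.eqiad.wmnet" >/dev/null 2>&1; then
        echo "OK   ${h}"
      else
        echo "DOWN ${h}"
      fi
    done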

Event Timeline

Analytics side: if possible I'd need some heads up to force a failover for an-master1001.

Memcached side: we have 5 mc10XX shards in the same rack, and losing all of them could be a big problem with the current configuration of mcrouter. Explicitly adding @Joe and @jijiki to understand how to handle this.

akosiaris added a subscriber: MoritzMuehlenhoff.

ganeti1006 can be emptied ahead of the maintenance by live-migrating its instances off the node:

sudo gnt-node migrate -f ganeti1006
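
Once that migration completes, a quick sanity check that ganeti1006 no longer carries primary instances (a sketch; Pinst is the primary-instance count column in standard Ganeti output):

    # Pinst should read 0 for ganeti1006 before the power work starts.
    sudo gnt-node list ganeti1006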

This rack contains an active primary db master, db1066; it would need to be failed over if we are not confident about not losing power.

RobH removed RobH as the assignee of this task.Aug 14 2019, 4:53 PM
wiki_willy renamed this task from a6-eqiad pdu refresh to a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC).Aug 15 2019, 5:31 PM
CDanis triaged this task as Medium priority.Aug 16 2019, 1:02 PM

@Marostegui - I would say just go for it and fail out in advance, if it's not too much trouble. Master DBs are very critical, so my opinion is to just take the extra precautionary measures. Thanks, Willy

I will get them scheduled, planned etc. Thanks

Marostegui updated the task description.
Marostegui updated the task description.

@elukey for labsdb1012 your team would need to let us know if MySQL can be stopped for this maintenance (just in case there is power loss, it is better to have MySQL stopped, as labs hosts do not have GTID enabled and the risk of corruption can be higher).

We can definitely stop MySQL on it; we need labsdb up and running for jobs at the beginning of the month :)

I also added the info about analytics hosts and flipped the requirement of depooling for memcached to "no", since we should do it only if things go on fire :)

Excellent, thank you. Let's stop replication + mysql then.
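
For reference, a minimal sketch of that sequence on labsdb1012, assuming a standard systemd-managed MariaDB service (the actual procedure may go through local wrapper scripts):

    # Stop replication first so the replica position is recorded cleanly, then stop the server.
    sudo mysql -e "STOP SLAVE;"
    sudo systemctl stop mariadb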

Change 542890 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Temporary pool pc1010 in pc1

https://gerrit.wikimedia.org/r/542890

Change 542890 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Temporary pool pc1010 in pc1

https://gerrit.wikimedia.org/r/542890

Mentioned in SAL (#wikimedia-operations) [2019-10-22T06:43:11Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool pc1010 T227142 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2019-10-22T07:53:49Z] <marostegui> Stop MySQL on db1116 pc1007 db1096:3315, db1096:3316 for PDU maintenance T227142

Mentioned in SAL (#wikimedia-operations) [2019-10-22T08:05:40Z] <marostegui> Stop MySQL on labsdb1012 for PDU work T227142

The following hosts are ready for this maintenance:

  • pc1007
  • labsdb1012
  • db1116
  • db1096
  • dbproxy1013
  • db1066 (note: this host is powered OFF as it is ready to be decommissioned; do not power it back on)

Pending: db1115 which will be confirmed by @jcrespo when ready to proceed.

Mentioned in SAL (#wikimedia-operations) [2019-10-22T10:32:26Z] <jynus> shutting down db1115 in preparation for PDU maintenance, this will make tendril and dbtree unavailable for 2 hours T227142

db1115 is now down, I took the opportunity to upgrade all its system packages, but didn't touch mariadb.

Finished PDU maintenance. Netbox updated with the new PDU.

Mentioned in SAL (#wikimedia-operations) [2019-10-22T12:25:56Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool pc1007 after PDU maintenance T227142 (duration: 00m 50s)

Change 545337 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] setting new pdu models

https://gerrit.wikimedia.org/r/545337

Change 545337 merged by RobH:
[operations/puppet@production] setting new pdu models

https://gerrit.wikimedia.org/r/545337

RobH removed a project: Patch-For-Review.
RobH updated the task description.

@wiki_willy requested I step in and set up the software side of things, but I cannot do so as serial to this PDU isn't currently working.

Can you troubleshoot the serial connection please? (You should be able to log in to the scs console and see if it works; you can ping me and I can teach you how to do this if you like!)

The Icinga downtime was set to expire in less than an hour, so I've extended it until 23:00 GMT.

ps1-a6-eqiad is shown as down in Icinga; I believe that is expected?

Hi @jijiki - I think there are a couple of things that @Jclark-ctr needs to check and resolve before @RobH can configure it. After that, the alert should go away. Thanks, Willy

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:03:39Z] <robh> setting ip info for ps1-a6-eqiad, it is rebooting. T227142

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:20:04Z] <robh> ps1-a6-eqiad setup complete, icinga errors should clear up T227142

RobH removed RobH as the assignee of this task.Oct 24 2019, 6:44 PM