a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
Closed, Resolved, Public

Authored By

	RobH
	Jul 2 2019, 7:58 PM

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack A2-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

- schedule downtime for the entire list of switches and servers.
- Wire up one of the two towers, energize, and relocate power to it from existing/old pdu tower (now de-energized).
- confirm entire list of switches, routers, and servers have had their power restored from the new pdu tower
- Once new PDU tower is confirmed online, move on to next steps.
- Wire up remaining tower, energize, and relocate power to it from existing/old pdu tower (now de-energized).
- confirm entire list of switches, routers, and servers have had their power restored from the new pdu tower
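The confirm-before-continuing logic in the steps above can be sketched as follows. This is only an illustration: `devices_still_down`, `swap_towers`, and the `is_up` callback are hypothetical names, and how a device is judged "up" (ping, console, monitoring) is left to the caller.

```python
from typing import Callable, Iterable, List, Tuple


def devices_still_down(devices: Iterable[str], is_up: Callable[[str], bool]) -> List[str]:
    """Return the devices that have not come back, in input order."""
    return [d for d in devices if not is_up(d)]


def swap_towers(
    devices: List[str],
    towers: List[str],
    is_up: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Walk the towers in order and stop at the first one that leaves devices down.

    Returns (towers confirmed online, devices still down).
    """
    confirmed: List[str] = []
    for tower in towers:
        # The wiring, energizing, and power relocation happen on site;
        # this sketch only models the confirmation step that follows.
        down = devices_still_down(devices, is_up)
        if down:
            return confirmed, down
        confirmed.append(tower)
    return confirmed, []
```

The second tower is only touched once every device has been confirmed on the first, which matches the "once confirmed online, move on" step.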

List of routers, switches, and servers

device	role	SRE team coordination	notes
asw2-a2-eqiad	asw	@ayounsi
conf1001	zookeeper/etcd	serviceops	to be decommed
kafka1023	kafka	Analytics	to be decommed
kafka1013	kafka	Analytics	to be decommed
kafka1012	kafka	Analytics	to be decommed
db1107	eventlogging db	Analytics	please ping analytics to stop data flowing to the db temporarily
tungsten
cloudelastic1001		Discovery-Search	@Gehel good to go
kafka-jumbo1002	kafka	Analytics	ok to proceed
ms-be1045	ms-be	@fgiunchedi	poweroff / poweron
ms-be1044	ms-be	@fgiunchedi	poweroff / poweron
an-worker1079	analytics	Analytics
db1082	db	DBA	@Marostegui to depool this host
db1081	db	DBA	@Marostegui to depool this host
db1080	db	DBA	@Marostegui to depool this host
db1079	db	DBA	@Marostegui to depool this host
db1075	db	DBA	@Marostegui to depool this host
db1074	db	DBA	@Marostegui to depool this host, needs to be powered off as it has a broken PSU
ms-be1019	ms-be	@fgiunchedi	poweroff / poweron
es1011	external store	DBA	@Marostegui to depool this host
an-worker1078	analytics	Analytics	ok to proceed
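For scheduling downtime, it can help to group the table above by coordinating team. A minimal sketch, assuming the table is available as tab-separated text; the short column names and the `group_by_team` helper are hypothetical, and only a few rows are copied here.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the table above. The short column names are
# hypothetical stand-ins for the task's headers.
SAMPLE = (
    "device\trole\tteam\tnotes\n"
    "conf1001\tzookeeper/etcd\tserviceops\tto be decommed\n"
    "db1107\teventlogging db\tAnalytics\tping analytics first\n"
    "tungsten\n"
    "cloudelastic1001\t\tDiscovery-Search\t@Gehel good to go\n"
    "db1082\tdb\tDBA\t@Marostegui to depool this host\n"
    "db1081\tdb\tDBA\t@Marostegui to depool this host\n"
)


def group_by_team(tsv_text: str) -> dict:
    """Map each team to its devices; rows without a team go under 'unassigned'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # Short rows (such as 'tungsten') leave the missing columns as None.
        team = row.get("team") or "unassigned"
        groups[team].append(row["device"])
    return dict(groups)
```

Rows with no team listed, such as tungsten, are collected separately so they are not overlooked when pinging owners.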

Details

	Subject	Repo	Branch	Lines +/-
	db-eqiad.php: Depool es1011	operations/mediawiki-config	master	+1 -1
	wmnet: Failover dbproxy1001 to dbproxy1006	operations/dns	master	+1 -1


Related Objects

Status	Assigned	Task
Resolved	• Cmjohnson	T226778 Install new PDUs in rows A/B (Top level tracking task)
Resolved	Jclark-ctr	T227138 a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
Resolved	Marostegui	T230783 Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC
Resolved	Jclark-ctr	T233534 db1075 (s3 master) crashed - BBU failure
		Unknown Object (Task)
Declined	None	T233569 Batch db1074-db1079 hosts having BBU issues
Resolved	Kormat	T233684 Make primary DB masters page on HOST DOWN alert
Resolved	Marostegui	T322987 db2173 crashed and didn't alert
Resolved	Papaul	T322988 db2173 HW errors
Resolved	Marostegui	T230784 Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC
Resolved	Trizek-WMF	T230788 Community Relations support needed for several read-only windows (s2, s3, s4 and s8)
Resolved	Jclark-ctr	T235190 fix serial connection for ps1-a2-eqiad

Event Timeline

RobH created this task.Jul 2 2019, 7:58 PM

RobH updated the task description. (Show Details)

RobH mentioned this in T226778: Install new PDUs in rows A/B (Top level tracking task).Jul 2 2019, 8:02 PM

RobH triaged this task as Medium priority.Jul 2 2019, 8:06 PM

RobH updated the task description. (Show Details)

RobH added a subscriber: ayounsi.

RobH added a subscriber: fgiunchedi.

The kafka10XX hosts are going to be decommed in T226517, so not a concern. The other hosts can go down without horrible consequences :)

I assume that you'll do one rack at a time, but asking anyway: in T226782 (a1) there is another kafka-jumbo host scheduled for maintenance, so it would be great if both of them weren't at risk of losing power at the same time.

db1081 and db1075 are primary masters, so if we are not fully sure that no power will be lost, I would rather do other racks first.
Racks on row A that are good to go:

A3: has one active dbproxy (dbproxy1001) I could failover tomorrow and then it should be good to go.
A4: good to go
A5: good to go if done before Thursday 30th as that day db1128 will become a master (T228243)
A7: good to go

From row B:
B1: good to go
B2: good to go after Thursday 25th, as we are failing over that host that day (T228243)
B3: It has m5 master which is mostly used by wikitech and cloud team, so you might want to ping them. From the DBAs side it is good to go.
B4: good to go
B6: good to go
B7: good to go
B8: it has m2 master which is mostly used by recommendationsapi, otrs, debmonitor, so if those stakeholders are ok, that is fine from a DBA point of view. Tags should be: Znuny Recommendation-API SRE-tools

Marostegui updated the task description. (Show Details)Jul 22 2019, 3:01 PM

Marostegui updated the task description. (Show Details)

Change 524805 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Failover dbproxy1001 to dbproxy1006

https://gerrit.wikimedia.org/r/524805

gerritbot added a project: Patch-For-Review.Jul 22 2019, 3:22 PM

Change 524805 merged by Marostegui:
[operations/dns@master] wmnet: Failover dbproxy1001 to dbproxy1006

https://gerrit.wikimedia.org/r/524805

Maintenance_bot removed a project: Patch-For-Review.Jul 23 2019, 5:10 AM

conf1001 is fine to power down (no depool necessary). Perform whatever actions are needed and then power it on; it will repool itself automatically.

For ms-be same as T227140: a4-eqiad pdu refresh

fgiunchedi updated the task description. (Show Details)Jul 23 2019, 9:24 AM

RobH moved this task from Backlog to High Priority Task on the ops-eqiad board.Jul 24 2019, 7:18 PM

RobH moved this task from High Priority Task to Blocked on the ops-eqiad board.Jul 26 2019, 1:37 PM

RobH removed RobH as the assignee of this task.Aug 14 2019, 4:52 PM

wiki_willy renamed this task from a2-eqiad pdu refresh to a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC).Aug 15 2019, 5:30 PM

We have two masters on this rack: db1075 (s3) and db1104 (s4).
@wiki_willy how confident are you guys that this won't have an unexpected downtime? (cc @jcrespo)

Marostegui updated the task description. (Show Details)Aug 19 2019, 10:32 AM

Gehel updated the task description. (Show Details)Aug 19 2019, 4:15 PM

Gehel subscribed.

Marostegui mentioned this in T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC.Aug 20 2019, 10:13 AM

Marostegui mentioned this in T230784: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC.

Marostegui closed subtask T230783: Switchover s3 primary database master db1075 -> db1123 - 24th Sept @05:00 UTC as Resolved.Sep 24 2019, 5:39 AM

Marostegui updated the task description. (Show Details)Sep 24 2019, 6:29 AM

Marostegui updated the task description. (Show Details)Sep 26 2019, 5:14 AM

Marostegui closed subtask T230784: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC as Resolved.Sep 26 2019, 5:30 AM

elukey updated the task description. (Show Details)Oct 2 2019, 6:26 AM

Change 541148 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool es1011

https://gerrit.wikimedia.org/r/541148

gerritbot added a project: Patch-For-Review.Oct 7 2019, 6:21 AM

Change 541148 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool es1011

https://gerrit.wikimedia.org/r/541148

Mentioned in SAL (#wikimedia-operations) [2019-10-07T06:25:14Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1011 T227138 (duration: 01m 10s)

db1074 has a broken PSU and the new PSU is scheduled to arrive on the 10th (T233567#5544445), so I will power off this host. It will need to be powered back on by @Cmjohnson or @Jclark-ctr.

Maintenance_bot removed a project: Patch-For-Review.Oct 7 2019, 7:10 AM

Marostegui updated the task description. (Show Details)Oct 7 2019, 7:32 AM

Marostegui mentioned this in Unknown Object (Task).Oct 7 2019, 7:54 AM

wiki_willy assigned this task to • Cmjohnson.Oct 7 2019, 3:41 PM

Mentioned in SAL (#wikimedia-operations) [2019-10-08T05:41:28Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1082 db1081 db1080 db1079 db1075 db1074 for PDU maintenance T227138', diff saved to https://phabricator.wikimedia.org/P9254 and previous config saved to /var/cache/conftool/dbconfig/20191008-054127-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-08T06:48:45Z] <marostegui> Stop MySQL on es011 db1082 db1081 db1080 db1079 db1075 db1074 (replication lag will appear on labs for s5) for on-site maintenance T227138

@Cmjohnson the following hosts are good to go: db1082 db1081 db1080 db1079 db1075 db1074 es1011
Please note:

db1074 has been powered off as it has a broken PSU, so please turn it back ON once the maintenance is done
db1107 is owned by Analytics, so please ping them before working with it unless they say otherwise.

Mentioned in SAL (#wikimedia-operations) [2019-10-08T12:27:11Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1012 T227138 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-10-08T12:38:40Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool es1012 T227138 (duration: 00m 51s)

The PDU swap is over. We did lose an-worker1079 because its PSUs did not fail over. Everything is cabled and the towers are linked together. The new PDU still needs updating.

Re-assigning to @RobH to complete install/updating of new PDU. Thanks, Willy

I've just attempted to connect to ps1-a2-eqiad via serial, and failed. I'll outline the steps needed to fix this below. After coordinating with @wiki_willy, we determined it best to assign this to @Jclark-ctr to fix (though @Cmjohnson is also able to do so; either can steal this task as needed).

Please note these steps assume @Jclark-ctr has his shell access (in the dc ops group) working on his laptop (it is active on the cluster). If he doesn't have his config set up for this yet, please ping me in IRC and I'll assist with the ssh config/setup.

I'll assume John is doing this, so I'll outline the full steps needed to fully fix and test the fix before handing this back to me.

ps1-a2-eqiad's serial console port (orange cable) should run from the PDU tower back to port 2 on scs-a8-eqiad (the opengear console in rack A8).
Once it is connected, you can test the serial connection as follows:
- ssh root@scs-a8-eqiad.mgmt.eqiad.wmnet and use the management scs password.
- Once connected, run pmshell and hit enter. It will list all ports; pick port 2 and hit enter.
- It should prompt with a login screen. If it doesn't, the serial connection is failing.

If the serial connection is failing, then the orange patch cable may need to have its ends re-crimped or be replaced. If this patch uses the black in-line adapter (on the PDU side of the orange cable), then you can use a standard orange patch cable. If it doesn't have an in-line adapter, you'll have to make a special cable. Please coordinate with @RobH before you do so, as we may just temporarily use the adjacent rack's serial connection to get this set up quickly.

@Jclark-ctr and I went through the following to fix this issue:

tested (failed) scs-a8-eqiad port 2 to ps1-a2-eqiad connection
tested (works) scs-a8-eqiad:3 to ps1-a3-eqiad
moved ps1-a3-eqiad connection to ps1-a2-eqiad and it worked (so the PDU serial is functional)
moved the working connection from port 3 to port 2 on the scs, still worked (scs is functional)
determined it was a bad cable between scs-a8-eqiad:port2 and ps1-a2-eqiad.
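The swap-based elimination above can be sketched as a small function: any component that appears in a passing test is cleared, and whatever remains from the failing path is the suspect. This assumes a single faulty component; `isolate_fault` and the component labels (`cable-a2`, `cable-a3`) are hypothetical names for illustration.

```python
from typing import Iterable, List, Set, Tuple


def isolate_fault(tests: Iterable[Tuple[Set[str], bool]]) -> List[str]:
    """Return components seen in failing paths but never in a passing one.

    Each test is (components on the path, whether the connection worked).
    Assumes a single faulty component.
    """
    cleared: Set[str] = set()
    suspects: Set[str] = set()
    for components, passed in tests:
        if passed:
            cleared.update(components)
        else:
            suspects.update(components)
    return sorted(suspects - cleared)


# The four tests from the list above, in order.
TESTS = [
    ({"scs-a8:2", "cable-a2", "ps1-a2-eqiad"}, False),
    ({"scs-a8:3", "cable-a3", "ps1-a3-eqiad"}, True),
    ({"scs-a8:3", "cable-a3", "ps1-a2-eqiad"}, True),
    ({"scs-a8:2", "cable-a3", "ps1-a2-eqiad"}, True),
]
```

Running `isolate_fault(TESTS)` leaves only the original cable, which matches the conclusion reached on site.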

The end result is that I'll make a sub-task for that repair to take place. While we had working serial during the testing, we went ahead and set up the network and unblocked this deployment.

Please note that with the temporary serial run, we went ahead and set up ps1-a2-eqiad.

The existing serial needs to be fixed though.

I should not have resolved this so quickly, as it needs a few other things handled.

I just went ahead and renamed the old PDU to its asset tag and updated the hostname for the new PDU, ps1-a2-eqiad, in netbox.

However, I did not add in ps2-a2-eqiad, as I cannot tell what its asset tag or serial number is from polling the device. (I was able to get the serial for ps1, and update netbox.)

Please update ps2-a2-eqiad with whatever PDU link tower/serial number/asset tag is there. IIRC you already put all of the PDU towers into netbox with serial number + asset tag, so you just need to update the installed one with the hostname and location.

updated ps2-a2-eqiad and location set to active.

RobH closed subtask T235190: fix serial connection for ps1-a2-eqiad as Resolved.Oct 24 2019, 6:44 PM
