
Install new PDUs in rows A/B (Top level tracking task)
Closed, Resolved · Public · 0 Estimated Story Points

Description

This task will track the overall upgrade of racks in rows A and B (other than b5-eqiad, which was already upgraded via T223126). All of the PDUs on this task were ordered via T223461.

Please note that of the 15 sets of PDUs, two sets are for the networking racks and have special flexible power sockets to meet the C19 requirements of the MX480s.

Since each rack will have a checklist of servers unique to that rack, this task will simply list each of those sub-tasks as they are created.

Every attempt will be made to ensure hardware with redundant power doesn't lose power entirely; such hardware should be able to migrate without downtime. However, mistakes happen, and due to cabling constraints, accidental power loss may occur.

Scheduling: Upgrades will continue based on the schedule below, between Sept 10 and Nov 4, targeting Tuesdays and Thursdays (one rack per day) at 11am UTC (7am ET).

Racks for upgrade:

List of all EQIAD racks in netbox

Row A:
A1 - Networking Rack - T226782 - Date TBD
A2 - 10G Rack - T227138 - Tuesday, 10/8 11am UTC (7a-10a ET)
A3 - 1G Rack - T227139 - DONE
A4 - 10G Rack - T227140 - DONE
A5 - 1G Rack - T227141 - DONE
A6 - 1G Rack - T227142 - Tuesday, 10/22 11am UTC (7a-10a ET)
A7 - 10G Rack - T227143 - DONE
A8 - Networking Rack - T227133 - Date TBA

Row B:
B1 - T227536 - Thursday, 10/10 11am UTC (7a-10a ET)
B2 - 10G Rack - T227538 - Tuesday, 10/29 11am UTC (7a-10a ET)
B3 - T227539 - DONE
B4 - 10G Rack - T227540 - Thursday, 10/24 11am UTC (7a-10a ET)
B5 - complete - DONE
B6 - T227541 - Tuesday, 9/10 11am UTC (7a-10a ET)
B7 - 10G Rack - T227542 - 11/5 11am UTC (7a-10a ET)
B8 - T227543 - Thursday, 10/31 11am UTC (7a-10a ET)

Please note the PDUs themselves will be recorded into netbox with their asset tags & serials (no hostnames) via T229284.

Event Timeline

RobH updated the task description.
RobH renamed this task from install new PDUs in rows A/B (Top level tracking task) to (July 22-26) install new PDUs in rows A/B (Top level tracking task). Jul 2 2019, 8:39 PM
RobH set Due Date to Jul 26 2019, 12:00 AM.
Restricted Application changed the subtype of this task from "Task" to "Deadline". Jul 2 2019, 8:40 PM

Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in any rack where we are doing PDU swaps is to downtime the host in Icinga and then power it off. Perform the PDU swap, and once it is fully done, power the host back up; it will run puppet and re-pool itself.
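For illustration, here is a minimal sketch of that power-off/power-on sequence using ipmitool against the host's management interface. This is not the production tooling used for these swaps; the management-network suffix, credential handling, and host names are assumptions, and the Icinga downtime is assumed to have been scheduled separately beforehand.

```python
#!/usr/bin/env python3
"""Rough sketch of the ms-be handling described above: power the host off via
its management interface before the PDU swap, and power it back on afterwards.
Host names, the management suffix, and credentials are placeholders."""
import subprocess
import sys

MGMT_SUFFIX = ".mgmt.eqiad.wmnet"   # assumed management-network naming
IPMI_USER = "root"                  # placeholder account

def ipmi(host, *args):
    """Run an ipmitool chassis command against the host's management interface."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host + MGMT_SUFFIX,
           "-U", IPMI_USER, "-E",      # -E reads the password from the IPMI_PASSWORD env var
           "chassis", *args]
    subprocess.run(cmd, check=True)

def main():
    host, action = sys.argv[1], sys.argv[2]   # e.g. ms-be1040 off|on|status
    if action == "off":
        # Icinga downtime should already be in place for the host at this point.
        ipmi(host, "power", "soft")   # graceful ACPI shutdown; use "off" to force
    elif action == "on":
        ipmi(host, "power", "on")     # host boots, runs puppet, and re-pools itself
    else:
        ipmi(host, "power", "status")

if __name__ == "__main__":
    main()
```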

db1081 and db1075 are primary masters, so if we are not fully sure no power will be lost, I would rather do other racks first.
Racks on row A that are good to go:

A3: has one active dbproxy (dbproxy1001) that I could fail over tomorrow; then it should be good to go.
A4: good to go
A5: good to go if done before Thursday 30th, as db1128 will become a master that day (T228243)
A7: good to go

From row B:
B1: good to go
B2: good to go after Thursday 25th, as we are failing over that host that day (T228243)
B3: it has the m5 master, which is mostly used by wikitech and the cloud team, so you might want to ping them. From the DBAs' side it is good to go.
B4: good to go
B6: good to go
B7: good to go
B8: it has the m2 master, which is mostly used by recommendationsapi, otrs, and debmonitor, so if those stakeholders are OK with it, that is fine from a DBA point of view. Tags should be: Znuny Recommendation-API SRE-tools

Question for @ayounsi:

As long as we define librenms in the syslog settings (and set the proper SNMP configuration and the like), will these all report in correctly as we replace them?

RobH reopened this task as Open.

@RobH I have failed over dbproxy1001 to dbproxy1006 so A3 is also good to go.

The more "easy" racks for us in row B are B3 and B6. I propose we start with these.

Rack B3 contains cloudvirt1027, and we would like to reallocate at least these VMs before doing the operations:

  • tools-puppetmaster-01
  • tools-docker-registry-04
  • proxy-01

Rack B6 contains cloudvirt1029; just noting this important VM here:

  • clouddb-wikilabels-02 (this VM is currently the secondary in the wikilabels DB cluster)

I can have those servers ready to go tomorrow, 2019-07-24, before your waking hours, @RobH. I also need to send an email announcement to affected users (there are plenty of other VMs on those servers).

@RobH, if you guys don't have any preference on which rack to start with: from the DB side, B3 would be a good option if it can be done before Tuesday 30th.
A month ago we scheduled a failover (T227062) for our s8 (wikidata) primary db master, and the new master (db1104) will be in B3, so if that rack can be done before Tuesday 30th, that's one less master we need to worry about :)

Not sure if known or expected already, but phase checks for the new PDUs A3/A4/A5/A7 show up in Icinga as UNKNOWN with "External command error: Error in packet".

Seems like we'll have to adapt the Icinga checks / monitoring; filed as T229101: Phase monitoring for new PDUs

Restricted Application changed the subtype of this task from "Deadline" to "Task". Jul 26 2019, 1:36 PM
RobH renamed this task from (July 22-26) install new PDUs in rows A/B (Top level tracking task) to Install new PDUs in rows A/B (Top level tracking task). Jul 26 2019, 1:36 PM
RobH moved this task from Backlog to Blocked on the ops-eqiad board.

In reviewing the comments of T227138#5354060 and T226778#5358383, and in my IRC discussions with @wiki_willy, I propose the following schedule of rack swaps and cadence options.

Scheduling (Chris & James):

The current plan is one rack swap per day, allowing time for service migrations between racks and not requiring SRE sub-teams to be online and attending for more than a few hours. Preference is for Tuesday and Thursday, but this won't work every week due to the Chris/James overlap required.

What to do for your work?

If work is occurring in a rack that you have a server or service in, you will need to review the level of redundancy and crash recovery within your service. While we attempt to prevent power loss, accidents happen and we're working in live racks with many, many cables routed through them. Some services depool a server and leave it online (cp, ms-fe), while others shut down power and services on a server (ms-be) or simply shift it away from master usage (db).

2019-08-13 - Tuesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
B6 - T227541

B6 is listed as not having any DB masters, and it is one of the two easiest racks for Cloud Services. Everyone will need to review T227541 to see if one of your servers/services runs in that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-14 - Wednesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
A1 - T226782

A1 is one of our two primary network racks. DC-ops and Netops were ready to move on this one, but it was postponed to ensure a full review of all services within it before the window occurs. Please review T226782 to see if one of your servers/services runs in that rack, and if so, take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-20 - Tuesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)

B3 - T227539

B3 is listed by Cloud Services as their second-easiest rack in row B via T226778#5358383. Cloud Services will need to migrate some items, and the DBA team may need to migrate the wikitech master (up to Cloud Services per T227138#5354060?). Everyone will need to review T227539 to see if one of your servers/services runs in that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-22 - Thursday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
A8 - T227133

A8 is the second of our two primary networking racks. Everyone will need to review T227133 to see if one of your servers/services runs in that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

On the dates you mention, the WMCS team will be barely available because of travel/Wikimania/offsites, etc. Since the racks are "easy" for us, this shouldn't be a blocker, though. Our servers are mostly ready for the operations, and we will re-review them a day before to ensure no new important VMs have been scheduled to run there.
So, ACK, good to go.

When the time comes to upgrade PDUs, Puppet should be updated too to reflect the new reality, specifically the facilities module: either add model => 'sentry4' to an existing PDU entry, or add a brand-new entry when we're adding new PDUs (e.g. ulsfo). I don't know the best way to include the above step when performing the work, but I'm noting it here; let me know if there's a better way!

It seems that when the new PDU goes into place, it fails the icinga checks for:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-a7-eqiad

ps1-a7-eqiad-infeed-load-tower-A-phase-X
ps1-a7-eqiad-infeed-load-tower-A-phase-Y
ps1-a7-eqiad-infeed-load-tower-A-phase-Z
ps1-a7-eqiad-infeed-load-tower-B-phase-X
ps1-a7-eqiad-infeed-load-tower-B-phase-Y
ps1-a7-eqiad-infeed-load-tower-B-phase-Z

This happens across all the new PDU towers as they come online in Icinga and clear their ping check.

Seems we already have T229328.

For additional context, the UNKNOWN phase monitoring for new PDUs is tracked at T229101: Phase monitoring for new PDUs. The reason, AFAICT, is the SNMP OID change from sentry3 to sentry4 for phases, which will need adjusting in the checks too (a related but different issue from T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring).
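For illustration, one quick way to confirm that OID change on a given unit is to walk the Server Technology enterprise subtrees for both MIB generations; a sentry3-style check pointed at a sentry4 PDU finds nothing under the old subtree. Below is a minimal sketch using the Net-SNMP CLI; the community string is a placeholder, the host naming is an assumption, and the exact phase-load leaf OIDs should be taken from the Sentry3/Sentry4 MIB files rather than from this example.

```python
#!/usr/bin/env python3
"""Sketch: check whether a PDU answers under the Sentry3 or Sentry4 MIB subtree.
Requires the Net-SNMP CLI tools (snmpwalk). The community string is a placeholder."""
import subprocess
import sys

# Server Technology enterprise subtrees; verify leaf OIDs against the MIB files.
SUBTREES = {
    "sentry3": "1.3.6.1.4.1.1718.3",
    "sentry4": "1.3.6.1.4.1.1718.4",
}
COMMUNITY = "not-the-real-community"   # placeholder

def answers(host, oid):
    """True if the agent returns at least one varbind under the given subtree."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", host, oid],
        capture_output=True, text=True)
    # With -On every returned varbind is printed numerically, e.g. ".1.3.6.1.4.1.1718.4...."
    return any(line.startswith("." + oid) for line in out.stdout.splitlines())

def main():
    host = sys.argv[1]   # e.g. ps1-a7-eqiad (reachable name is an assumption)
    for name, oid in SUBTREES.items():
        print(f"{host}: {name} subtree {'answers' if answers(host, oid) else 'is empty'}")

if __name__ == "__main__":
    main()
```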

Can I suggest a few modifications to the PDU swap checklist of each task? Mostly to clear out the alerting noise.
Under: "schedule downtime for the entire list of switches and servers"
Add:
[] Downtime PDUs in Icinga for the duration of the maintenance, plus time for the new one to get re-configured (a sketch follows below)
I know this can be controversial, as people use Icinga in different ways, but I believe this is best practice.

Then add the following at the end of the checklist:

[] Reconfigure new PDU (network, SNMP, etc...)
[] Update the Icinga check config in Puppet and set the model to sentry4 - see for example: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/542321
[] Check that both Icinga and LibreNMS are all green
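To illustrate the first suggested item, here is a minimal sketch of scheduling that downtime through Icinga's external command interface. This is not the production workflow; the command-file path is an assumption (it varies per installation), and the window length is just an example.

```python
#!/usr/bin/env python3
"""Sketch: schedule Icinga downtime for a PDU host covering the maintenance
window plus time for the new PDU to be re-configured. The command-file path
varies per installation and is an assumption here."""
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed path to Icinga's external command pipe

def downtime_pdu(host, hours, author="pdu-swap", comment="PDU swap + reconfiguration"):
    """Schedule fixed downtime for the host and all of its services."""
    now = int(time.time())
    start, end = now, now + int(hours * 3600)
    duration = end - start
    with open(CMD_FILE, "w") as cmd:
        # Standard Nagios/Icinga external command syntax:
        # COMMAND;host;start;end;fixed;trigger_id;duration;author;comment
        for command in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
            cmd.write(f"[{now}] {command};{host};{start};{end};1;0;{duration};{author};{comment}\n")

if __name__ == "__main__":
    # e.g. three hours of maintenance plus an hour for the new PDU to be re-configured
    downtime_pdu("ps1-b1-eqiad", hours=4)
```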

Hi @ayounsi - I talked to a couple of other people who had the same concern the other day, and I agree as well... so I started scheduling downtime for the PDU alerts in Icinga, beginning with today's B1 PDU upgrade, and will continue for the remaining PDU swaps. Thanks, Willy

wiki_willy assigned this task to Cmjohnson.
wiki_willy added a subscriber: Jclark-ctr.

Resolving the parent task for the PDU upgrades. Much appreciation to @Cmjohnson and @Jclark-ctr for taking care of these. Thanks, Willy