
Install new PDUs in rows A/B (Top level tracking task)
Closed, Resolved · Public · 0 Estimated Story Points

Description

This task will track the overall upgrade of racks in rows A and B (other than b5-eqiad, which was already upgraded via T223126). All of the PDUs on this task were ordered via T223461.

Please note that of the 15 sets of PDUs, two sets are for the networking racks and have special flexible power sockets to meet the C19 requirements of the MX480s.

Since each rack will have a checklist of servers unique to that rack, this task will simply list each of those sub-tasks as they are created.

Every attempt will be made to ensure hardware with redundant power doesn't lose power entirely; such hardware should be able to migrate without downtime. However, mistakes happen, and due to cabling constraints, accidental power loss may occur.

Scheduling: Upgrades will continue based on the schedule below, between Sept 10 and Nov 4, targeting Tuesdays and Thursdays (one rack per day) at 11am UTC (7am ET).

Racks for upgrade:

List of all EQIAD racks in netbox

Row A:
A1 - Networking Rack - T226782 - Date TBD
A2 - 10G Rack - T227138 - Tuesday, 10/8 11am UTC (7a-10a ET)
A3 - 1G Rack - T227139 - DONE
A4 - 10G Rack - T227140 - DONE
A5 - 1G Rack - T227141 - DONE
A6 - 1G Rack - T227142 - Tuesday, 10/22 11am UTC (7a-10a ET)
A7 - 10G Rack - T227143 - DONE
A8 - Networking Rack - T227133 - Date TBA

Row B:
B1 - T227536 - Thursday, 10/10 11am UTC (7a-10a ET)
B2 - 10G Rack - T227538 - Tuesday, 10/29 11am UTC (7a-10a ET)
B3 - T227539 - DONE
B4 - 10G Rack - T227540 - Thursday, 10/24 11am UTC (7a-10a ET)
B5 - complete - DONE
B6 - T227541 - Tuesday, 9/10 11am UTC (7a-10a ET)
B7 - 10G Rack - T227542 - 11/5 11am UTC (7a-10a ET)
B8 - T227543 - Thursday, 10/31 11am UTC (7a-10a ET)

Please note the PDUs themselves will be recorded into netbox with their asset tags & serials (no hostnames) via T229284.

Event Timeline

RobH updated the task description.
RobH renamed this task from install new PDUs in rows A/B (Top level tracking task) to (July 22-26) install new PDUs in rows A/B (Top level tracking task). Jul 2 2019, 8:39 PM
RobH set Due Date to Jul 26 2019, 12:00 AM.
Restricted Application changed the subtype of this task from "Task" to "Deadline". Jul 2 2019, 8:40 PM

Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of dealing with them in any rack where we are doing PDU swaps is to downtime the host in Icinga and then power it off. Perform the PDU swap, and once it is fully done, power the host back up; it will run puppet and re-pool itself.
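For illustration, here is a minimal sketch of that power-off/power-on sequence using ipmitool against the host's management interface. This is not the production tooling used for these swaps; the management-network suffix, credential handling, and host names are assumptions, and the Icinga downtime is assumed to have been scheduled separately beforehand.

```python
#!/usr/bin/env python3
"""Rough sketch of the ms-be handling described above: power the host off via
its management interface before the PDU swap, and power it back on afterwards.
Host names, the management suffix, and credentials are placeholders."""
import subprocess
import sys

MGMT_SUFFIX = ".mgmt.eqiad.wmnet"   # assumed management-network naming
IPMI_USER = "root"                  # placeholder account

def ipmi(host, *args):
    """Run an ipmitool chassis command against the host's management interface."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host + MGMT_SUFFIX,
           "-U", IPMI_USER, "-E",      # -E reads the password from the IPMI_PASSWORD env var
           "chassis", *args]
    subprocess.run(cmd, check=True)

def main():
    host, action = sys.argv[1], sys.argv[2]   # e.g. ms-be1040 off|on|status
    if action == "off":
        # Icinga downtime should already be in place for the host at this point.
        ipmi(host, "power", "soft")   # graceful ACPI shutdown; use "off" to force
    elif action == "on":
        ipmi(host, "power", "on")     # host boots, runs puppet, and re-pools itself
    else:
        ipmi(host, "power", "status")

if __name__ == "__main__":
    main()
```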

db1081 and db1075 are primary masters, so if we are not fully sure no power will be lost, I would rather do other racks first.
Racks on row A that are good to go:

A3: has one active dbproxy (dbproxy1001) that I could fail over tomorrow; then it should be good to go.
A4: good to go
A5: good to go if done before Thursday 30th, as db1128 will become a master that day (T228243)
A7: good to go

From row B:
B1: good to go
B2: good to go after Thursday 25th, as we are failing over that host that day (T228243)
B3: it has the m5 master, which is mostly used by wikitech and the cloud team, so you might want to ping them. From the DBAs' side it is good to go.
B4: good to go
B6: good to go
B7: good to go
B8: it has the m2 master, which is mostly used by recommendationsapi, otrs, and debmonitor, so if those stakeholders are OK with it, that is fine from a DBA point of view. Tags should be: Znuny Recommendation-API SRE-tools

Question for @ayounsi:

As long as we define librenms in the syslog settings (and set the proper SNMP configuration and the like), will these all report in correctly as we replace them?

RobH reopened this task as Open.

@RobH I have failed over dbproxy1001 to dbproxy1006 so A3 is also good to go.

The more "easy" racks for us in row B are B3 and B6. I propose we start with these.

Rack B3 contains cloudvirt1027, and we would like to reallocate at least these VMs before doing the operations:

  • tools-puppetmaster-01
  • tools-docker-registry-04
  • proxy-01

Rack B6 contains cloudvirt1029; just noting this important VM here:

  • clouddb-wikilabels-02 (this VM is currently the secondary in the wikilabels DB cluster)

I can have those servers ready to go tomorrow, 2019-07-24, before your waking hours, @RobH. I also need to send an email announcement to affected users (there are plenty of other VMs on those servers).

@RobH, if you guys don't have any preference on which rack to start with: from the DB side, B3 would be a good option if it can be done before Tuesday 30th.
A month ago we scheduled a failover (T227062) for our s8 (wikidata) primary db master, and the new master (db1104) will be in B3, so if that rack can be done before Tuesday 30th, that's one less master we need to worry about :)

Not sure if known or expected already, but phase checks for the new PDUs A3/A4/A5/A7 show up in Icinga as UNKNOWN with "External command error: Error in packet".

Seems like we'll have to adapt the Icinga checks / monitoring; filed as T229101: Phase monitoring for new PDUs

Restricted Application changed the subtype of this task from "Deadline" to "Task". Jul 26 2019, 1:36 PM
RobH renamed this task from (July 22-26) install new PDUs in rows A/B (Top level tracking task) to Install new PDUs in rows A/B (Top level tracking task). Jul 26 2019, 1:36 PM
RobH moved this task from Backlog to Blocked on the ops-eqiad board.

In reviewing the comments of T227138#5354060 and T226778#5358383, and in my IRC discussions with @wiki_willy, I propose the following schedule of rack swaps and cadence options.

Scheduling (Chris & James):

The current plan is one rack swap per day, allowing time for service migrations between racks and not requiring SRE sub-teams to be online and attending for more than a few hours. Preference is for Tuesday and Thursday, but this won't work every week due to the Chris/James overlap required.

What to do for your work?

If work is occurring in a rack that you have a server or service in, you will need to review the level of redundancy and crash recovery within your service. While we attempt to prevent power loss, accidents happen and we're working in live racks with many, many cables routed through them. Some services depool a server and leave it online (cp, ms-fe), while others shut down power and services on a server (ms-be) or simply shift it away from master usage (db).

2019-08-13 - Tuesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
B6 - T227541

B6 is listed as not having any DB masters, and it is one of the two easiest racks for Cloud Services. Everyone will need to review T227541 to see if one of your servers/services runs in that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-14 - Wednesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
A1 - T226782

A1 is one of our two primary network racks. DC-ops and Netops were ready to move on this one, but it was postponed to ensure a full review of all services within it before the window occurs. Please review T226782 to see if one of your servers/services runs in that rack, and if so, take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-20 - Tuesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)

B3 - T227539

B3 is listed by Cloud Services as their second-easiest rack in row B via T226778#5358383. Cloud Services will need to migrate some items, and the DBA team may need to migrate the wikitech master (up to Cloud Services per T227138#5354060?). Everyone will need to review T227539 to see if one of your servers/services runs in that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-22 - Thursday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
A8 - T227133

A8 is the second of our two primary networking racks. Everyone will need to review T227133 to see if one of your servers/services runs in that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

On the dates you mention, the WMCS team will be barely available because of travel/Wikimania/offsites, etc. Since the racks are "easy" for us, this shouldn't be a blocker, though. Our servers are mostly ready for the operations, and we will re-review them a day before to ensure no new important VMs have been scheduled to run there.
So, ACK, good to go.

When the time comes to upgrade PDUs, Puppet should be updated too to reflect the new reality, specifically the facilities module: either add model => 'sentry4' to an existing PDU entry, or add a brand-new entry when we're adding new PDUs (e.g. ulsfo). I don't know the best way to include the above step when performing the work, but I'm noting it here; let me know if there's a better way!

It seems that when the new PDU goes into place, it fails the icinga checks for:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-a7-eqiad

ps1-a7-eqiad-infeed-load-tower-A-phase-X
ps1-a7-eqiad-infeed-load-tower-A-phase-Y
ps1-a7-eqiad-infeed-load-tower-A-phase-Z
ps1-a7-eqiad-infeed-load-tower-B-phase-X
ps1-a7-eqiad-infeed-load-tower-B-phase-Y
ps1-a7-eqiad-infeed-load-tower-B-phase-Z

This happens across all the new PDU towers as they come online in Icinga and clear their ping check.

Seems we already have T229328.

For additional context, the UNKNOWN phase monitoring for new PDUs is tracked at T229101: Phase monitoring for new PDUs. The reason, AFAICT, is the SNMP OID change from sentry3 to sentry4 for phases, which will need adjusting in the checks too (a related but different issue from T148541: Replace Torrus with Prometheus snmp_exporter for PDUs monitoring).
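For illustration, one quick way to confirm that OID change on a given unit is to walk the Server Technology enterprise subtrees for both MIB generations; a sentry3-style check pointed at a sentry4 PDU finds nothing under the old subtree. Below is a minimal sketch using the Net-SNMP CLI; the community string is a placeholder, the host naming is an assumption, and the exact phase-load leaf OIDs should be taken from the Sentry3/Sentry4 MIB files rather than from this example.

```python
#!/usr/bin/env python3
"""Sketch: check whether a PDU answers under the Sentry3 or Sentry4 MIB subtree.
Requires the Net-SNMP CLI tools (snmpwalk). The community string is a placeholder."""
import subprocess
import sys

# Server Technology enterprise subtrees; verify leaf OIDs against the MIB files.
SUBTREES = {
    "sentry3": "1.3.6.1.4.1.1718.3",
    "sentry4": "1.3.6.1.4.1.1718.4",
}
COMMUNITY = "not-the-real-community"   # placeholder

def answers(host, oid):
    """True if the agent returns at least one varbind under the given subtree."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", COMMUNITY, "-On", host, oid],
        capture_output=True, text=True)
    # With -On every returned varbind is printed numerically, e.g. ".1.3.6.1.4.1.1718.4...."
    return any(line.startswith("." + oid) for line in out.stdout.splitlines())

def main():
    host = sys.argv[1]   # e.g. ps1-a7-eqiad (reachable name is an assumption)
    for name, oid in SUBTREES.items():
        print(f"{host}: {name} subtree {'answers' if answers(host, oid) else 'is empty'}")

if __name__ == "__main__":
    main()
```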

Can I suggest a few modifications to the PDU swap checklist of each task? Mostly to clear out the alerting noise.
Under: "schedule downtime for the entire list of switches and servers"
Add:
[] Downtime PDUs in Icinga for the duration of the maintenance, plus time for the new one to get re-configured (a sketch follows below)
I know this can be controversial, as people use Icinga in different ways, but I believe this is best practice.

Then add the following at the end of the checklist:

[] Reconfigure new PDU (network, SNMP, etc...)
[] Update the Icinga check config in Puppet and set the model to sentry4 - see for example: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/542321
[] Check that both Icinga and LibreNMS are all green
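To illustrate the first suggested item, here is a minimal sketch of scheduling that downtime through Icinga's external command interface. This is not the production workflow; the command-file path is an assumption (it varies per installation), and the window length is just an example.

```python
#!/usr/bin/env python3
"""Sketch: schedule Icinga downtime for a PDU host covering the maintenance
window plus time for the new PDU to be re-configured. The command-file path
varies per installation and is an assumption here."""
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"   # assumed path to Icinga's external command pipe

def downtime_pdu(host, hours, author="pdu-swap", comment="PDU swap + reconfiguration"):
    """Schedule fixed downtime for the host and all of its services."""
    now = int(time.time())
    start, end = now, now + int(hours * 3600)
    duration = end - start
    with open(CMD_FILE, "w") as cmd:
        # Standard Nagios/Icinga external command syntax:
        # COMMAND;host;start;end;fixed;trigger_id;duration;author;comment
        for command in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
            cmd.write(f"[{now}] {command};{host};{start};{end};1;0;{duration};{author};{comment}\n")

if __name__ == "__main__":
    # e.g. three hours of maintenance plus an hour for the new PDU to be re-configured
    downtime_pdu("ps1-b1-eqiad", hours=4)
```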

Hi @ayounsi - I talked to a couple of other people who had the same concern the other day, and I agree as well... so I started scheduling downtime for the PDU alerts in Icinga, beginning with today's B1 PDU upgrade, and will continue for the remaining PDU swaps. Thanks, Willy

wiki_willy assigned this task to Cmjohnson.
wiki_willy added a subscriber: Jclark-ctr.

Resolving the parent task for the PDU upgrades. Much appreciation to @Cmjohnson and @Jclark-ctr for taking care of these. Thanks, Willy