
Install new PDUs in rows A/B (Top level tracking task)
Open, Normal, Public, 0 Story Points

Description

This task will track the overall upgrade of racks in rows A and B (other than b5-eqiad, which was already upgraded via T223126). All of the PDUs on this task were ordered via T223461.

Please note that of the 15 sets of PDUs, two sets are for the networking racks and have the special flexible power sockets needed for the C19 requirements of the MX480s.

Since each rack will have a checklist of servers unique to that rack, this task will simply list each of those sub-tasks as they are created.

Every attempt will be made to ensure hardware with redundant power doesn't lose power entirely; such hardware should be able to migrate without downtime. However, mistakes happen, and due to cabling constraints, accidental power loss may occur.

Scheduling: Upgrades will continue per the schedule below, between Sept 10 and Nov 4, targeting Tuesdays and Thursdays (one rack per day) at 11am UTC (7am ET).

Racks for upgrade:

List of all EQIAD racks in netbox

Row A:
A1 - Networking Rack - T226782 - Thursday, 9/12 11am UTC (7a-10a ET)
A2 - 10G Rack - T227138 - Tuesday, 10/8 11am UTC (7a-10a ET)
A3 - 1G Rack - T227139 - DONE
A4 - 10G Rack - T227140 - DONE
A5 - 1G Rack - T227141 - DONE
A6 - 1G Rack - T227142 - Tuesday, 10/22 11am UTC (7a-10a ET)
A7 - 10G Rack - T227143 - DONE
A8 - Networking Rack - T227133 - Thursday, 9/19 11am UTC (7a-10a ET)

Row B: - SCHEDULED FOR 2019-07-24 THROUGH 2019-07-24
B1 - T227536 - Thursday, 10/10 11am UTC (7a-10a ET)
B2 - 10G Rack - T227538 - Tuesday, 10/29 11am UTC (7a-10a ET)
B3 - T227539 - Tuesday, 9/17 11am UTC (7a-10a ET)
B4 - 10G Rack - T227540 - Thursday, 10/24 11am UTC (7a-10a ET)
B5 - DONE (already upgraded via T223126)
B6 - T227541 - Tuesday, 9/10 11am UTC (7a-10a ET)
B7 - 10G Rack - T227542 - 11/5 11am UTC (7a-10a ET)
B8 - T227543 - Thursday, 10/31 11am UTC (7a-10a ET)

Please note the PDUs themselves will be recorded into netbox with their asset tags & serials (no hostnames) via T229284.


Event Timeline

RobH triaged this task as Normal priority. Jun 27 2019, 10:06 PM
RobH created this task.
RobH updated the task description. (Show Details) Jun 27 2019, 10:27 PM
RobH updated the task description. (Show Details)
RobH updated the task description. (Show Details) Jul 2 2019, 7:02 PM
RobH updated the task description. (Show Details) Jul 2 2019, 8:02 PM
RobH updated the task description. (Show Details) Jul 2 2019, 8:08 PM
RobH renamed this task from install new PDUs in rows A/B (Top level tracking task) to (July 22-26) install new PDUs in rows A/B (Top level tracking task). Jul 2 2019, 8:39 PM
RobH set Due Date to Jul 26 2019, 12:00 AM.
Restricted Application changed the subtype of this task from "Task" to "Deadline". · View Herald Transcript Jul 2 2019, 8:40 PM
RobH updated the task description. (Show Details) Jul 8 2019, 10:43 PM
RobH updated the task description. (Show Details) Jul 8 2019, 10:50 PM

Please note I've chatted with @fgiunchedi about ms-be systems, and the preferred method of handling them in any rack where we are doing PDU swaps is to downtime the host in icinga and then power it off. Perform the PDU swaps and, once fully done, power the host back up; it will run puppet and re-pool itself.

db1081 and db1075 are primary masters, so if we are not fully sure no power will be lost, I'd rather do other racks first.
Racks on row A that are good to go:
A3: has one active dbproxy (dbproxy1001) that I could fail over tomorrow, and then it should be good to go.
A4: good to go
A5: good to go if done before Thursday 30th, as db1128 will become a master that day (T228243)
A7: good to go
From row B:
B1: good to go
B2: good to go after Thursday 25th, as we are failing over that host that day (T228243)
B3: it has the m5 master, which is mostly used by wikitech and the cloud team, so you might want to ping them. From the DBAs' side it is good to go.
B4: good to go
B6: good to go
B7: good to go
B8: it has the m2 master, which is mostly used by recommendationsapi, otrs, and debmonitor, so if those stakeholders are ok, that is fine from a DBA point of view. Tags should be: OTRS Recommendation-API SRE-tools

RobH added a subscriber: ayounsi. Jul 22 2019, 6:55 PM

Question for @ayounsi:

As long as we define librenms in the syslog settings (and copy over the proper SNMP settings and the like), will these all report in correctly as we replace them?

RobH closed this task as Resolved. Jul 22 2019, 6:56 PM
RobH reopened this task as Open.


@RobH I have failed over dbproxy1001 to dbproxy1006 so A3 is also good to go.

RobH updated the task description. (Show Details) Jul 23 2019, 1:36 PM

The "easier" racks for us in row B are B3 and B6. I propose we start with these.

Rack B3 contains cloudvirt1027, and we would like to reallocate at least these VMs before doing the operations:

  • tools-puppetmaster-01
  • tools-docker-registry-04
  • proxy-01

Rack B6 contains cloudvirt1029, and just noting this important VM here:

  • clouddb-wikilabels-02 (this VM is currently the secondary in the wikilabels DB cluster)

I can have those servers ready to go tomorrow, 2019-07-24, before your awake hours, @RobH. Also, I need to send an email announcement to affected users (there are plenty of other VMs on those servers).

RobH updated the task description. (Show Details)

@RobH if you guys don't have any preference on which rack to start with...from the DB side, B3 can be a good option if it can be done before Tuesday 30th.
A month ago we scheduled a failover (T227062) for our s8 (wikidata) primary db master, and the new master (db1104) will be in B3, so if that rack can be done before Tuesday 30th, that's one less master we need to worry about :)

Not sure if known or expected already, but phase checks for new PDUs A3/A4/A5/A7 show up in icinga as UNKNOWN with External command error: Error in packet

Seems like we'll have to adapt the icinga checks / monitoring, filed as T229101: Phase monitoring for new PDUs

RobH removed Due Date. Jul 26 2019, 1:36 PM
Restricted Application changed the subtype of this task from "Deadline" to "Task". · View Herald Transcript Jul 26 2019, 1:36 PM
RobH renamed this task from (July 22-26) install new PDUs in rows A/B (Top level tracking task) to Install new PDUs in rows A/B (Top level tracking task). Jul 26 2019, 1:36 PM
RobH moved this task from Backlog to Blocked on the ops-eqiad board.
RobH updated the task description. (Show Details) Jul 29 2019, 10:17 PM
RobH added a comment. Edited Jul 31 2019, 5:00 PM

In reviewing the comments of T227138#5354060 and T226778#5358383, and in my IRC discussions with @wiki_willy, I propose the following schedule of rack swaps and cadence options.

Scheduling (Chris & James):

The current plan is one rack swap per day, allowing time for service migrations between racks and not requiring SRE sub-teams to be online and in attendance for more than a few hours. The preference is for Tuesday and Thursday, but this won't work for every week due to the Chris/James overlap required.

What to do for your work?

If work is occurring in a rack that you have a server or service in, you will need to review the level of redundancy and crash recovery within your service. While we attempt to prevent power loss, accidents happen, and we're working in live racks with many, many cables routed through them. Some services depool a server (cp, ms-fe) and leave it online, while others shut down power and services on a server (ms-be), or simply shift it away from master usage (db).

2019-08-13 - Tuesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
B6 - T227541

B6 is listed as not having any DB masters, and is one of the two easiest racks for Cloud Services. Everyone will need to review T227541 to see if one of your servers/services runs on that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-14 - Wednesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
A1 - T226782

A1 is one of our two primary network racks. DC-ops and Netops were ready to swap this rack, but it was postponed to ensure full review of all services within it before the window occurs. Please review T226782 to see if one of your servers/services runs on that rack, and if so, take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-20 - Tuesday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)

B3 - T227539

B3 is listed by Cloud Services as their second easiest rack in row B via T226778#5358383. Cloud Services will need to migrate some items, and the DBA team may need to migrate the wikitech master (up to Cloud Services per T227138#5354060?). Everyone will need to review T227539 to see if one of your servers/services runs on that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

2019-08-22 - Thursday - 14:00 GMT (10:00 Eastern) to 17:00 GMT (13:00 Eastern)
A8 - T227133

A8 is the second of our two primary networking racks. Everyone will need to review T227133 to see if one of your servers/services runs on that rack, and take precautions for the migration based on the level of redundancy/crash recovery for your server/service.

On the dates you mention, the WMCS team will be barely available because of travel/Wikimania/offsites, etc. Since the racks are "easy" for us, this shouldn't be a blocker, though. Our servers are mostly ready for the operations, and we will re-review them a day before to ensure nothing new (no important VM) has been scheduled to run there.
So, ACK, good to go.

When the time comes to upgrade the PDUs, puppet should be updated too to reflect the new reality, specifically the facilities module: either add model => 'sentry4' to an existing pdu entry, or add a brand new entry when we're adding new PDUs (e.g. ulsfo). I don't know the best way to include the above step when performing the work, but I'm noting it here; let me know if there's a better way!
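For illustration only, a minimal sketch of what such an entry could look like; the define name and parameters here are assumptions for this example, and the actual resource in the operations/puppet facilities module may be shaped differently:

```
# Hypothetical sketch; the real define and parameter names in the
# facilities module may differ.
facilities::monitor_pdu { 'ps1-b6-eqiad':
    site  => 'eqiad',
    row   => 'b',
    model => 'sentry4',  # new Sentry 4 PDUs; pre-existing entries keep the older model
}
```

Either way, the point is the same: whatever entry drives monitoring for a given PDU needs to record that it is now a Sentry 4 unit.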

Cmjohnson updated the task description. (Show Details) Aug 15 2019, 3:18 PM
wiki_willy updated the task description. (Show Details) Aug 15 2019, 5:28 PM
RobH added a comment. Edited Tue, Sep 17, 10:26 PM

It seems that when the new PDU goes into place, it fails the icinga checks for:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ps1-a7-eqiad

ps1-a7-eqiad-infeed-load-tower-A-phase-X
ps1-a7-eqiad-infeed-load-tower-A-phase-Y
ps1-a7-eqiad-infeed-load-tower-A-phase-Z
ps1-a7-eqiad-infeed-load-tower-B-phase-X
ps1-a7-eqiad-infeed-load-tower-B-phase-Y
ps1-a7-eqiad-infeed-load-tower-B-phase-Z

This happens across all the new PDU towers as they come online in icinga and clear their ping check.
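For context, a rough sketch of how per-tower / per-phase services like the ones listed above could be declared in puppet; the define and check command names here are assumptions for illustration, not the actual resources, and adapting the checks for the new PDUs is tracked in T229101/T229328:

```
# Illustrative only: a hypothetical declaration of the per-tower/per-phase
# infeed load checks named above; actual resources and check names may differ.
['A', 'B'].each |$tower| {
    ['X', 'Y', 'Z'].each |$phase| {
        monitoring::service { "ps1-a7-eqiad-infeed-load-tower-${tower}-phase-${phase}":
            description   => "ps1-a7-eqiad infeed load, tower ${tower} phase ${phase}",
            check_command => "check_pdu_phase!ps1-a7-eqiad!${tower}!${phase}",  # hypothetical SNMP-based check
        }
    }
}
```

The "External command error: Error in packet" output likely indicates the underlying SNMP query is failing against the new PDUs, which is consistent with the checks needing to be adapted for the new model.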

Seems we already have T229328.