
Install new PDUs into b5-eqiad
Closed, Resolved (Public)

Description

We ordered a single set of PDUs to test before ordering the rest of a batch to update rows A and B.

This single set will need to be installed into b5-eqiad. For this, each host in b5 will need to be prepared for a maintenance window during which power may be lost.

Proposed Window: Thursday, May 16th @ 09:00 Eastern / 13:00 GMT. The estimated window is 3 hours, but nothing is certain: this is the first PDU swap, and it will be used to judge the timeline for the full rows A and B swap.
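As a sanity check on the proposed window, the Eastern/GMT conversion can be verified with a short script (illustrative only; assumes Python 3.9+ with the `zoneinfo` module):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Proposed window start: Thursday, May 16th 2019, 09:00 US Eastern
start_local = datetime(2019, 5, 16, 9, 0, tzinfo=ZoneInfo("America/New_York"))
start_utc = start_local.astimezone(ZoneInfo("UTC"))

print(start_utc.strftime("%Y-%m-%d %H:%M %Z"))  # 2019-05-16 13:00 UTC
print(start_local.strftime("%A"))               # Thursday
```

In mid-May the US East Coast is on daylight time (UTC-4), so 09:00 Eastern does line up with 13:00 GMT.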

Netbox listing for b5-eqiad

Hosts to plan for downtime during a window:

Active hosts:
cloudvirt1014 - good to go per T223148: Cloud Services: reallocate workload from rack B5-eqiad
cloudvirt1028 - good to go per T223148: Cloud Services: reallocate workload from rack B5-eqiad
db1098 - non master, can depool with a few hours heads up per T223126#5177373
db1131 - non master, can depool with a few hours heads up per T223126#5177373
db1139 - non master, can depool with a few hours heads up per T223126#5177373
dbproxy1004 - not in use at the moment per T223126#5177373
dbproxy1005 - not in use at the moment per T223126#5177373
dbproxy1006 - active m1 proxy can fail over a day in advance per T223126#5177373
labweb1001 - good to go per T223148: Cloud Services: reallocate workload from rack B5-eqiad
ms-be1016 - will need to have swift + rsync stopped for good measure
ms-be1017 - will need to have swift + rsync stopped for good measure
ms-be1018 - will need to have swift + rsync stopped for good measure
ms-be1032 - will need to have swift + rsync stopped for good measure
ms-be1033 - will need to have swift + rsync stopped for good measure

Staged Host:
restbase1023 - staged per task T219404 but not yet in service (no data to lose, can just power off at start and power back on afterwards to make life easier.)
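Purely as an illustration, the preparation list above can be collapsed into a small grouped checklist so the prep work is visible per action (hostnames are from the task; the action strings are paraphrased):

```python
# Downtime checklist assembled from the task description above.
plan = {
    "cloudvirt1014": "good to go (T223148)",
    "cloudvirt1028": "good to go (T223148)",
    "db1098": "depool a few hours ahead",
    "db1131": "depool a few hours ahead",
    "db1139": "depool a few hours ahead",
    "dbproxy1004": "no action (not in use)",
    "dbproxy1005": "no action (not in use)",
    "dbproxy1006": "fail over m1 proxy a day in advance",
    "labweb1001": "good to go (T223148)",
    "ms-be1016": "stop swift + rsync",
    "ms-be1017": "stop swift + rsync",
    "ms-be1018": "stop swift + rsync",
    "ms-be1032": "stop swift + rsync",
    "ms-be1033": "stop swift + rsync",
    "restbase1023": "power off before, power on after (staged, T219404)",
}

# Group hosts by required action to see the prep work at a glance.
by_action = {}
for host, action in plan.items():
    by_action.setdefault(action, []).append(host)
for action, hosts in sorted(by_action.items()):
    print(f"{action}: {', '.join(sorted(hosts))}")
```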

Event Timeline

RobH created this task. May 13 2019, 4:20 PM
RobH triaged this task as High priority.
RobH updated the task description. (Show Details)
Marostegui added subscribers: jcrespo, Marostegui. Edited May 13 2019, 4:28 PM

Just checked the databases involved. They are easy to depool, we just need a couple of hours heads up.
dbproxy1006 is an active proxy for m1 but we can fail it over a day before with no issues.
dbproxy1004 and dbproxy1005 are not in use at the moment.

I am out Wednesday 15th, otherwise any other day works for me (cc @jcrespo)

RobH updated the task description. (Show Details) May 13 2019, 4:31 PM
RobH updated the task description. (Show Details) May 13 2019, 4:43 PM

CC @akosiaris @ayounsi @RobH regarding the m1 proxy, for potential (even if unlikely) impact on etherpad, bacula, puppet (the mysql database), librenms, racktables & rt.

fgiunchedi updated the task description. (Show Details) May 13 2019, 4:45 PM
fgiunchedi added a subscriber: fgiunchedi.

Actions for the ms-be hosts updated; to be on the safe side I'll stop swift + rsync in case power goes out. If it helps, I can power off the hosts too. What time is this activity scheduled for?
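The precaution described above could look roughly like the following. This is a dry-run sketch that only prints the commands it would run; the hostnames are from the task, while the `rsync` unit name and the use of `swift-init all stop` are assumptions (real Swift backends run several per-role services):

```shell
#!/bin/sh
# Dry-run: print the stop commands for each ms-be host in B5.
# Unit/command names are illustrative, not the exact procedure used.
for host in ms-be1016 ms-be1017 ms-be1018 ms-be1032 ms-be1033; do
  echo "ssh $host sudo systemctl stop rsync"
  echo "ssh $host sudo swift-init all stop"
done
```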

RobH added a comment. May 13 2019, 4:48 PM
This comment was removed by RobH.
RobH updated the task description. (Show Details) May 13 2019, 4:49 PM

Updated task description with maint window:

Proposed Window: Thursday, May 16th @ 0900 AM Eastern / 1300 GMT.

The Bacula & puppet databases are not going to exhibit any problems anyway. The puppet database is used only by servermon, which is to be uninstalled pretty soon, and backups don't happen during that time window.
Etherpad, given the software, is a best-effort service, so no guarantees there. It will probably crash anyway, be restarted by systemd (as it does every couple of days anyway), and users will be reconnected.

TL;DR: I'll be around anyway, but no actions need to be taken.

for m1 proxy for potential even if unlikely impact on etherpad, bacula, puppet (the mysql database) & librenms, racktables & rt.

It's fine for LibreNMS (we can schedule a downtime), and racktables is not in use anymore.

Change 509894 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/dns@master] m1 proxy: Switch to use dbproxy1001 in preparation for b5-eqiad maint

https://gerrit.wikimedia.org/r/509894

Andrew added a subscriber: Andrew. May 13 2019, 5:53 PM

Just to clarify -- best case (normal) scenario is no interruption? And worst case is... brief power interruption? Or no power for hours?

RobH added a comment. May 13 2019, 6:12 PM

Just to clarify -- best case (normal) scenario is no interruption? And worst case is... brief power interruption? Or no power for hours?

Any or all of those can happen, and we cannot really assign a probability to any of them.

The main issue is that Chris will be removing a dual-feed, double-wide PDU and replacing it with two single-feed, single-wide PDUs. This increases our horizontal redundancy, since the double-wide PDU is a single point of failure within its chassis.

Complicating matters, this is our first test of installing these PDUs into the racks. We don't know how easily they will fit, all while attempting to keep at least one of the two sides of the existing PDU online to prevent downtime.

Here is the kicker: each system is reduced to a single PSU during this process. If the remaining power cable is jostled (likely), it can become unplugged, and the server then loses all power.

Best case: no interruption for anything with dual power supplies
Likely case: a small number of systems have their power cables jostled and reboot due to single feed during migration
Worst case: complete rack power loss for the duration of the PDU swap.

RobH updated the task description. (Show Details) May 13 2019, 6:12 PM
RobH updated the task description. (Show Details) May 13 2019, 7:39 PM

Change 509894 merged by Jcrespo:
[operations/dns@master] m1 proxy: Switch to use dbproxy1001 in preparation for b5-eqiad maint

https://gerrit.wikimedia.org/r/509894

Change 510108 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1098 & db1131 for maintenance

https://gerrit.wikimedia.org/r/510108

dbproxy1006 switched over completely. The above patch (plus the db1139 shutdown) will be applied a few hours before the maintenance.

aborrero updated the task description. (Show Details) May 16 2019, 10:58 AM

Change 510108 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1098 & db1131 for maintenance

https://gerrit.wikimedia.org/r/510108

Mentioned in SAL (#wikimedia-operations) [2019-05-16T12:02:58Z] <jynus> stop and shutdown db1098,db1131,db1139 T223126

Mentioned in SAL (#wikimedia-operations) [2019-05-16T12:21:41Z] <godog> stop swift and rsync on ms-be10[16,17,18,32,33] for eqiad B5 pdu replacement - T223126

Cmjohnson closed this task as Resolved. Mon, Jun 10, 6:59 PM
Cmjohnson claimed this task.

This has been completed.