
eqiad: rack a3 pdu swap / failure / replacement
Closed, Resolved, Public

Description

This task will track the swapping of the PDU tower(s) in rack A3-eqiad.

The current PDU tower is malfunctioning: a short caused issues on both side B (wholly offline) and part of side A (one circuit group has defunct outlets).

Chris has onsite spares (2 dual-wide PDU towers with 48 ports per side) to test out and use for replacement in this rack.

Maintenance Window Scheduling

Primary Date: Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT)
Backup Date: Tuesday, 2019-01-22 @ 07:00 EST (12:00 GMT)

Estimated Duration: Up to 2 hours
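
As a quick sanity check on the window times above, the following sketch converts the listed local times to UTC. It assumes EST is a fixed UTC-5 offset (both dates fall in January, outside daylight saving time); the names `EST` and `window_start_utc` are illustrative only and not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is treated as a fixed UTC-5 offset (no DST in January).
EST = timezone(timedelta(hours=-5), "EST")


def window_start_utc(year: int, month: int, day: int, hour: int, minute: int = 0) -> datetime:
    """Return the UTC start time for a maintenance window given in EST."""
    local = datetime(year, month, day, hour, minute, tzinfo=EST)
    return local.astimezone(timezone.utc)


primary = window_start_utc(2019, 1, 17, 7)   # primary date, 07:00 EST
backup = window_start_utc(2019, 1, 22, 7)    # backup date, 07:00 EST
window_end = primary + timedelta(hours=2)    # estimated duration: up to 2 hours

print("primary:", primary.isoformat())
print("backup: ", backup.isoformat())
print("latest expected end (primary):", window_end.isoformat())
```

Service owners in other time zones can swap in their own offset to see when hosts need to be stopped locally.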

Maintenance Window Checklist

The following conditions must be met for this swap:

  • All servers must be taken offline and powered down for the duration of the migration.
  • The old PDU must be removed from the rack, the new PDU installed, and all power migrated over to it.

Side B of A3-eqiad may also have tripped its circuit breaker during the failure, which may require Equinix technicians to reset the breaker in the EQ circuit breaker box.

Servers & Devices in A3-eqiad

The following items are in a3-eqiad: https://netbox.wikimedia.org/dcim/racks/3/

Servers (grouped by service owner when possible):

Analytics:

Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :)
The Thursday time window proposal is fine for me!

analytics1052
analytics1053
analytics1054
analytics1055
analytics1056
analytics1057
analytics1059
analytics1060

cloud:

cloudservices1004 is the hot-spare in the cloudservices100[34] cluster supporting the eqiad1-r region of our OpenStack deploy. It should be fine to perform a clean shutdown and restart. Pinging @aborrero and @GTirloni here as they are the folks from our team most likely to be online to help at 12:00 GMT either day if anything strange happens as a result.

cloudservices1004

traffic:
cp1008 - canary host, has no production traffic, can be cleanly shut down and powered back on after the maint window.

dba:
db1103 - off
db1127 - server not even installed
dbproxy1001 - off
dbproxy1002 - off
dbproxy1003 - off
dbstore1003 - off
pc1004 - Not reachable via ssh, not in use, should be decommissioned (T210969) T213859#4883727.

discovery:

For elastic103[0-5], we should be fine just shutting them down. The theory is that we should be able to lose a full row and not worry too much about it.
That being said, 6 servers is a sizable portion of the cluster, so I'd like to be around when that happens so that I can keep an eye on things.
Note: the Icinga "ElasticSearch health check for shards" is going to raise an alert if not silenced (not paging). I don't think any other alert should be raised, but we'll see.

elastic1030
elastic1031
elastic1032
elastic1033
elastic1034
elastic1035
relforge1001 - clean shutdown in advance of work and power back up afterwards
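
The bracket shorthand used in this description (`elastic103[0-5]`, `cloudservices100[34]`) can be expanded mechanically when building a shutdown list. The sketch below is a simplified, hypothetical helper (`expand_hosts` is not an existing tool): it assumes a single bracket group per name, where `a-b` means an inclusive numeric range and anything else is a set of single characters.

```python
import re

# One bracket group per pattern, e.g. "elastic103[0-5]" or "cloudservices100[34]".
_BRACKET = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_hosts(pattern: str) -> list[str]:
    """Expand a simple bracket pattern into individual host names."""
    match = _BRACKET.match(pattern)
    if match is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]
    prefix, body, suffix = match.group("prefix", "body", "suffix")
    if "-" in body:
        # "0-5" style: inclusive numeric range, zero-padded to the start's width.
        start, end = body.split("-", 1)
        width = len(start)
        values = [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
    else:
        # "34" style: each character is one alternative.
        values = list(body)
    return [f"{prefix}{value}{suffix}" for value in values]


for host in expand_hosts("elastic103[0-5]"):
    print(host)
```

This only covers the two forms that appear in this task; real host-selection tools support richer syntax.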

misc:
ganeti1007: The directions at https://wikitech.wikimedia.org/wiki/Ganeti#Reboot/Shutdown_for_maintenance_a_node can be used for this work.
graphite1003 (just a spare, powered down)
kubernetes1001 (worker can be drained/powered down prior to maintenance)
prometheus1003 (powered down)
radium - (in decom, powered down)
rdb1005 - @jijiki will be around during the maint window for this system

services:
@RobH synced with @Eevans about these. restbase1016 is already offline. The other restbase systems can be logged into via SSH and cleanly shut down just before the maintenance, and then powered back up normally after the window.
restbase1010
restbase1011
restbase1016

Event Timeline

RobH triaged this task as High priority. Jan 15 2019, 8:17 PM
RobH created this task.
RobH updated the task description.
RobH updated the task description. Jan 15 2019, 8:20 PM
RobH updated the task description.
RobH updated the task description. Jan 15 2019, 8:28 PM
RobH updated the task description.
RobH added a subscriber: Eevans.
Marostegui added a subscriber: jcrespo.
RobH updated the task description. Jan 15 2019, 8:36 PM
RobH updated the task description. Jan 15 2019, 8:38 PM
jijiki added a subscriber: jijiki. Jan 15 2019, 8:44 PM

For elastic103[0-5], we should be fine just shutting them down. The theory is that we should be able to lose a full row and not worry too much about it.

That being said, 6 servers is a sizable portion of the cluster, I'd like to be around when that happens so that I can keep an eye on things.

Note: the Icinga "ElasticSearch health check for shards" is going to raise an alert if not silenced (not paging). I don't think any other alert should be raised, but we'll see.

Unless @dcausse or @EBernhardson have any objection, I think we should just shutdown those servers and use that as a validation test that our assumptions are correct.

RobH renamed this task from eqiad: rack a2 pdu swap / failure / replacement to eqiad: rack a3 pdu swap / failure / replacement. Jan 15 2019, 8:55 PM
RobH updated the task description.
RobH updated the task description. Jan 15 2019, 8:58 PM
RobH updated the task description. Jan 15 2019, 9:05 PM
Gehel added a comment. Jan 15 2019, 9:07 PM

relforge1001 can also be cleanly shutdown and restarted. It will crash the relforge cluster, but that cluster is not expected to be highly available. I'll warn the search platform team about it, they are the only users of that cluster.

Change 484572 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] role::eqiad::scb: Switch rdb1006 to redis::misc::master

https://gerrit.wikimedia.org/r/484572

RobH updated the task description. Jan 15 2019, 9:52 PM
RobH updated the task description. Jan 15 2019, 9:56 PM
RobH updated the task description.

cloudservices1004 is the hot-spare in the cloudservices100[34] cluster supporting the eqiad1-r region of our OpenStack deploy. It should be fine to perform a clean shutdown and restart. Pinging @aborrero and @GTirloni here as they are the folks from our team most likely to be online to help at 12:00 GMT either day if anything strange happens as a result.

bd808 updated the task description. Jan 16 2019, 1:53 AM
CDanis added a subscriber: CDanis. Jan 16 2019, 3:21 AM

It's fine to simply shut down prometheus1003. We have a redundant machine prometheus1004 which will continue gathering metrics and answering queries. prometheus1003 will have a gap in its data afterwards but that can't be helped.

Marostegui added a subscriber: Marostegui.

pc1004 can (and should) be powered off. That host is ready for decommissioning; I have not powered it off myself because I am not sure whether Chris is wiping disks or doing something else on it at the moment. See T210969: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2019-01-16T11:02:57Z] <fsero> draining kubernetes1001 for maintenance T213859

elukey added a subscriber: elukey. Jan 16 2019, 11:22 AM

Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :)

The Thursday time window proposal is fine for me!

RobH added a comment. Jan 16 2019, 6:07 PM

Ok for me for the analytics nodes, but I'd need a bit of heads up to properly stop them if possible :)
The Thursday time window proposal is fine for me!

@elukey: Ok, please stop them tomorrow in time for this window, thanks!

RobH updated the task description. Jan 16 2019, 6:07 PM

Change 484872 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103

https://gerrit.wikimedia.org/r/484872

Mentioned in SAL (#wikimedia-operations) [2019-01-17T08:24:55Z] <jijiki> Disabling puppet on rdb1005 and switch redis::misc::master to rdb1006 - T213859

Change 484572 merged by Effie Mouzeli:
[operations/puppet@production] role::eqiad::scb: Switch rdb1006 to redis::misc::master

https://gerrit.wikimedia.org/r/484572

Mentioned in SAL (#wikimedia-operations) [2019-01-17T08:32:47Z] <jijiki> Restarting nutcracker on scb100* for 484572 - T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T08:42:36Z] <jijiki> Enabling puppet on rdb1005 and switch redis::misc::master to rdb1006 - T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T09:24:35Z] <moritzm> power off graphite1003 for later hw maintenance (T213859)

Mentioned in SAL (#wikimedia-operations) [2019-01-17T09:25:36Z] <marostegui> Poweroff dbstore1003 for hw maintenance T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T09:59:29Z] <marostegui> Poweroff dbproxy1001 dbproxy1002 dbproxy1003 for a3 maintenance - T213859

Change 484872 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103

https://gerrit.wikimedia.org/r/484872

Mentioned in SAL (#wikimedia-operations) [2019-01-17T10:04:14Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1103 - T213859 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-01-17T10:09:07Z] <marostegui> Stop MySQL on db1103:3312 and db1103:3314, also poweroff the server - T213859

The databases involved are fully ready for this maintenance. They are all off except pc1004, which is not reachable and has not been powered off (per T213859#4883727), but it can go off at any time; there is no need to wait for DBAs.

Mentioned in SAL (#wikimedia-operations) [2019-01-17T10:30:10Z] <arturo> T213859 icinga downtime cloudservices1004 for 1 day

cloudservices1004 is the hot-spare in the cloudservices100[34] cluster supporting the eqiad1-r region of our OpenStack deploy. It should be fine to perform a clean shutdown and restart. Pinging @aborrero and @GTirloni here as they are the folks from our team most likely to be online to help at 12:00 GMT either day if anything strange happens as a result.

I confirm I'm available. I just downtimed the server and will shut it down now.

Mentioned in SAL (#wikimedia-operations) [2019-01-17T11:16:53Z] <onimisionipe> shutdown elastic103[0-5] to prepare for T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:17:36Z] <jijiki> poweroff rdb1005.eqiad.wmnet before A3 maint - T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:25:05Z] <godog> poweroff restbase1010 / restbase1011 before A3 maint - T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:34:05Z] <gehel> shutting down relforge1001 for PDU swap - T213859

Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:41:13Z] <fsero> poweroff kubernetes1001 - T213859
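
The "Mentioned in SAL" entries above follow a regular shape (channel, UTC timestamp, author, message), so the shutdown sequence can be reconstructed from them. The sketch below parses that shape as it appears on this page; `SalEntry` and `parse_sal_line` are hypothetical names, and the pattern is inferred from these lines rather than from any documented SAL format.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Pattern inferred from the lines on this page, e.g.:
# Mentioned in SAL (#channel) [2019-01-17T11:16:53Z] <author> message
SAL_LINE = re.compile(
    r"^Mentioned in SAL \((?P<channel>#[^)]+)\) "
    r"\[(?P<timestamp>[^\]]+)\] "
    r"<(?P<author>[^>]+)> "
    r"(?P<message>.*)$"
)


class SalEntry(NamedTuple):
    channel: str
    when: datetime
    author: str
    message: str


def parse_sal_line(line: str) -> Optional[SalEntry]:
    """Parse one 'Mentioned in SAL' line; return None if it does not match."""
    match = SAL_LINE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)  # the trailing "Z" denotes UTC
    return SalEntry(match["channel"], when, match["author"], match["message"])


lines = [
    "Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:41:13Z] <fsero> poweroff kubernetes1001 - T213859",
    "Mentioned in SAL (#wikimedia-operations) [2019-01-17T11:16:53Z] <onimisionipe> shutdown elastic103[0-5] to prepare for T213859",
]
entries = sorted(
    (entry for entry in map(parse_sal_line, lines) if entry is not None),
    key=lambda entry: entry.when,
)
for entry in entries:
    print(entry.when.isoformat(), entry.author, "-", entry.message)
```

Sorting by timestamp gives a chronological view of who powered off what before the window opened.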

RobH added a comment. Edited Jan 17 2019, 8:49 PM

Update from IRC sync with @Cmjohnson (in the notes below, "I" is Chris, not me):

  • Verified with each service owner that all servers were depooled and powered off.
  • I ran power cables across to rack A4 so that both asw and asw2-a3-eqiad stayed powered on. This was preferred over taking a chance of the network stack failing. Mark verified the A4 PDU was able to handle the additional power need.
  • The mgmt switch is not redundant and did go down.
  • Once all servers were off and the network switches were safely powered from A4, I removed the B side power first and attempted to pull the PDU away from the corner where it is located so that I could fit the new PDU into place. This turned out to be very problematic: the existing PDU did not come off its bracket, so I had to remove both the PDU and the bracket. Second, I had to re-secure the mounting bracket for the new PDU. After several attempts to work in the confined space, I ended up completely removing the old PDU. Once that was completed, I re-attached the brackets and mounted the new PDU. I then plugged the PDU into the Equinix power ports. I started with side A, which powered on normally, and then side B, which did not power on. I then submitted a ticket with Equinix to verify whether the breaker had tripped and to re-energize side B. While I waited for the Equinix technician, I plugged all the servers and switches into side A on the new PDU without an issue.
  • I updated the SCS with the mgmt IP and made all relevant DNS changes.
  • I updated the PDU with the appropriate configuration (root, SNTP traps, temp/humidity traps, GET, SET, etc.).
  • Not long after, the Equinix tech came by, fixed the breaker, and energized side B. I proceeded to plug in all redundant power.
  • I did not have any issues with the power, but 2 of the analytics boxes had bad disks, which caused an issue during the boot process.
  • restbase1016 was already down for troubleshooting a bad DIMM and remains down at this time.
RobH closed this task as Resolved. Jan 17 2019, 8:55 PM
RobH claimed this task.

So, here are the follow-up tasks for the three failed hosts (all three tasks already existed):

analytics1054 T213038
analytics1056 T214057
restbase1016 T212418