
swap a2-eqiad PDU with on-site spare
Closed, Resolved (Public)

Description

This task will track the work required to prepare/stage, and then swap out the failed PDU tower in a2-eqiad. Details are as follows:

  • ps1-a2-eqiad is a dual input (tower A and B combined in a single PDU chassis) with 24 ports per tower.
  • ps1-a2-eqiad has had failures on its phases, caused either by the PDU itself failing or by a phase imbalance that cannot be corrected because each tower has only 24 power plugs.
    • Chris will swap out the existing/failing ps1-a2-eqiad and put in a spare dual-wide PDU with 42 ports per tower. This isn't as ideal as a brand new PDU (via T210776), but a new PDU has a 30 day lead time.
  • All systems in a2-eqiad have to be reviewed, as downtime could result.
    • All precautions will be taken to try to migrate PDUs without downtime, but nothing is a certainty when dealing with the power feeds into our rack.
  • List all systems in a2-eqiad, check with service owners, and schedule a downtime date before Chris leaves for all hands.

Maintenance Window Checklist

  • @Cmjohnson stages the new PDU adjacent to or in the rack, unplugs the failed side of the existing PDU, and plugs in one side of the replacement PDU
  • @Cmjohnson migrates the plugs from the now de-energized side of the old PDU to the replacement PDU, returning redundant power to all devices
  • @Cmjohnson de-energizes the remaining side of the old PDU, energizes the replacement PDU fully, and migrates all remaining power to the new PDU
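
The three checklist steps can be sketched as a small simulation, to show where a device could lose power. This is only an illustrative model, not a tool used for this work: the side labels (`old-A`, `old-B`, `new-A`, `new-B`), the assumption that side A is the failed one, and the two host names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    plugs: list  # PDU sides this device's power supplies are plugged into


def unpowered(devices, live):
    """Names of devices with no power supply on a live PDU side."""
    return [d.name for d in devices if not any(p in live for p in d.plugs)]


def move_plugs(devices, src, dst):
    """Re-plug every power cord found on side `src` into side `dst`."""
    for d in devices:
        d.plugs = [dst if p == src else p for p in d.plugs]


def simulate_swap(devices):
    """Walk the checklist and record which devices are dark after each step."""
    log = []
    # Step 1: the failed side (assumed old-A) is unplugged and one side of
    # the replacement PDU is fed instead.
    live = {"old-B", "new-A"}
    log.append(("step 1", unpowered(devices, live)))
    # Step 2: cords on the dead old side move to the live replacement side.
    move_plugs(devices, "old-A", "new-A")
    log.append(("step 2", unpowered(devices, live)))
    # Step 3a: the remaining old side is de-energized and the replacement
    # PDU is fully energized.
    live = {"new-A", "new-B"}
    log.append(("step 3a", unpowered(devices, live)))
    # Step 3b: the remaining cords move to the replacement PDU.
    move_plugs(devices, "old-B", "new-B")
    log.append(("step 3b", unpowered(devices, live)))
    return log


# Hypothetical hosts: one with a cord on each side, one with a single cord.
devices = [
    Device("dual-psu-host", ["old-A", "old-B"]),
    Device("single-psu-host", ["old-B"]),
]
for step, down in simulate_swap(devices):
    print(step, "unpowered:", down)
```

In this model a host with one cord on each side stays up throughout, while a host fed only from the side de-energized in the last step is dark until its cord is moved.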

Maintenance Window Scheduling

Primary Date: Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT)
Backup Date: Tuesday, 2019-01-22 @ 07:00 EST (12:00 GMT)

Estimated Duration: Up to 2 hours
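
The EST to GMT conversion for both dates can be double-checked with a short sketch. It assumes EST is a fixed UTC-5 offset (true in January); the helper name `to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# EST is modeled as a fixed UTC-5 offset (no daylight saving in January).
EST = timezone(timedelta(hours=-5), "EST")


def to_utc(year, month, day, hour, minute=0):
    """Convert a wall-clock time in EST to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=EST)
    return local.astimezone(timezone.utc)


primary = to_utc(2019, 1, 17, 7)
backup = to_utc(2019, 1, 22, 7)
print(primary.isoformat())  # 2019-01-17T12:00:00+00:00
print(backup.isoformat())   # 2019-01-22T12:00:00+00:00
```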

Servers & Devices in A2-eqiad

https://netbox.wikimedia.org/dcim/racks/2/

Network Devices:
The primary access switch for this row needs to be cross-cabled, just in case.
asw2-a2-eqiad
asw-a2-eqiad
msw-a2-eqiad

Servers:

analytics

conf1001 - not used anymore, in decom phase
kafka1012 - please cross-cable one of the three kafka machines in this rack, doesn't matter which of the 3
kafka1013 - please cross-cable one of the three kafka machines in this rack, doesn't matter which of the 3
kafka1023 - please cross-cable one of the three kafka machines in this rack, doesn't matter which of the 3
kafka-jumbo1002 - needs a heads-up (about 10 mins) to gracefully stop kafka on it
an-worker1078 - can go down anytime, but a little heads-up would be good to gracefully shut down
an-worker1079 - can go down anytime, but a little heads-up would be good to gracefully shut down
db1107 - shared with Data Persistence - this needs time to 1) stop eventlogging, 2) stop replication from db1108, 3) stop mysql gracefully

dba team systems

db1074 - replication slave, DBA team will stop mysql before work and restart after work ends
db1075 - no longer a master (T213858); replication slave, DBA team will stop mysql before work and restart after work ends
db1079 - replication slave, DBA team will stop mysql before work and restart after work ends
db1080 - replication slave, DBA team will stop mysql before work and restart after work ends
db1081 - replication slave, DBA team will stop mysql before work and restart after work ends
db1082 - replication slave, DBA team will stop mysql before work and restart after work ends
es1011 - replication slave, DBA team will stop mysql before work and restart after work ends
es1012 - replication slave, DBA team will stop mysql before work and restart after work ends

other

cloudelastic1001 - not yet in use, can leave in place during pdu swap (no extra precautions needed)
ms-be1019 - can go down anytime, please issue a poweroff
ms-be1044 - can go down anytime, please issue a poweroff
ms-be1045 - can go down anytime, please issue a poweroff
tungsten - role(xhgui::app) - Performance-Team - @Gilles/@Krinkle confirm this can stay cabled normally; downtime wouldn't be problematic as long as it's not for longer than the window.

Event Timeline

RobH triaged this task as High priority. Jan 14 2019, 6:43 PM
RobH created this task.
RobH updated the task description. Jan 14 2019, 6:51 PM
RobH updated the task description. Jan 14 2019, 6:54 PM
RobH updated the task description. Jan 14 2019, 6:57 PM
elukey updated the task description. Jan 14 2019, 7:00 PM
elukey updated the task description. Jan 14 2019, 7:03 PM
RobH updated the task description. Jan 14 2019, 7:09 PM
RobH updated the task description.
RobH updated the task description. Jan 14 2019, 7:11 PM
RobH added a project: DBA. Jan 14 2019, 7:13 PM
RobH updated the task description.
RobH edited subscribers, added: Gilles, Krinkle; removed: Banyek.
RobH updated the task description. Jan 14 2019, 7:18 PM
RobH updated the task description. Jan 14 2019, 7:35 PM
RobH updated the task description.
RobH added a comment. Jan 14 2019, 7:51 PM

Please note the work has now been scheduled for Thursday, 2019-01-17 @ 07:00 EST (12:00 GMT). As both the DBA team and the Analytics team have expressed interest in stopping/restarting services on their servers, they should have those systems ready for work by the start of the maintenance window.

RobH updated the task description. Jan 14 2019, 7:57 PM

@RobH what's your plan with db1075 (the db master)?

RobH added a comment. Edited Jan 14 2019, 7:59 PM

@RobH what's your plan with db1075 (the db master)?

@Cmjohnson will take 1 of the 2 power supplies and cross-cable it into the adjacent rack, so it won't be wholly dependent on the power within A2 to stay online. This way when he moves PDU feeds around, db1075 will always have one power supply plugged into a1 or a3 instead.

All cross-cabling will be done in this manner (including the primary asw and one of the kafka systems)
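
As a rough illustration of why cross-cabling protects a host, the sketch below checks whether a host keeps at least one power supply outside the rack being worked on. The cabling map is hypothetical: db1075's second feed is assumed to land in a3 (it could equally be a1), and db1074 is shown with both supplies in a2 only for contrast.

```python
def survives_outage(psu_racks, rack_down):
    """True if at least one power supply is fed from a rack other than rack_down."""
    return any(rack != rack_down for rack in psu_racks)


# Hypothetical cabling map: host -> rack feeding each of its two power supplies.
cabling = {
    "db1075": ("a2", "a3"),  # cross-cabled into an adjacent rack (a3 assumed)
    "db1074": ("a2", "a2"),  # both supplies fed from a2
}

for host, racks in cabling.items():
    print(host, "stays up if a2 loses power:", survives_outage(racks, "a2"))
```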

Awesome! Thanks for clarifying!

fgiunchedi updated the task description. Jan 15 2019, 9:21 AM
RobH added a comment. Jan 15 2019, 5:06 PM

@fgiunchedi: Thanks for updating about the ms-be systems! I see you added they can be gracefully powered down, can we just power them back up and ensure puppet runs post-maintenance? If not, should we simply leave powered off for you?

Please advise,

RobH added a comment. Jan 15 2019, 5:07 PM

@fgiunchedi: Thanks for updating about the ms-be systems! I see you added they can be gracefully powered down, can we just power them back up and ensure puppet runs post-maintenance? If not, should we simply leave powered off for you?
Please advise,

Chatted with him via irc:

09:02 < robh> : so power off before work and just power back up post work?
09:05 < godog> : hi! yeah exactly, out of caution really to avoid an unclean shutdown if we can possibly avoid it, in reality even power went out we could live with it
09:06 < godog> : I'll be around during the maint window too

Peachey88 updated the task description. Jan 17 2019, 7:35 AM

db1075, s3 primary master, was failed over to db1078 which is in row C.

@RobH is this happening today too along with a3 maintenance or is this finally moved to Tue 22nd?

Change 484987 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts for a2 rack maintenance

https://gerrit.wikimedia.org/r/484987

Change 484987 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool hosts for a2 rack maintenance

https://gerrit.wikimedia.org/r/484987

Mentioned in SAL (#wikimedia-operations) [2019-01-17T10:54:46Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool DBs on A2 rack T213748 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-01-17T10:54:51Z] <marostegui> Stop MySQL on db1082 db1081 db1080 db1079 db1075 db1074 es1012 es1011 - T213748

Mentioned in SAL (#wikimedia-operations) [2019-01-17T11:43:03Z] <marostegui> Poweroff db1082 db1081 db1080 db1079 db1075 db1074 es1012 es1011 - T213748

All the systems owned by the DBAs are now off.

Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:11:38Z] <godog> poweroff ms-be1019 / ms-be1044 / ms-be1045 before A2 maint - T213748

mforns moved this task from Incoming to Radar on the Analytics board. Jan 17 2019, 6:03 PM
RobH closed this task as Resolved. Jan 17 2019, 8:55 PM
RobH claimed this task.

Synced up with Chris via IRC:

All systems were able to come back up within a2 without incident. The spare PDU is in place, but it will also be replaced when rows A and B have PDU refresh this fiscal.

Mentioned in SAL (#wikimedia-operations) [2019-01-18T06:29:08Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool DBs on A2 rack T213748 (duration: 00m 47s)

faidon added a subscriber: faidon. Jan 18 2019, 4:02 PM

Synced up with Chris via IRC:
All systems were able to come back up within a2 without incident. The spare PDU is in place, but it will also be replaced when rows A and B have PDU refresh this fiscal.

I don't think this is right, I think we didn't really replace A2's PDU with a spare but just replaced a fuse after all. @Cmjohnson can confirm.

RobH reopened this task as Open. Jan 18 2019, 4:05 PM
RobH reassigned this task from RobH to Cmjohnson.

Synced up with Chris via IRC:
All systems were able to come back up within a2 without incident. The spare PDU is in place, but it will also be replaced when rows A and B have PDU refresh this fiscal.

I don't think this is right, I think we didn't really replace A2's PDU with a spare but just replaced a fuse after all. @Cmjohnson can confirm.

Correct, it was only the fuse

Marostegui closed this task as Resolved. Jan 26 2019, 3:07 AM

I believe there is nothing else pending here; this was re-opened just to get an answer from Chris, which we now have.
Going to close this. If someone feels it should remain open, feel free to reopen it!