
a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
Closed, Resolved (Public)

Description

This task will track the migration of the ps1 and ps2 to be replaced with new PDUs in rack A2-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two towers, energize, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
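
The sequence above relies on every device staying up on one feed while the other is moved. A minimal Python sketch of that pre-check follows. The `Device` structure, the feed names "A" and "B", and the sample hosts are illustrative assumptions, not data taken from this task.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    # Hypothetical model: the feeds on which this device has a working PSU.
    working_feeds: frozenset


def at_risk(devices, feed_to_cut):
    """Return names of devices left with no working feed once feed_to_cut goes dark."""
    return [d.name for d in devices if not (d.working_feeds - {feed_to_cut})]


# Illustrative states only (assumed, not measured on the rack).
rack = [
    Device("example-host-1", frozenset({"A", "B"})),
    Device("example-host-2", frozenset({"A"})),  # e.g. a host with one broken PSU
]

# Hosts listed here would need to be powered off or depooled before moving feed A.
print(at_risk(rack, "A"))
```

A host with a single working PSU, or one whose PSUs do not fail over, shows up in this list and should be handled before the corresponding side is de-energized.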

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| asw2-a2-eqiad | asw | @ayounsi | |
| conf1001 | zookeeper/etcd | serviceops | to be decommed |
| kafka1023 | kafka | Analytics | to be decommed |
| kafka1013 | kafka | Analytics | to be decommed |
| kafka1012 | kafka | Analytics | to be decommed |
| db1107 | eventlogging db | Analytics | please ping analytics to stop data flowing to the db temporarily |
| tungsten | | | |
| cloudelastic1001 | | Discovery-Search | @Gehel good to go |
| kafka-jumbo1002 | kafka | Analytics | ok to proceed |
| ms-be1045 | ms-be | @fgiunchedi | poweroff / poweron |
| ms-be1044 | ms-be | @fgiunchedi | poweroff / poweron |
| an-worker1079 | analytics | Analytics | |
| db1082 | db | DBA | @Marostegui to depool this host |
| db1081 | db | DBA | @Marostegui to depool this host |
| db1080 | db | DBA | @Marostegui to depool this host |
| db1079 | db | DBA | @Marostegui to depool this host |
| db1075 | db | DBA | @Marostegui to depool this host |
| db1074 | db | DBA | @Marostegui to depool this host, needs to be powered off as it has a broken PSU |
| ms-be1019 | ms-be | @fgiunchedi | poweroff / poweron |
| es1011 | external store | DBA | @Marostegui to depool this host |
| an-worker1078 | analytics | Analytics | ok to proceed |

Event Timeline

RobH created this task. Jul 2 2019, 7:58 PM
RobH updated the task description.
RobH triaged this task as Normal priority. Jul 2 2019, 8:06 PM
RobH updated the task description.
RobH updated the task description.
RobH updated the task description.
RobH added a subscriber: ayounsi.
RobH added a subscriber: fgiunchedi.
elukey added a subscriber: elukey. Jul 16 2019, 9:57 AM (edited)

The kafka10XX hosts are going to be decommed in T226517, so not a concern. The other hosts can go down without horrible consequences :)

I assume that you'll do one rack at a time, but asking anyway: in T226782 (a1) there is another kafka-jumbo host scheduled for maintenance, so it would be great if both of them weren't at risk of losing power at the same time.

Marostegui added a subscriber: Marostegui. Jul 22 2019, 2:53 PM (edited)

db1081 and db1075 are primary masters, so if we are not fully sure that no power will be lost, I'd rather do other racks first.
Racks on row A that are good to go:

A3: has one active dbproxy (dbproxy1001), which I could fail over tomorrow; after that it should be good to go.
A4: good to go
A5: good to go if done before Thursday 30th as that day db1128 will become a master (T228243)
A7: good to go

From row B:
B1: good to go
B2: good to go after Thursday 25th, as we are failing over that host that day (T228243)
B3: It has m5 master which is mostly used by wikitech and cloud team, so you might want to ping them. From the DBAs side it is good to go.
B4: good to go
B6: good to go
B7: good to go
B8: it has m2 master which is mostly used by recommendationsapi, otrs, debmonitor, so if those stakeholders are ok, that is fine from a DBA point of view. Tags should be: OTRS Recommendation-API SRE-tools

Marostegui updated the task description. Jul 22 2019, 3:01 PM
Marostegui updated the task description.

Change 524805 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Failover dbproxy1001 to dbproxy1006

https://gerrit.wikimedia.org/r/524805

Change 524805 merged by Marostegui:
[operations/dns@master] wmnet: Failover dbproxy1001 to dbproxy1006

https://gerrit.wikimedia.org/r/524805

akosiaris updated the task description. Jul 23 2019, 6:39 AM
akosiaris added a subscriber: akosiaris.

conf1001 is fine to power down (no depool necessary). Perform all wanted actions and then power it on; it will repool itself automatically.

fgiunchedi updated the task description. Jul 23 2019, 9:24 AM
RobH moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 24 2019, 7:18 PM
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Jul 26 2019, 1:37 PM
RobH removed RobH as the assignee of this task. Aug 14 2019, 4:52 PM
wiki_willy renamed this task from a2-eqiad pdu refresh to a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC). Aug 15 2019, 5:30 PM

We have two masters on this rack: db1075 (s3) and db1104 (s4).
@wiki_willy how confident are you guys that this won't have an unexpected downtime? (cc @jcrespo)

Marostegui updated the task description. Aug 19 2019, 10:32 AM
Gehel updated the task description. Aug 19 2019, 4:15 PM
Gehel added a subscriber: Gehel.
Marostegui updated the task description. Tue, Sep 24, 6:29 AM
Marostegui updated the task description. Thu, Sep 26, 5:14 AM
elukey updated the task description. Wed, Oct 2, 6:26 AM

Change 541148 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool es1011

https://gerrit.wikimedia.org/r/541148

Change 541148 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool es1011

https://gerrit.wikimedia.org/r/541148

Mentioned in SAL (#wikimedia-operations) [2019-10-07T06:25:14Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1011 T227138 (duration: 01m 10s)

db1074 has a broken PSU and the new PSU is scheduled to arrive on the 10th (T233567#5544445), so I will power off this host, and it will need to be powered back on afterwards by @Cmjohnson or @Jclark-ctr.

Marostegui updated the task description. Mon, Oct 7, 7:32 AM
Marostegui mentioned this in Unknown Object (Task). Mon, Oct 7, 7:54 AM

Mentioned in SAL (#wikimedia-operations) [2019-10-08T05:41:28Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1082 db1081 db1080 db1079 db1075 db1074 for PDU maintenance T227138', diff saved to https://phabricator.wikimedia.org/P9254 and previous config saved to /var/cache/conftool/dbconfig/20191008-054127-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-10-08T06:48:45Z] <marostegui> Stop MySQL on es011 db1082 db1081 db1080 db1079 db1075 db1074 (replication lag will appear on labs for s5) for on-site maintenance T227138

@Cmjohnson the following hosts are good to go: db1082 db1081 db1080 db1079 db1075 db1074 es1011
Please note:

  • db1074 has been powered off as it has a broken PSU, so please turn it back ON once the maintenance is done
  • db1107 is owned by Analytics, so please ping them before working with it unless they say otherwise.

Mentioned in SAL (#wikimedia-operations) [2019-10-08T12:27:11Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool es1012 T227138 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2019-10-08T12:38:40Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Repool es1012 T227138 (duration: 00m 51s)

Cmjohnson updated the task description. Tue, Oct 8, 5:45 PM

The PDU swap is over. We did lose an-worker1079 because its PSUs did not fail over. Everything is cabled and the towers are linked together. The new PDU still needs updating.

wiki_willy reassigned this task from Cmjohnson to RobH. Tue, Oct 8, 6:16 PM

Re-assigning to @RobH to complete install/updating of new PDU. Thanks, Willy

RobH reassigned this task from RobH to Jclark-ctr. Wed, Oct 9, 4:42 PM

I've just attempted to connect to ps1-a2-eqiad via serial, and failed. I'll outline the steps needed to fix this below. After coordinating with @wiki_willy, we determined it best to assign this to @Jclark-ctr (though @Cmjohnson is also able to do it; either can steal this task as needed).

Please note these steps assume @Jclark-ctr has his shell access (in the dc ops group) working on his laptop (it's active on the cluster). If he doesn't have his config set up for this yet, please ping me on IRC and I'll assist with the ssh config/setup.

I'll assume John is doing this, so I'll outline the full steps needed to fully fix and test the fix before handing this back to me.

  • ps1-a2-eqiad's serial console cable (orange) should be connected to the PDU tower at one end and to port 2 on scs-a8-eqiad (the opengear console in rack A8) at the other.
  • Once it is connected, you can test the serial connection as follows:
    • ssh root@scs-a8-eqiad.mgmt.eqiad.wmnet and use the management scs password.
    • Once connected, run pmshell and hit enter. It will list all ports; pick port 2 and hit enter.
    • It should prompt with a login screen. If it doesn't, the serial connection is failing.

If the serial connection is failing, then the orange patch cable may need to have its ends re-crimped or be replaced. If this patch uses the black in-line adapter (on the PDU side of the orange cable), then you can use a standard orange patch cable. If it doesn't have an in-line adapter, you'll have to make a special cable. Please coordinate with @RobH before you do so, as we may just temporarily use the adjacent rack's serial connection to get this set up quickly.
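
The follow-up logic in the paragraph above can be written down as a small decision function. This is only a sketch in Python: the function name, its two boolean inputs, and the returned wording are hypothetical, chosen to mirror the instructions rather than any existing tooling.

```python
def next_step(login_prompt_seen: bool, has_inline_adapter: bool) -> str:
    """Map the two observations from the procedure to a follow-up action."""
    if login_prompt_seen:
        # pmshell port 2 showed a login screen, so the serial path is fine.
        return "serial connection works"
    if has_inline_adapter:
        # The adapter handles the pinout, so a standard patch cable is enough.
        return "re-crimp or replace with a standard orange patch cable"
    # No adapter: a special cable is needed, but check with RobH first.
    return "coordinate with RobH before making a special cable"


print(next_step(login_prompt_seen=False, has_inline_adapter=True))
```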

RobH added a comment. Thu, Oct 10, 3:30 PM

@Jclark-ctr and I went through the following to fix this issue:

  • tested (failed) scs-a8-eqiad port 2 to ps1-a2-eqiad connection
  • tested (works) scs-a8-eqiad:3 to ps1-a3-eqiad
  • moved ps1-a3-eqiad connection to ps1-a2-eqiad and it worked (so the PDU serial is functional)
  • moved the working connection from port 3 to port 2 on the scs, still worked (scs is functional)
  • determined it was a bad cable between scs-a8-eqiad:port2 and ps1-a2-eqiad.
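
The swap tests above are an elimination argument: any component that took part in a working connection is cleared, and whatever remains from the failing connection is the suspect. A minimal Python sketch of that reasoning follows; the labels `cable-a2` and `cable-a3` are hypothetical names for the two serial runs, introduced only for illustration.

```python
def isolate_fault(tests):
    """tests: list of (components, passed). Return components never seen in a passing test."""
    cleared = set()
    for components, passed in tests:
        if passed:
            cleared |= set(components)
    suspects = set()
    for components, passed in tests:
        if not passed:
            suspects |= set(components) - cleared
    return suspects


# The four tests from the list above, in order.
tests = [
    ({"scs-a8:port2", "cable-a2", "ps1-a2 serial"}, False),
    ({"scs-a8:port3", "cable-a3", "ps1-a3 serial"}, True),
    ({"scs-a8:port3", "cable-a3", "ps1-a2 serial"}, True),
    ({"scs-a8:port2", "cable-a3", "ps1-a2 serial"}, True),
]

print(isolate_fault(tests))
```

With these inputs only the original cable between scs-a8-eqiad port 2 and ps1-a2-eqiad is left uncleared, which matches the conclusion reached on site.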

The end result is that I'll make a sub-task for that repair. Since we had working serial during the testing, we went ahead and set up the network and unblocked this deployment.

RobH closed this task as Resolved. Thu, Oct 10, 3:40 PM

Please note that with the temp serial run, we went ahead and set up ps1-a2-eqiad.

The existing serial needs to be fixed though.

RobH reopened this task as Open. Thu, Oct 10, 8:04 PM (edited)

I should not have resolved this so quickly, as it needs a few other things handled.

I just went ahead and renamed the old PDU to its asset tag and updated the hostname of the new PDU to ps1-a2-eqiad in netbox.

However, I did not add in ps2-a2-eqiad, as I cannot tell what its asset tag or serial number is from polling the device. (I was able to get the serial for ps1, and update netbox.)

Please update ps2-a2-eqiad with whatever PDU link tower/serial number/asset tag is there. IIRC you already put all of the PDU towers into netbox with serial number + asset tag, so you just need to update the one installed with the hostname and location.

Jclark-ctr closed this task as Resolved. Fri, Oct 11, 11:49 PM

updated ps2-a2-eqiad and location set to active.