
a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC)
Open, Normal, Public

Description

This task tracks the replacement of ps1 and ps2 with new PDUs in rack A2-eqiad.

Each server & switch will need to have potential downtime scheduled, since this will be a live power change of the PDU towers.

These racks have a single tower for the old PDU (with an A and a B side), while the new PDUs have independent A and B towers.

  • Schedule downtime for the entire list of switches and servers.
  • Wire up one of the two new towers, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
  • Once the new PDU tower is confirmed online, move on to the next steps.
  • Wire up the remaining tower, energize it, and relocate power to it from the existing/old PDU tower (now de-energized).
  • Confirm that the entire list of switches, routers, and servers have had their power restored from the new PDU tower.
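
The steps above can be sketched as a small dry-run model. This is only an illustration under stated assumptions, not tooling used for this task: the device names, the source labels (`old-A`, `new-A`, and so on), and the idea of a single-PSU device are all hypothetical. It shows why downtime is scheduled for everything: a device fed from both sides rides through each swap, while a device fed from only one side drops while its cord is moved.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    # PSU side ("A"/"B") -> power source name, or None while a cord is unplugged.
    feeds: dict


def is_powered(device, energized):
    """A device is up while at least one of its feeds is on an energized source."""
    return any(src in energized for src in device.feeds.values())


def swap_side(devices, energized, side):
    """Move one side from the old tower to the new one.

    Returns the names of devices that lost power while their cord was moved.
    """
    old, new = f"old-{side}", f"new-{side}"
    energized.add(new)  # wire up and energize the new tower
    dropped = []
    for dev in devices:
        if side not in dev.feeds:
            continue
        dev.feeds[side] = None  # cord pulled from the old tower
        if not is_powered(dev, energized):
            dropped.append(dev.name)
        dev.feeds[side] = new  # cord plugged into the new tower
    energized.discard(old)  # the old side is now de-energized
    return dropped


def refresh(devices):
    """Run both swaps, confirming all devices are powered after each one."""
    energized = {"old-A", "old-B"}
    report = {}
    for side in ("A", "B"):
        report[side] = swap_side(devices, energized, side)
        # Confirmation step: do not continue unless everything is back up.
        assert all(is_powered(d, energized) for d in devices)
    return report


if __name__ == "__main__":
    # Hypothetical devices, not entries from the list in this task.
    devices = [
        Device("dual-psu-host", {"A": "old-A", "B": "old-B"}),
        Device("single-psu-host", {"A": "old-A"}),
    ]
    print(refresh(devices))
```

In this model only the hypothetical single-feed device is reported as dropped, and only during the A-side swap. Real outcomes also depend on cabling and PSU health, which the sketch does not capture.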

List of routers, switches, and servers

| device | role | SRE team coordination | notes |
| ----- | ----- | ----- | ----- |
| asw2-a2-eqiad | asw | @ayounsi | |
| conf1001 | zookeeper/etcd | serviceops | |
| kafka1023 | kafka | Analytics | |
| kafka1013 | kafka | Analytics | |
| kafka1012 | kafka | Analytics | |
| db1107 | eventlogging db | Analytics | |
| tungsten | | | |
| cloudelastic1001 | | Discovery-Search | @Gehel good to go |
| kafka-jumbo1002 | kafka | Analytics | |
| ms-be1045 | ms-be | @fgiunchedi | poweroff / poweron |
| ms-be1044 | ms-be | @fgiunchedi | poweroff / poweron |
| an-worker1079 | analytics | Analytics | |
| db1082 | db | DBA | @Marostegui to depool this host |
| db1081 | db commons primary master | DBA | |
| db1080 | db | DBA | @Marostegui to depool this host |
| db1079 | db | DBA | @Marostegui to depool this host |
| db1075 | db s3 primary master | DBA | |
| db1074 | db | DBA | @Marostegui to depool this host |
| ms-be1019 | ms-be | @fgiunchedi | poweroff / poweron |
| es1011 | external store | DBA | @Marostegui to depool this host |
| an-worker1078 | analytics | Analytics | |

Event Timeline

RobH created this task. Jul 2 2019, 7:58 PM
RobH updated the task description.
RobH triaged this task as Normal priority. Jul 2 2019, 8:06 PM
RobH updated the task description.
RobH updated the task description.
RobH updated the task description.
RobH added a subscriber: ayounsi.
RobH added a subscriber: fgiunchedi.
elukey added a subscriber: elukey. Jul 16 2019, 9:57 AM

The kafka10XX hosts are going to be decommed in T226517, so not a concern. The other hosts can go down without horrible consequences :)

I assume that you'll do one rack at a time, but asking anyway: in T226782 (a1) there is another kafka-jumbo host scheduled for maintenance, so it would be great if both of them weren't at risk of losing power at the same time.

Marostegui added a subscriber: Marostegui. Jul 22 2019, 2:53 PM

db1081 and db1075 are primary masters, so if we are not fully sure that no power will be lost, I'd rather do other racks first.
Racks on row A that are good to go:

A3: has one active dbproxy (dbproxy1001), which I could fail over tomorrow; after that it should be good to go.
A4: good to go
A5: good to go if done before Thursday 30th as that day db1128 will become a master (T228243)
A7: good to go

From row B:
B1: good to go
B2: good to go after Thursday 25th, as we are failing over that host that day (T228243)
B3: It has m5 master which is mostly used by wikitech and cloud team, so you might want to ping them. From the DBAs side it is good to go.
B4: good to go
B6: good to go
B7: good to go
B8: it has m2 master which is mostly used by recommendationsapi, otrs, debmonitor, so if those stakeholders are ok, that is fine from a DBA point of view. Tags should be: OTRS Recommendation-API SRE-tools

Marostegui updated the task description. Jul 22 2019, 3:01 PM
Marostegui updated the task description.

Change 524805 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Failover dbproxy1001 to dbproxy1006

https://gerrit.wikimedia.org/r/524805

Change 524805 merged by Marostegui:
[operations/dns@master] wmnet: Failover dbproxy1001 to dbproxy1006

https://gerrit.wikimedia.org/r/524805

akosiaris updated the task description. Jul 23 2019, 6:39 AM
akosiaris added a subscriber: akosiaris.

conf1001 is fine to power down (no depool necessary). Perform all the required actions and then power it back on; it will repool itself automatically.

fgiunchedi updated the task description. Jul 23 2019, 9:24 AM
RobH moved this task from Backlog to High Priority Task on the ops-eqiad board. Wed, Jul 24, 7:18 PM
RobH moved this task from High Priority Task to Blocked on the ops-eqiad board. Fri, Jul 26, 1:37 PM
RobH removed RobH as the assignee of this task. Wed, Aug 14, 4:52 PM
wiki_willy renamed this task from a2-eqiad pdu refresh to a2-eqiad pdu refresh (Tuesday 10/8 @11am UTC). Thu, Aug 15, 5:30 PM

We have two masters on this rack: db1075 (s3) and db1104 (s4).
@wiki_willy how confident are you guys that this won't cause unexpected downtime? (cc @jcrespo)

Marostegui updated the task description. Mon, Aug 19, 10:32 AM
Gehel updated the task description. Mon, Aug 19, 4:15 PM
Gehel added a subscriber: Gehel.