
eqiad row C/D Observability host migrations
Closed, Resolved, Public

Description

The network stacks in eqiad rows C and D are being upgraded to switches that are all 10G-capable. As part of this migration, every system on the old switches must be moved to the new switch stack.

In previous migrations, we've stepped through the racks one by one, requiring each sub-team to be present for all affected hosts on the day of the migration. In an effort to better scale with the needs and schedules of multiple teams, we're planning to do this migration slightly differently. Rather than setting a single date for each rack, we're providing each sub-team with a listing of all its affected hosts, and that sub-team can then provide feedback on the priority and scheduling of the host migrations.

Scheduling Options and Considerations:

  • Provide priority groups for the hosts below, and we can move group 1, then 2, etc...
  • Provide specific dates and times for the migrations and we can coordinate the migration of the required host(s)
  • A mix of the above: easier hosts can go into priority groups, while high-priority or critical hosts get specific dates and times.

The checklist of migration steps for each host is still being developed and won't be pasted into each host's task in advance of the move (if there is an adjustment, that is a lot of tasks to update).

The host list is also available on the Google Sheet listing of all affected hosts.

Host(s) List:
alert1002 C7
kafka-logging1002 C2
kafka-logging1003 D4
logging-hd1003 D7
logstash1034 C2
logstash1035 D4
mwlog1002 C6
prometheus1007 D7
prometheus1008 C7
titan1002 D2

Details

Other Assignee
colewhite

Event Timeline

RobH assigned this task to herron.
RobH added subscribers: colewhite, herron.

@herron or @colewhite (not sure which of you is best to handle this, please reassign as needed!)

I'm looking to get some feedback for the scheduling of the above host list for migration from the old to new switches in eqiad c/d in the latter half of October. Please review the above details and provide feedback/questions.

Thanks in advance!

Sent a followup via email to Cole and Keith today:

Keith / Cole,

I assigned https://phabricator.wikimedia.org/T405946 over to you both for feedback but it might have been easily missed in the phab notifications and/or one of you may not be the correct person to handle this.

We need to move the network ports on 9 observability hosts, many of which are logging hosts. We're looking for input on how to best migrate the network port (expected to be a few minutes of connectivity loss) for each of these hosts. The planned start date for this work is November 1st, and we'd like to know how best to proceed before that date.

Would you be able to look at this task and provide feedback, or suggest someone in your team to work with me on this coordination?

Thanks in advance,

Added details to the spreadsheet, thanks!

Please note this migration has shifted from an Oct 15th start date to a Nov 1st start date.

RobH triaged this task as High priority. (Nov 19 2025, 10:57 PM)

@herron,

We've migrated 9 of the 10 observability hosts. That leaves only alert1002, which the notes say will require scheduling. In addition to picking a date, can you also provide any steps and details I'll need to follow on the day of the migration?

Before each port migration I put the host into a 10-minute Icinga maintenance mode, and then we perform a few steps with network and Netbox scripts before either John or Valerie moves the switch-port side of the network connection from the old switch to the new one. I ping the host just before, throughout, and for 30 seconds after the port migration to ensure the connection is up and stable.
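As an illustration of the ping check described above, here is a minimal Python sketch. It is not the tooling actually used for these migrations: the function names (`ping_once`, `watch`, `longest_outage`, `stable_tail`) are hypothetical, and the `ping -c 1 -W <seconds>` flags assume Linux iputils `ping`. The probe is passed in as a callable so the monitoring logic can be exercised without touching the network.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to `host` is answered.

    Assumes Linux iputils ping, where -W is the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watch(probe, duration_s, interval_s=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call probe() every interval_s seconds for duration_s seconds.

    Returns a list of (elapsed_seconds, ok) samples.
    """
    start = clock()
    samples = []
    while True:
        elapsed = clock() - start
        if elapsed >= duration_s:
            break
        samples.append((elapsed, probe()))
        sleep(interval_s)
    return samples


def longest_outage(samples):
    """Length of the longest run of consecutive failed probes."""
    longest = current = 0
    for _, ok in samples:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def stable_tail(samples, tail_s=30):
    """True if every probe in the final tail_s seconds succeeded."""
    if not samples:
        return False
    end = samples[-1][0]
    return all(ok for elapsed, ok in samples if elapsed >= end - tail_s)
```

A run against a real host might look like `samples = watch(lambda: ping_once("alert1002"), duration_s=120)`, followed by `longest_outage(samples)` and `stable_tail(samples)`. The hostname and the 120-second window are only examples.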

Potential Migration Dates:

We don't want to move anything the day before a holiday or weekend, as it doesn't allow for a followup fix if anything strange occurs. Additionally, I'll be out of the office on December 1st and 2nd. As a result, the following migration dates are available, and we can move any or all of your 12 hosts in a single day (or more) depending on your team's service needs. With a number of the remaining hosts being primary/secondary to one another, I imagine it is best to move all redundant nodes on one day, and then move the primary nodes a day or two later.

That leaves us with: 2025-11-20, 2025-11-21, 2025-12-03, 2025-12-04. If possible, I'd like to move everything and get this done by the first week of December.

Thanks in advance!

> We don't want to move anything the day before a holiday or weekend, as it doesn't allow for a followup fix if anything strange occurs. Additionally, I'll be out of the office on December 1st and 2nd. As a result, the following migration dates are available, and we can move any or all of your 12 hosts in a single day (or more) depending on your team's service needs. With a number of the remaining hosts being primary/secondary to one another, I imagine it is best to move all redundant nodes on one day, and then move the primary nodes a day or two later.

Yes to the latter. Following the staggered pattern you describe, these can be migrated at dcops' earliest convenience. Let's aim for about 24h between hosts in the same cluster (same hostname pattern) if possible.
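To illustrate the staggering suggested above, here is a minimal Python sketch that splits the task's host list into waves so that no two hosts from the same cluster land in the same wave. It is not an agreed schedule. The helper names (`cluster_of`, `plan_waves`) are hypothetical, and treating "same hostname pattern" as "hostname with trailing digits stripped" is an assumption.

```python
import re
from collections import defaultdict

# Host list copied from the task description.
HOSTS = [
    "alert1002", "kafka-logging1002", "kafka-logging1003", "logging-hd1003",
    "logstash1034", "logstash1035", "mwlog1002", "prometheus1007",
    "prometheus1008", "titan1002",
]


def cluster_of(host):
    """Assumed cluster key: the hostname with its trailing digits removed."""
    return re.sub(r"\d+$", "", host)


def plan_waves(hosts):
    """Assign hosts to waves; the n-th host of each cluster goes into wave n."""
    seen = defaultdict(int)    # cluster -> hosts already placed
    waves = defaultdict(list)  # wave index -> hosts in that wave
    for host in sorted(hosts):
        cluster = cluster_of(host)
        waves[seen[cluster]].append(host)
        seen[cluster] += 1
    return [waves[i] for i in sorted(waves)]


if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(HOSTS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Scheduling consecutive waves on different days, roughly 24 hours apart, gives the requested gap between cluster members. Hosts that need a coordinated date, such as alert1002, would still be scheduled individually.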

> That leaves us with: 2025-11-20, 2025-11-21, 2025-12-03, 2025-12-04. If possible, I'd like to move everything and get this done by the first week of December.

2025-12-03 works for me for alert1002. Could you put an event on the calendar for the time it will begin? There are a few pre/post steps I'll need to do in coordination with the move. Thanks!

I've set a gcal event for 2025-12-03 @ 10AM EST / 15:00 GMT for the alert1002 migration.

RobH added a subscriber: Jclark-ctr.

This migration was completed just now with no issues. Thanks to both @Jclark-ctr and @herron for the on-site part and the Icinga handling!