
eqiad row C/D Data Platform host migrations
Closed, Resolved · Public

Description

The network stacks in eqiad rows C and D are being upgraded to all 10G capable switches. Part of this migration will require all systems on the old switches to be moved to the new switch stack.

In previous migrations, we've stepped through the racks one by one, requiring each sub-team to be present for all affected hosts on the day of the migration. In an effort to better scale with the needs and schedules of multiple teams, we're planning to do this migration slightly differently. Rather than setting a single date for each rack, we're providing each sub-team with a listing of all of its affected hosts, and that sub-team can then provide feedback on the priority and scheduling of the migration of those hosts.

Scheduling Options and Considerations:

  • Provide priority groups for the hosts below, and we can move group 1, then group 2, etc.
  • Provide specific dates and times for the migrations, and we can coordinate the migration of the required host(s).
  • A mix of the above: easier hosts could be placed in priority groups, while high-priority or critical hosts have specific dates/times set.

The checklist of migration steps for each host is still being developed, and it won't be pasted into each host's task in advance of the move (if there were an adjustment, that would be a lot of tasks to update).

The host list is also available on the Google Sheet listing of all affected hosts.

Host(s) List:

Host | Rack | Notes regarding role or host | Action required
an-backup-datanode1001 | D2 | decommissioned | None
an-backup-datanode1003 | D7 | decommissioned | None
an-backup-datanode1034 | C2 | decommissioned | None
an-backup-datanode1035 | C4 | decommissioned | None
an-backup-datanode1036 | C4 | decommissioned | None
an-backup-datanode1037 | C4 | decommissioned | None
an-backup-datanode1038 | C4 | decommissioned | None
an-backup-datanode1039 | C7 | decommissioned | None
an-backup-datanode1040 | C7 | decommissioned | None
an-backup-datanode1041 | C2 | decommissioned | None
an-backup-datanode1042 | D2 | decommissioned | None
an-backup-datanode1043 | D4 | decommissioned | None
an-backup-datanode1044 | D4 | decommissioned | None
an-backup-datanode1045 | D7 | decommissioned | None
an-backup-datanode1046 | D7 | decommissioned | None
an-backup-namenode1001 | C4 | decommissioned | None
an-backup-namenode1002 | D7 | decommissioned | None
an-conf1006 | C3 | 1/3 nodes - should be fine to take down a single node at a time | None
an-druid1005 | D3 | 1/5 nodes - should be fine to take down a single node at a time | None
an-master1003 | C2 | Primary HDFS namenode of the production Hadoop cluster | Will probably need a manual failover of the HDFS namenode service to an-master1004
an-master1004 | D7 | Standby HDFS namenode of the production Hadoop cluster | Will need to ensure that an-master1003 is the active namenode, or carry out a manual failover if not
an-presto1019 | C4 | 1/15 nodes - should be fine to take down a single node at a time | None
an-presto1020 | D2 | 1/15 nodes - should be fine to take down a single node at a time | None
an-redacteddb1001 | D2 | SPOF. Analytics dedicated copy of the wikireplicas | Avoid the first few days of the month, if possible
an-test-coord1001 | D3 | SPOF, but only on the test Hadoop cluster - not critical | None
an-test-master1002 | C3 | Standby HDFS namenode of the test Hadoop cluster | Will need to ensure that an-test-master1001 is the active namenode, or carry out a failover if not
an-test-worker1002 | C5 | 1/3 nodes - should be fine to take down a single node at a time | None
an-test-worker1003 | D6 | 1/3 nodes - should be fine to take down a single node at a time | None
an-worker1088 | C2 | decommissioned | None
an-worker1089 | C4 | decommissioned | None
an-worker1090 | C4 | decommissioned | None
an-worker1091 | C7 | decommissioned | None
an-worker1092 | D2 | decommissioned | None
an-worker1093 | D2 | decommissioned | None
an-worker1094 | D7 | decommissioned | None
an-worker1095 | D7 | decommissioned | None
an-worker1131 | C2 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1132 | C2 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1133 | C7 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1134 | D4 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1135 | D4 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1136 | D4 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1137 | D4 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1138 | D4 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1151 | C7 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1152 | D7 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1175 | D2 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1180 | C7 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1209 | D2 | 1/~120 nodes - should be fine to take down a single node at a time | None
an-worker1234 | C2 | 1/~120 nodes - should be fine to take down a single node at a time | None
druid1008 | D6 | decommissioned | None
druid1012 | C2 | 1/5 nodes - should be fine to take down a single node at a time | Ideally, depool before and pool afterwards
druid1013 | D2 | 1/5 nodes - should be fine to take down a single node at a time | Ideally, depool before and pool afterwards
stat1011 | C7 | SPOF - one of the analytics clients | Will require notifying users of downtime ahead of time
dse-k8s-worker1003 | C4 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook
dse-k8s-worker1004 | D4 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook
dse-k8s-worker1010 | D6 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook
dse-k8s-worker1011 | C5 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook
dse-k8s-worker1013 | C5 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook
dse-k8s-worker1018 | D3 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook
dse-k8s-worker1019 | C6 | 1/19 k8s workers | Will require draining the host with the sre.k8s.pool-depool-node cookbook

Event Timeline

RobH added a subscriber: BTullis.

@BTullis,

After asking Guillaume, he said I should work with you as the point of contact for these migrations (though you would still be discussing them within your team). I'm assigning this to you for coordination on how we should migrate the hosts listed above. Please review and provide feedback/questions.

Thanks!

@BTullis,

This is by far the most detailed overview of the host lists provided so far, thank you! I'll review the above list with both John and Valerie to determine the order in which they'll tackle them.

For all the items with 'should be fine to take down a single node at a time': do we need to run any kind of depool on the host before we move the network port, or can they just experience a moment or two of network connectivity loss during uptime? Ideally we'd leave them alone and just move the network cable while running the update script, so just a minute or two of network connectivity loss.

@BTullis For an-test-master1002, do we need to fail over to itself when we move it, or is that a typo?

@BTullis,
For all the items with 'should be fine to take down a single node at a time': do we need to run any kind of depool on the host before we move the network port, or can they just experience a moment or two of network connectivity loss during uptime? Ideally we'd leave them alone and just move the network cable while running the update script, so just a minute or two of network connectivity loss.

Thanks @RobH - Leaving them alone and just moving the cable will be fine for the an-worker hosts. There is a procedure for temporarily excluding these hosts from the Hadoop cluster (HDFS and YARN), but for a minute or two of connectivity loss it's not really worth it. We could add a short downtime to all of the affected hosts ahead of time to minimise alerts, but even that's not crucial. I can leave that decision up to you.
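For reference, the downtime notices later in this task appear to come from the standard sre.hosts.downtime cookbook run on a cumin host; a sketch of what such a pre-emptive downtime might look like (the duration, reason and host query here are illustrative, not taken from this task):

sudo cookbook sre.hosts.downtime --minutes 30 -r "Prepping for switch swap" 'an-worker11[31-38].eqiad.wmnet'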

For all the items with 'should be fine to take down a single node at a time': do we need to run any kind of depool on the host before we move the network port, or can they just experience a moment or two of network connectivity loss during uptime?

It's also fine to carry out the work on the following hosts without any pre/post actions:

  • an-presto10[19-20]
  • an-druid1005
  • an-conf1006
  • an-test-worker100[2-3]

For the hosts druid101[2-3]: these are LVS realservers, so the depool and pool commands are available. It might be a good idea to use them for these two hosts, so that we don't get any alerts from pybal checks etc.
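A minimal sketch of that sequence on each of the two hosts (assuming the standard pool/depool wrapper commands present on LVS realservers, as mentioned above):

sudo depool   # remove the host from the LVS pool so pybal stops routing traffic to it
# ... the switch port is moved and connectivity is confirmed ...
sudo pool     # return the host to rotation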

@BTullis For an-test-master1002, do we need to fail over to itself when we move it, or is that a typo?

Good catch, thanks. It was a typo, but I have also clarified this in the table now.
These an-master and an-test-master hosts are always in a hot-standby state: one of each pair is designated as the primary (an-master1003 and an-test-master1001) and the other in each pair is the standby server.

It should always be safe to restart the standby server, but it's worth just checking that we're not in a failed-over state before doing so.
Sometimes the failover/failback of the production Hadoop namenodes (an-master100[3-4]) can be a bit fraught, as per T310293: HDFS Namenode fail-back failure, so we might benefit from a run-up at it.
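The state check itself is a one-liner; the same command appears with real output later in this task:

sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState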

Please note this migration has shifted from an Oct 15th start date to a Nov 1st start date.

Regarding the an-presto move issue: the link came up on both sides, as I can see in the logs. I do notice this on the switch, though:

2025-11-06T17:02:22.932 sr_l2_mac_mgr: bridgetable|6311|N: A duplicate MAC address 04:32:01:DC:14:00 was detected on sub-interface ethernet-1/18.1022.

It may just be a short-term thing before the old MAC entry learnt from the asw ages out. We can retry the move and investigate more next week.

@BTullis,

I wanted to move dse-k8s-worker1010.eqiad.wmnet today as part of the migration, but the details in this task simply read: Will require draining the host with the sre.k8s.pool-depool-node cookbook.

However, I'm not 100% sure about running this, as I'm unable to determine which cluster a given node belongs to:

robh@cumin2002:~$ sudo cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-eqiad depool dse-k8s-worker1010.eqiad.wmnet
....
RuntimeError: Cannot find the hosts dse-k8s-worker1010.eqiad.wmnet among any k8s nodes in cluster wikikube-eqiad

So questions:

  • How can I determine which cluster a given host belongs to, so I can include it in my depool command?
  • Should I repool it after the move with the opposing pool command (i.e. just add it back to the cluster it was depooled from)?
  • Will the cookbook run until the host is depooled, or do I need to check pool status outside of the cookbook run before we move the host?
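For illustration only: if the node belongs to the DSE cluster rather than wikikube, the invocation would presumably name that cluster instead. The cluster name dse-k8s-eqiad below is an assumption, not confirmed anywhere in this task:

sudo cookbook sre.k8s.pool-depool-node --k8s-cluster dse-k8s-eqiad depool dse-k8s-worker1010.eqiad.wmnet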

@BTullis,

Please note we now only have 12 data platform hosts remaining for migration. I still need clarification on the dse-k8s-worker depooling questions from my earlier comment: how to determine which cluster a given host belongs to, whether to repool after the move with the opposing pool command, and whether the cookbook runs until the host is depooled.

In addition to the above clarification for depooling/repooling dse nodes, we have the following nodes:

an-master1003
an-master1004
an-redacteddb1001
an-test-coord1001
an-test-master1002
stat1011

All of these are detailed as SPOF or critical, so I wanted to check with you on how to proceed with them. Can you detail which ones we can move this week, and in what order/cadence? We're happy to move these on our own timeline (during our workdays), or we can set a specific date/time for each host.

Please advise on both the above SPOF host scheduling and dse depooling commands, thank you!

@BTullis,

We're now down to 44 hosts overall to migrate, and 12 of those belong to your team.

Please note that for all migrations I run the Icinga downtime cookbook for 10 minutes, even though the actual move takes only seconds. Each host tends to lose connectivity for less than 10 seconds, and open SSH sessions to it resume uninterrupted as long as you aren't sending a command during those 10 seconds. We ping each host during the migration to ensure it comes back online on its new network port (its IP, speed and other details remain unchanged).
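For instance, a simple connectivity watch during the move might look like this (the host name is illustrative); expect a gap of under 10 seconds, after which replies resume:

ping an-worker1131.eqiad.wmnet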

Potential Migration Dates:

We don't want to move anything the day before a holiday or weekend, as that doesn't allow for a follow-up fix if anything strange occurs. Additionally, I'll be out of the office on December 1st and 2nd. As a result, the following migration dates are available, and we can move any or all of your 12 hosts in a single day (or more) depending on your team's service needs. Since a number of the remaining hosts are primary/secondary to one another, I imagine it is best to move all redundant nodes on one day, and then move the primary nodes a day or two later.

That leaves us with: 2025-11-20, 2025-11-21, 2025-12-03, 2025-12-04. If possible, I'd like to move everything and get this done by the first week of December.

an-master1003
Provided: Primary HDFS namenode of the production Hadoop cluster; will probably need a manual failover of the HDFS namenode service to an-master1004.
Please see the details for an-master1004 below; once you detail how we move it and fail over to it, this host will follow the same procedure.

an-master1004
Provided: Standby HDFS namenode of the production Hadoop cluster; will need to ensure that an-master1003 is the active namenode, or carry out a failover if not.
Question: Does this mean we can just migrate an-master1004 at any time? If so, I'd like to migrate it on either 2025-11-20 or 2025-11-24. Once it is moved, I'd like your team to fail the HDFS primary over from an-master1003 to an-master1004, and then we can migrate an-master1003 the day after. Would this work? If so, please select a date.

an-redacteddb1001
Provided: SPOF. Analytics dedicated copy of the wikireplicas; avoid the first few days of the month, if possible.
Question: Do we need to do anything other than the Icinga maint mode cookbook for this? If that's all that's needed, let's plan to move it on Wednesday, 2025-12-03. Please detail anything I need to do prior to its migration on the proposed date.

an-test-coord1001
Provided: SPOF, but only on the test Hadoop cluster - not critical
Question: What do I need to do before we migrate this? Do you want us to migrate an-test-master1002 first on a set date, so that your team can fail this host's services over to it, and then we migrate this host the day after an-test-master1002?

an-test-master1002
Provided: Standby HDFS namenode of the test Hadoop cluster; will need to ensure that an-test-master1001 is the active namenode, or carry out a failover if not.
Question: What do I need to do before we migrate this?

stat1011
Provided: SPOF - one of the analytics clients; requires scheduling.
Question: What exactly do I need to do for this migration (in addition to the Icinga downtime cookbook), and can we schedule it on one of the dates listed above?

Sent an email to @BTullis to ensure he is aware of these 12 hosts pending his feedback, subject line: Need IF feedback for 12 remaining hosts since November 12th

IRC Update:

@BTullis and I have chatted via IRC and worked out scheduling for the 12 remaining hosts:

  • 8 will be migrated on Monday 2025-11-24 @ 18:45 GMT
    • Ben will take care of depooling these from service in advance of the work but leave them powered up with OS responsive to pings.
  • Rob checks in with Ben via this task or IRC to ensure all failovers are complete before the second window starts
  • 4 will be migrated on Tuesday 2025-11-25 @ 18:00 GMT

This will complete the Data Platform host migrations for this project.

Apologies for the delays in responding.
After discussion with @RobH we have decided on the following:

At 2025-11-24 @ 18:45 GMT we will migrate the following hosts:

  • an-test-master1002
  • dse-k8s-worker10[11,13,19]
  • stat1011
  • an-redacteddb1001
  • an-test-coord1001
  • an-master1004

Then, on the following day, I will perform a manual failover of the Hadoop namenode service from an-master1003 to an-master1004.
Assuming this failover is fine, I will be able to give you the go-ahead to do the remaining hosts.

This is planned for 2025-11-25 @ 18:00 GMT, when we will migrate the remaining hosts:

  • dse-k8s-worker10[04,10,18]
  • an-master1003

I have drained dse-k8s-worker10[11,13,19] prior to this afternoon's maintenance.
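For reference, a drain of this kind can be done with kubectl from the deployment server (a sketch; the exact invocation used here is not recorded in this task, and the flags shown are illustrative):

root@deploy2002:~# kubectl drain --ignore-daemonsets --delete-emptydir-data dse-k8s-worker1011.eqiad.wmnet

The Ready,SchedulingDisabled status for the three drained workers in the output below is the expected result of such a cordon/drain.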

root@deploy2002:~# kubectl get nodes
NAME                             STATUS                     ROLES           AGE      VERSION
dse-k8s-ctrl1001.eqiad.wmnet     Ready                      control-plane   2y274d   v1.23.14
dse-k8s-ctrl1002.eqiad.wmnet     Ready                      control-plane   2y274d   v1.23.14
dse-k8s-worker1001.eqiad.wmnet   Ready                      <none>          2y274d   v1.23.14
dse-k8s-worker1002.eqiad.wmnet   Ready                      <none>          2y274d   v1.23.14
dse-k8s-worker1003.eqiad.wmnet   Ready                      <none>          2y274d   v1.23.14
dse-k8s-worker1004.eqiad.wmnet   Ready                      <none>          2y274d   v1.23.14
dse-k8s-worker1005.eqiad.wmnet   Ready                      <none>          2y270d   v1.23.14
dse-k8s-worker1006.eqiad.wmnet   Ready                      <none>          2y270d   v1.23.14
dse-k8s-worker1007.eqiad.wmnet   Ready                      <none>          2y270d   v1.23.14
dse-k8s-worker1008.eqiad.wmnet   Ready                      <none>          2y270d   v1.23.14
dse-k8s-worker1009.eqiad.wmnet   Ready                      <none>          450d     v1.23.14
dse-k8s-worker1010.eqiad.wmnet   Ready                      <none>          189d     v1.23.14
dse-k8s-worker1011.eqiad.wmnet   Ready,SchedulingDisabled   <none>          188d     v1.23.14
dse-k8s-worker1012.eqiad.wmnet   Ready                      <none>          165d     v1.23.14
dse-k8s-worker1013.eqiad.wmnet   Ready,SchedulingDisabled   <none>          165d     v1.23.14
dse-k8s-worker1014.eqiad.wmnet   Ready                      <none>          63d      v1.23.14
dse-k8s-worker1015.eqiad.wmnet   Ready                      <none>          110d     v1.23.14
dse-k8s-worker1016.eqiad.wmnet   Ready                      <none>          110d     v1.23.14
dse-k8s-worker1017.eqiad.wmnet   Ready                      <none>          110d     v1.23.14
dse-k8s-worker1018.eqiad.wmnet   Ready                      <none>          110d     v1.23.14
dse-k8s-worker1019.eqiad.wmnet   Ready,SchedulingDisabled   <none>          110d     v1.23.14

I have also verified that an-master1003 is currently the active HDFS namenode.

btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1003.eqiad.wmnet:8040                     active    
an-master1004.eqiad.wmnet:8040                     standby

I notified users of possible disturbance for stat1011 this afternoon, so I think that we're good to go.

Icinga downtime and Alertmanager silence (ID=77fc5d5e-4014-4521-90fb-3e67d8114900) set by btullis@cumin1003 for 4:00:00 on 3 host(s) and their services with reason: Prepping for switch swap

dse-k8s-worker[1011,1013,1019].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=7d21afc7-5634-452f-ae59-c9787b2c0108) set by btullis@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Prepping for switch swap

an-test-master1002.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=2ceb8409-0adc-48e2-b350-9299f0cfd430) set by btullis@cumin1003 for 4:00:00 on 1 host(s) and their services with reason: Prepping for switch swap

stat1011.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=a41ee425-7380-4cb9-8254-04c2c38218ab) set by btullis@cumin1003 for 4:00:00 on 3 host(s) and their services with reason: Prepping for switch swap

an-master1004.eqiad.wmnet,an-redacteddb1001.eqiad.wmnet,an-test-coord1001.eqiad.wmnet

an-test-master1002
dse-k8s-worker1011
dse-k8s-worker1013
dse-k8s-worker1019
stat1011
an-redacteddb1001
an-test-coord1001
an-master1004

These servers have been migrated with the assistance of Ben.

Mentioned in SAL (#wikimedia-analytics) [2025-11-24T22:09:08Z] <btullis> failing over the Hadoop nameserver from an-master1003 to an-master1004 for T405943

I have failed over the active namenode, so an-master1003 is now ready for the network cable move.

btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1003.eqiad.wmnet:8040                     active    
an-master1004.eqiad.wmnet:8040                     standby   

btullis@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1003-eqiad-wmnet an-master1004-eqiad-wmnet
Failover to NameNode at an-master1004.eqiad.wmnet/10.64.53.14:8040 successful

btullis@an-master1003:~$ sudo kerberos-run-command hdfs /usr/bin/hdfs haadmin -getAllServiceState
an-master1003.eqiad.wmnet:8040                     standby   
an-master1004.eqiad.wmnet:8040                     active    
btullis@an-master1003:~$

Finished moving the last servers for this ticket:

an-master1003
dse-k8s-worker1018
dse-k8s-worker1004
dse-k8s-worker1010
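For completeness, the corresponding repool step for the drained k8s workers is not shown in this task; it would presumably be either the pool action of the sre.k8s.pool-depool-node cookbook or a plain uncordon (a sketch, host name illustrative):

root@deploy2002:~# kubectl uncordon dse-k8s-worker1004.eqiad.wmnet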

Jclark-ctr claimed this task.