eqiad row C/D Service Ops host migrations
Closed, Resolved, Public

Description

The network stacks in eqiad rows C and D are being upgraded to all 10G capable switches. Part of this migration will require all systems on the old switches to be moved to the new switch stack.

In previous migrations, we've stepped through the racks one by one, requiring each sub-team to be present for all affected hosts on the day of the migration. In an effort to better scale with the needs and schedules of multiple teams, we're planning to do this migration slightly differently. Rather than setting a single date for each rack, we're providing each sub-team with a list of all of its affected hosts, and that sub-team can then provide feedback on the priority and scheduling of the migration of those hosts.

Scheduling Options and Considerations:

  • Provide priority groups for the hosts below, and we can move group 1, then group 2, and so on.
  • Provide specific dates and times for the migrations, and we can coordinate the migration of the required host(s).
  • A mix of the above: easier hosts could go into priority groups, while high-priority or critical hosts could have specific dates/times set.

The checklist of migration steps for each host is still being developed and won't be pasted into each host's task in advance of the move (if it needs adjusting, that would be a lot of tasks to update).

The host list is also available in the Google Sheet listing all affected hosts.

Host(s) List:
conf1009 D3
kafka-main1008 C3
kafka-main1009 D3
kubestage1004 D3
mc-gp1005 C2
mc-gp1006 D7
mc-wf1002 D8
mc1045 C2
mc1046 C2
mc1047 C2
mc1048 C2
mc1049 C4
mc1050 C4
mc1051 D4
mc1052 D4
mc1053 D4
mc1054 D4
rdb1012 D1
rdb1014 D3
wikikube-ctrl1001 D7
wikikube-ctrl1003 C2
wikikube-worker1004 D8
wikikube-worker1016 C3
wikikube-worker1019 D8
wikikube-worker1020 D8
wikikube-worker1034 D3
wikikube-worker1036 C6
wikikube-worker1037 D8
wikikube-worker1051 C6
wikikube-worker1052 C6
wikikube-worker1053 C6
wikikube-worker1054 C6
wikikube-worker1055 C6
wikikube-worker1062 C3
wikikube-worker1063 C3
wikikube-worker1067 D8
wikikube-worker1068 D8
wikikube-worker1069 D8
wikikube-worker1070 D8
wikikube-worker1071 D8
wikikube-worker1083 C6
wikikube-worker1096 D8
wikikube-worker1097 D8
wikikube-worker1107 D8
wikikube-worker1108 D8
wikikube-worker1109 D8
wikikube-worker1110 D8
wikikube-worker1135 C5
wikikube-worker1136 C5
wikikube-worker1137 C5
wikikube-worker1138 C5
wikikube-worker1139 C5
wikikube-worker1140 D1
wikikube-worker1141 D1
wikikube-worker1154 C5
wikikube-worker1155 C5
wikikube-worker1156 C5
wikikube-worker1157 C3
wikikube-worker1159 D3
wikikube-worker1160 D1
wikikube-worker1161 D1
wikikube-worker1162 D3
wikikube-worker1163 D3
wikikube-worker1164 D8
wikikube-worker1165 D8
wikikube-worker1167 D8
wikikube-worker1168 D8
wikikube-worker1260 C6
wikikube-worker1261 C6
wikikube-worker1262 C6
wikikube-worker1263 C6
wikikube-worker1264 C6
wikikube-worker1265 C6
wikikube-worker1266 C6
wikikube-worker1267 C6
wikikube-worker1268 C6
wikikube-worker1269 C6
wikikube-worker1270 D1
wikikube-worker1271 D1
wikikube-worker1272 D1
wikikube-worker1273 D1
wikikube-worker1274 D1
wikikube-worker1275 D1
wikikube-worker1305 C3
wikikube-worker1306 C5
wikikube-worker1313 C3

Event Timeline

Clement_Goubert changed the task status from Open to In Progress. Oct 2 2025, 10:55 AM
Clement_Goubert claimed this task.
Clement_Goubert triaged this task as Medium priority.

wikikube-ctrl1001 is waiting for decom/derack and can probably be removed from this list (cf T383227)
We'll try to get this done ASAP.

Tagging @jijiki for mc hosts, @akosiaris for rdb hosts, @brouberol for kafka hosts, @Scott_French for confd hosts

conf1009 is (1) a member of eqiad main-etcd cluster, so clients will attempt to issue writes to it, (2) the upstream source for etcd-mirror replication to codfw, and (3) not a PyBal config host. Also, it sounds like the connectivity disruption should be very brief, similar to previous migrations (i.e., switch-side of the cable just moves from adjacent old device to new).

Given that, the simplest reasonable sequence would be:

  • Silence EtcdReplicationDown
  • Migrate to the new switch
  • On "done" from DC-Ops, in parallel:
    • Test conf1009 (e.g., check cluster membership on peers, probe conf1009 with a quorum-read; see the sketch after this list)
    • Check the health of etcd-mirror on conf2005 and restart if needed
  • Delete silence
  • Restart eqiad-associated confds and navtiming, verify Liberica control-plane daemons are healthy
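
Roughly what that looks like on the command line (a sketch only; the alert matcher, etcdctl endpoint/port, and the etcd-mirror unit name are assumptions rather than the exact commands I'll run):

# silence the replication alert for the migration window
amtool silence add alertname=EtcdReplicationDown --duration=2h --comment='conf1009 ToR migration - T405950'
# on "done" from DC-Ops: confirm conf1009 rejoined and answers quorum reads (assumes the v3 client API on the usual 2379 port)
etcdctl --endpoints=https://conf1009.eqiad.wmnet:2379 endpoint health
etcdctl --endpoints=https://conf1009.eqiad.wmnet:2379 member list
# on conf2005: find the etcd-mirror unit, check it, and restart only if it hasn't recovered on its own
systemctl list-units 'etcd*'
sudo systemctl restart <etcd-mirror-unit>   # placeholder unit name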

Rationale:

  • This is only a single cluster member (i.e., the cluster will maintain quorum, though there may be an election), so there's no strong justification to temporarily point eqiad-associated clients at codfw, particularly given the amount of effort and disruption (e.g., read-only periods) involved.
  • We could temporarily point etcd-mirror at a different member, but I don't think that's worth the effort either. The timed-watch strategy it uses tends to be remarkably robust to "temporarily pull the network cable" (more so than to a proper upstream connection close). Also, the "check and restart if needed" approach worked well when the recent round of nginx updates reached conf1009.
  • There is some risk that if replication is disrupted and a large influx of writes pushes us out of the 1000 event log window, then recovery is a bit involved. We could rule this out by making etcd read-only during the migration. However, that turns the "oops, your client hit a transient error attempting to communicate with conf1009" into "full write unavailability," which again seems rather disruptive for a low-likelihood issue.

In any case, I'll give some thought to timing and follow up in the sheet. It's only the last point that I'm still on the fence about (i.e., whether to bracket the migration with read-only), but (1) that's technically easy to do if we choose to do it and (2) should not have much of an impact on scheduling (it mostly just requires additional upfront comms).

@Clement_Goubert,

Just checking in, as there hasn't been any update to the Google Sheet for the ServiceOps new hosts yet.

I've added notes that wikikube-ctrl1001 is pending decom and that the migration of conf1009 should be coordinated directly with @Scott_French. We still need details for the other ~80 hosts, though.

This work was originally slated to start this week, but we've pushed it back to the first of November. We would still like to get all the migration details worked out this week in advance of the actual migration start.

Please let me know if there is anything I can do to assist with getting this figured out!

I'm so sorry I haven't got around to it. Doing it now.

Done for all wikikube-worker and wikikube-ctrl hosts. I can make myself available when you do it, or you can ping anyone from the team; I'll brief them on what needs to be done.

Sorry for the wait!

Please note this migration has shifted from Oct 15th start date to Nov 1 start date.

@Clement_Goubert,

Is it possible that I could send the commands for this, or do we need someone on your team? If we need someone on your team, could we schedule an hour or so for this tomorrow (Tuesday, Nov 18th) at a 17:00 GMT start time? (So 9:00 AM Pacific?)

Apart from the drain (which I've not done, and which you've indicated someone on your team should do), the other steps to move a host take about 5 minutes total per host, with network connectivity loss of approximately 1 minute or less. (We ping the host during the move and it misses fewer than 12 ping sequence numbers.)
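
(If you want to watch the gap yourself, something like the following works; the hostname is just an example and the flags are standard iputils ping:)

ping -D -O wikikube-worker1016.eqiad.wmnet   # -D prints timestamps, -O reports each missed reply as it happens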

Yeah, I'll be available to help. The time sounds good to me. I can also show you how to do the drain so you can be autonomous with this going forward; your call.

Awesome! We're also moving dns1006 at the same time (we'll move it first while the k8s hosts drain), and then we'll move on to these! I'll ping you in about 70 minutes for the start of the window at 17:00 GMT. Thanks!

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker1016.eqiad.wmnet completed:

  • wikikube-worker1016.eqiad.wmnet (PASS)
    • Host wikikube-worker1016.eqiad.wmnet depooled from wikikube-eqiad

Change #1206929 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes::node: Use netmask to determine network topology

https://gerrit.wikimedia.org/r/1206929

Change #1206929 merged by Clément Goubert:

[operations/puppet@production] kubernetes::node: Use netmask to determine network topology

https://gerrit.wikimedia.org/r/1206929

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker1016.eqiad.wmnet completed:

  • wikikube-worker1016.eqiad.wmnet (PASS)
    • Host wikikube-worker1016.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-worker1016.eqiad.wmnet completed:

  • wikikube-worker1016.eqiad.wmnet (PASS)
    • Host wikikube-worker1016.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1063.eqiad.wmnet completed:

  • wikikube-worker1063.eqiad.wmnet (PASS)
    • Host wikikube-worker1063.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1063.eqiad.wmnet completed:

  • wikikube-worker1063.eqiad.wmnet (PASS)
    • Host wikikube-worker1063.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1305.eqiad.wmnet completed:

  • wikikube-worker1305.eqiad.wmnet (PASS)
    • Host wikikube-worker1305.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1313.eqiad.wmnet completed:

  • wikikube-worker1313.eqiad.wmnet (PASS)
    • Host wikikube-worker1313.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1157.eqiad.wmnet completed:

  • wikikube-worker1157.eqiad.wmnet (PASS)
    • Host wikikube-worker1157.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1305.eqiad.wmnet completed:

  • wikikube-worker1305.eqiad.wmnet (PASS)
    • Host wikikube-worker1305.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1313.eqiad.wmnet completed:

  • wikikube-worker1313.eqiad.wmnet (PASS)
    • Host wikikube-worker1313.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1157.eqiad.wmnet completed:

  • wikikube-worker1157.eqiad.wmnet (PASS)
    • Host wikikube-worker1157.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1157.eqiad.wmnet completed:

  • wikikube-worker1157.eqiad.wmnet (PASS)
    • Host wikikube-worker1157.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1254-1256].eqiad.wmnet completed:

  • wikikube-worker[1254-1256].eqiad.wmnet (PASS)
    • Host wikikube-worker[1254-1256].eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1306.eqiad.wmnet completed:

  • wikikube-worker1306.eqiad.wmnet (PASS)
    • Host wikikube-worker1306.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1306.eqiad.wmnet completed:

  • wikikube-worker1306.eqiad.wmnet (PASS)
    • Host wikikube-worker1306.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1254-1256].eqiad.wmnet completed:

  • wikikube-worker[1254-1256].eqiad.wmnet (PASS)
    • Host wikikube-worker[1254-1256].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1254-1256].eqiad.wmnet completed:

  • wikikube-worker[1254-1256].eqiad.wmnet (PASS)
    • Host wikikube-worker[1254-1256].eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1306.eqiad.wmnet completed:

  • wikikube-worker1306.eqiad.wmnet (PASS)
    • Host wikikube-worker1306.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1016.eqiad.wmnet completed:

  • wikikube-worker1016.eqiad.wmnet (PASS)
    • Host wikikube-worker1016.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1016.eqiad.wmnet completed:

  • wikikube-worker1016.eqiad.wmnet (PASS)
    • Host wikikube-worker1016.eqiad.wmnet pooled in wikikube-eqiad

IRC Discussion Update:

We moved about half the wikikube workers today after a sync up with Clement and Cathal on the particulars of the wikikube networking. We'll be able to move the remainder on Wednesday and Thursday without direct coordination with Clement since they've updated us with the relevant (de)pooling directions.

@RLazarus and I were looking into verifying workloads returning to the migrated workers, and ran into a few surprises.

Going by what's been marked complete in the sheet from today, there appears to be a difference between what was depooled and what was ostensibly migrated.

Specifically, it looks like the following, where - indicates "in the sheet, not depooled" and + indicates "not in the sheet, depooled" (no mark indicates agreement):

+wikikube-worker1016
 wikikube-worker1063
-wikikube-worker1135
-wikikube-worker1136
-wikikube-worker1137
-wikikube-worker1138
-wikikube-worker1139
-wikikube-worker1154
-wikikube-worker1155
-wikikube-worker1156
 wikikube-worker1157
+wikikube-worker1254
+wikikube-worker1255
+wikikube-worker1256
 wikikube-worker1305
 wikikube-worker1306
 wikikube-worker1313

So, there appear to be 4 workers depooled that weren't in scope for migration, and 8 workers migrated that weren't depooled (of which 3 look like they may be a typo in the depool command, i.e., wikikube-worker[1254-1256] vs. wikikube-worker[1154-1156]).

In any case, as far as we can tell, workloads on the 8 workers appear to have weathered the disruption.
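
For reference, one way to spot-check a migrated worker (a sketch, assuming kubectl access to the wikikube-eqiad API and node names matching the host FQDNs; the node name here is just an example):

kubectl get node wikikube-worker1157.eqiad.wmnet   # should be Ready and not cordoned
kubectl get pods --all-namespaces --field-selector spec.nodeName=wikikube-worker1157.eqiad.wmnet   # pods scheduled back onto it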

@RobH - Were the 8 that weren't depooled, but appear to be checked off actually deferred to tomorrow? If not (i.e., they were indeed migrated), is there some way we can help prepare the depool commands ahead of time to make sure it's aligned with the planned migrations?

So, while working with Clement, I did indeed send the depool command for the wrong hosts (wikikube-worker[1254-1256]) and then migrated the planned hosts (wikikube-worker[1154-1156]). Clement caught my mistake AFTER I had already migrated them, so it was noticed and corrected for future moves, but that didn't help today.

End result: some hosts (wikikube-worker[1154-1156]) were migrated without being depooled, and some hosts (wikikube-worker[1254-1256]) were depooled and repooled today and will be depooled again and migrated tomorrow.

Sorry for the confusion; with the hangouts and multiple folks involved, I simply ended up sending the wrong command. Human error on my part, and now that I've made it once, I'm unlikely to make it again. Hope that explains everything; if not, let me know!

Got it - thanks for clarifying, @RobH! Alright, in that case, let us know if you'd like a second pair of eyes on anything ahead of the next wave of migrations.

I think it'll be OK when we move things tomorrow; since I know exactly the mistake I made, I don't think I'll make it again for a few months minimum ;D

The current plan is to start moving some hosts tomorrow after 17:00 GMT, and we'll announce in the #wikimedia-dcops channel and -sre channels in irc before starting.

Depooling wikikube in rack C6:

sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker126[0-9].eqiad.wmnet
sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker10[36,51,52,54,54,55,83].eqiad.wmnet

Covering:
wikikube-worker1260
wikikube-worker1261
wikikube-worker1262
wikikube-worker1263
wikikube-worker1264
wikikube-worker1265
wikikube-worker1266
wikikube-worker1267
wikikube-worker1268
wikikube-worker1269

wikikube-worker1036
wikikube-worker1051
wikikube-worker1052
wikikube-worker1053
wikikube-worker1054
wikikube-worker1055
wikikube-worker1083

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1260-1269].eqiad.wmnet completed:

  • wikikube-worker[1260-1269].eqiad.wmnet (PASS)
    • Host wikikube-worker[1260-1269].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet completed:

  • wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet (PASS)
    • Host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1260-1269].eqiad.wmnet completed:

  • wikikube-worker[1260-1269].eqiad.wmnet (PASS)
    • Host wikikube-worker[1260-1269].eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet completed:

  • wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet (PASS)
    • Host wikikube-worker[1036,1051-1052,1054-1055,1083].eqiad.wmnet pooled in wikikube-eqiad

Ran:

sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad pool wikikube-worker126[0-9].eqiad.wmnet
sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad pool wikikube-worker10[36,51,52,54,54,55,83].eqiad.wmnet

They are repooled.

Going to depool wikikube in rack eqiad D1 for port migrations.

sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker11[40,41,60,61].eqiad.wmnet
sudo cookbook sre.k8s.pool-depool-node -t T405950 -r 'network migration' --k8s-cluster wikikube-eqiad depool wikikube-worker127[0-5].eqiad.wmnet

Covering:
wikikube-worker1140
wikikube-worker1141
wikikube-worker1160
wikikube-worker1161

wikikube-worker1270
wikikube-worker1271
wikikube-worker1272
wikikube-worker1273
wikikube-worker1274
wikikube-worker1275

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet completed:

  • wikikube-worker[1140-1141,1160-1161].eqiad.wmnet (PASS)
    • Host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1270-1275].eqiad.wmnet completed:

  • wikikube-worker[1270-1275].eqiad.wmnet (PASS)
    • Host wikikube-worker[1270-1275].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet completed:

  • wikikube-worker[1140-1141,1160-1161].eqiad.wmnet (PASS)
    • Host wikikube-worker[1140-1141,1160-1161].eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1270-1275].eqiad.wmnet completed:

  • wikikube-worker[1270-1275].eqiad.wmnet (PASS)
    • Host wikikube-worker[1270-1275].eqiad.wmnet pooled in wikikube-eqiad

@brouberol, you were tagged into this task by T405950#11236474 but I don't have any feedback on the migration details for kafka-main1009 other than: "Coordinate with ServiceOps, some badly written clients may need a kick"

We would like to migrate this on either Thursday, Nov 20 or next week Nov 24-26, earlier in the week preferred. Would you detail what needs to be done for this to happen?

If we need to run any depool commands or if you can depool it from use for one of those days, please let us know!

@Scott_French: This seems a bit complex in terms of the checking, so let's definitely set this up for a schedule that works for you as well as us. We're nearing the end of the migration (only 44 hosts left overall), so I'd like to schedule the migration of conf1009 with you (or whoever on your team you think is best).

Potential Migration Dates:

We don't want to move anything the day before a holiday or weekend, as that doesn't allow for a follow-up fix if anything strange occurs. Additionally, I'll be out of the office on December 1st and 2nd. As a result, the following migration dates are available, depending on your team's service needs.

That leaves us with: 2025-11-20, 2025-11-21, 2025-12-03, 2025-12-04. If possible, I'd like to move everything and get this done by the first week of December.

@RobH - Thanks for checking!

I'll also be out 12-01. I see you mentioned 11-21, but that's Friday. Did you mean Monday 11-24?

If so, that (11-24) sounds great to me, or tomorrow (Thursday) 11-20 would also probably work as long as we can give folks enough advance warning, since this can be a bit disruptive.

Also, just to confirm: there are no cookbooks that you run during the disruptive portion of the migration, correct? (i.e., it's just a Netbox change + homer run to update the new ToR immediately beforehand)

I totally messed up the dates in my comment. Corrected: 2025-11-20, 2025-11-24, 2025-11-25, 2025-12-03, 2025-12-04.

So yeah, we can plan for the 24th (Monday), no problem!

Ack, Monday 2025-11-24 it is for conf1009. Thank you!

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker1034.eqiad.wmnet completed:

  • wikikube-worker1034.eqiad.wmnet (PASS)
    • Host wikikube-worker1034.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1159,1162-1163].eqiad.wmnet completed:

  • wikikube-worker[1159,1162-1163].eqiad.wmnet (PASS)
    • Host wikikube-worker[1159,1162-1163].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker1034.eqiad.wmnet completed:

  • wikikube-worker1034.eqiad.wmnet (PASS)
    • Host wikikube-worker1034.eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1159,1162-1163].eqiad.wmnet completed:

  • wikikube-worker[1159,1162-1163].eqiad.wmnet (PASS)
    • Host wikikube-worker[1159,1162-1163].eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet completed:

  • wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet (PASS)
    • Host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 depool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet completed:

  • wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet (PASS)
    • Host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet completed:

  • wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet (PASS)
    • Host wikikube-worker[1107-1110,1164-1165,1167-1168].eqiad.wmnet pooled in wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by robh@cumin2002 pool for host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet completed:

  • wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet (PASS)
    • Host wikikube-worker[1004,1019-1020,1037,1067-1071,1096-1097].eqiad.wmnet pooled in wikikube-eqiad

Please note that all wikikube workers have been migrated, and we're now down to only 4 ServiceOps new hosts left to migrate:

wikikube-ctrl1003
kafka-main1008
kafka-main1009
conf1009

All 4 of these have notes requiring depooling and other service steps, which would best be done at a set date/time that works for a point of contact in ServiceOps new as well as for myself and John.

As these appear to be complex, I'm assuming moving on a Friday is a bad idea. Please correct me if I'm wrong.

Suggested Scheduling: Monday, 2025-11-24 and Tuesday, 2025-11-25 would allow us to split up the kafka-main migrations if needed. If not, we can easily accommodate moving all 4 hosts on a single day from the DC-Ops side of things. For both myself and John to be available, migrating these hosts starting at 17:00 GMT on either day would work best for us.

Please review and provide feedback.

@RobH - Confirming conf1009 for 2025-11-24, but the SRE staff meeting runs from 17:00 - 18:00 UTC. I'd suggest starting no later than 16:30 (and pausing by 17:00) or starting at / after 18:00.

I neglected to take that into account, thank you for noticing it! Let's then plan to do this migration on 2025-11-24 @ 18:15 GMT. This gives folks time to finish the SRE meeting, take a 15-minute break, and then move into the migration window. Sound good?

18:15 UTC sounds good to me. Thank you!

Can we also set a date/time for moving the other three hosts remaining?

wikikube-ctrl1003
kafka-main1008
kafka-main1009

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:16:37Z] <swfrench-wmf> silenced EtcdReplicationDown. f75c71c9-62d3-449f-860a-9b5e4570717a - T405950

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:21:08Z] <swfrench-wmf> manually transferred etcd-mirror replication source to conf1008 - T405950

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:23:32Z] <swfrench@deploy2002> Locking from deployment [ALL REPOSITORIES]: Hold deployments during etcd ToR switch migration - T405950

conf1009 migrated.

@brouberol: Please provide feedback on migration of wikikube-ctrl1003 and kafka-main1008 as these are the last ServiceOps new hosts to migrate:

@brouberol, you were tagged into this task by T405950#11236474 but I don't have any feedback on the migration details for kafka-main1009 other than: "Coordinate with ServiceOps, some badly written clients may need a kick"

We would like to migrate this on either Thursday, Nov 20 or next week Nov 24-26, earlier in the week preferred. Would you detail what needs to be done for this to happen?

If we need to run any depool commands or if you can depool it from use for one of those days, please let us know!

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:31:39Z] <swfrench-wmf> manually transferred etcd-mirror replication source back to conf1009 - T405950

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:32:15Z] <swfrench@deploy2002> Unlocked for deployment [ALL REPOSITORIES]: Hold deployments during etcd ToR switch migration - T405950 (duration: 08m 43s)

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:34:34Z] <swfrench-wmf> begin restarts of eqiad-associated confds, navtiming, requestctl - T405950

Mentioned in SAL (#wikimedia-operations) [2025-11-24T18:36:21Z] <swfrench-wmf> deleted EtcdReplicationDown silence. f75c71c9-62d3-449f-860a-9b5e4570717a - T405950

Correction: we only need feedback from @brouberol for kafka-main1008!

IRC Echo Update (I was chatting with Scott on IRC about this; echoing to the task for history):

  • We want to get feedback from @brouberol on the migration of kafka-main1008 and kafka-main1009.
    • Either exact depool and repool directions for us to follow, or a set date/time for coordinating with @Jclark-ctr to migrate the hosts.
  • @Scott_French will follow up with the ServiceOps new team on how best to migrate wikikube-ctrl1003 and will update this task with either directions for DC-Ops to handle it or a date/time for the migration.

Thanks!

I've chatted with @brouberol via IRC:

11:50 <brouberol> kafka hosts can be shut down / disconnected from the network, but not more than one at a time, to be safe. The metric you want to look at to make sure you can proceed is "under replicated partitions", in https://grafana.wikimedia.org/d/000000027/kafka (scoped to kafka-main-eqiad). If that is the case, then you should be able to move the host.
11:50 <brouberol> When it's back up, the number of under replicated partitions will go back down to 0. Wait a good 10 minutes for good measure, and then proceed with the 2nd one.
11:53 <brouberol> if the # of under replicated partitions is > 0, consider that SRE are on it, and the move cannot proceed until it's back to 0

12:00 <brouberol> One thing to note is that I'd advise evacuating the broker's leadership, with topicmappr rebuild --brokers -2 --topics '.*' --leader-evac-brokers <broker-id> before the operation, rolling back after the broker is back in sync
12:02 <brouberol> this is really to be extra careful. If the network disconnection is only going to be <30s, we can _probably_ rawdog it a bit, but if we really wanted to take precautions, we'd evacuate the leadership to other brokers, disconnect it, wait for the broker to back in sync, revert the partition state to where it was before the operation, for each broker
12:06 <brouberol> To be tested by an SRE before running through the real thing, but from memory, the exact set of commands would be
12:06 <brouberol> - topicmappr rebuild --brokers -2 --topics '.*' --leader-evac-brokers <broker-id> --out-file leadership-evac.json
12:06 <brouberol> - kafka reassign-partitions --reassignment-json-file ./leadership-evac.json --execute # copy the JSON printed to the console to leadership-evac-rollback.json
12:06 <brouberol> - kafka reassign-partitions --reassignment-json-file ./leadership-evac.json --verify
12:06 <brouberol> - kafka preferred-replica-election
12:06 <brouberol> When the broker is back on the network and back to being in sync, then
12:06 <brouberol> - kafka reassign-partitions --reassignment-json-file ./leadership-evac-rollback.json --execute
12:06 <brouberol> - kafka reassign-partitions --reassignment-json-file ./leadership-evac-rollback.json --verify
12:06 <brouberol> - kafka preferred-replica-election
12:08 <brouberol> (wait a good 30s between the --execute and the --verify command. The --execute command will only induce metadata changes, and no data transfer, so this should be really fast)
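
Consolidating the above for whoever runs it (same commands as quoted; <broker-id> is a placeholder, the under-replicated-partitions pre-check assumes the kafka wrapper passes standard kafka-topics flags through, and per the note above this should be dry-run by an SRE first):

# pre-check: proceed only when there are no under-replicated partitions
kafka topics --describe --under-replicated-partitions
# evacuate leadership from the broker about to be moved
topicmappr rebuild --brokers -2 --topics '.*' --leader-evac-brokers <broker-id> --out-file leadership-evac.json
kafka reassign-partitions --reassignment-json-file ./leadership-evac.json --execute
# save the rollback JSON printed by --execute as leadership-evac-rollback.json
kafka reassign-partitions --reassignment-json-file ./leadership-evac.json --verify
kafka preferred-replica-election
# ...migrate the host, then wait ~10 minutes after under-replicated partitions return to 0...
kafka reassign-partitions --reassignment-json-file ./leadership-evac-rollback.json --execute
kafka reassign-partitions --reassignment-json-file ./leadership-evac-rollback.json --verify
kafka preferred-replica-election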

With this info, we should be good to move these two hosts on Tuesday, Nov 25th. As @brouberol has provided the info but these hosts still fall under ServiceOps new, I also want to ping @Clement_Goubert or @CDanis so one of them can be around for the kafka reassignment commands above during the scheduled window. Can you review and advise which would work for you: Tuesday, Nov 25 or next week on Wednesday, December 3rd?

Mentioned in SAL (#wikimedia-operations) [2025-11-25T15:50:43Z] <claime> Eviction partition leadership from kafka-main1008 - T405950

Mentioned in SAL (#wikimedia-operations) [2025-11-25T16:06:53Z] <claime> Eviction partition leadership from kafka-main1009 - T405950

Both kafka-main100[89] have been moved; the last one to move is wikikube-ctrl1003.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-ctrl1003.eqiad.wmnet completed:

  • wikikube-ctrl1003.eqiad.wmnet (PASS)
    • Host wikikube-ctrl1003.eqiad.wmnet depooled from wikikube-eqiad

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-ctrl1003.eqiad.wmnet completed:

  • wikikube-ctrl1003.eqiad.wmnet (PASS)
    • Host wikikube-ctrl1003.eqiad.wmnet pooled in wikikube-eqiad

All ServiceOps hosts have been migrated to the new switch.