
wikikube-worker2[248-331] implementation tracking
Closed, Resolved · Public

Description

wikikube-worker2[248-331] implementation tracking

This task tracks the service implementation of the new ServiceOps host(s) listed in the task description.

Once the linked racking task has been resolved, this task can be implemented.

This sub-task was created/updated at the request of ServiceOps; it is assigned at creation to the 'Sub-team Technical Contact' provided in the initial ordering task.
1.) Extend the hostname globs as appropriate in puppet/manifests/site.pp (a glob sanity-check sketch follows the note at the end of this list)
2.) Run ./add_k8s_node.py --netbox-token $NETBOX_TOKEN --task-id T390859 wikikube-worker2{248..330}.codfw.wmnet
3.) Run the reimage cookbook:

# Reimage each new worker in turn (wikikube-worker2331 is intentionally excluded; see the note below):
for n in {2248..2330}; do
  sudo cookbook sre.hosts.reimage -t T390859 --os bookworm "wikikube-worker${n}"
done

4.) Update Netbox (remember to run homer afterwards and !log your action in #wikimedia-operations):
./add_k8s_node.py --netbox-token $NETBOX_TOKEN --netbox-commit --task-id T390859 wikikube-worker2{248..330}.codfw.wmnet
5.) Pool the new nodes:
sudo cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-codfw -t T390859 pool wikikube-worker2[248-330].codfw.wmnet
6.) Depool the following hosts and create the decom task for them:
wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101,2116-2123,2216-2241].codfw.wmnet

Note: Intentionally leaving out wikikube-worker2331 for @elukey because of SuperMicro firmware issues.
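For step 1, it's worth sanity-checking that the site.pp glob covers exactly the intended range. A minimal bash sketch, assuming a hypothetical regex (the real entry in puppet/manifests/site.pp will differ):

# Hypothetical glob; if correct, this prints exactly wikikube-worker2248..2330
for n in {2240..2340}; do
  host="wikikube-worker${n}.codfw.wmnet"
  [[ "${host}" =~ ^wikikube-worker2(24[89]|2[5-9][0-9]|3[0-2][0-9]|330)\.codfw\.wmnet$ ]] && echo "${host}"
done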


Event Timeline

Given the nodes should already have the correct partman recipe applied, a reimage should not be required; a role change and a Puppet run are enough in that case.
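A minimal sketch of that no-reimage path, assuming the standard cumin entry point and the run-puppet-agent wrapper (after merging the role change):

# Apply the new role to all new workers without reimaging
sudo cumin 'wikikube-worker2[248-330].codfw.wmnet' 'run-puppet-agent'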

Clement_Goubert triaged this task as Medium priority.

Tagging @jasmine_ as primary for writing the patches, but we should definitely split the actual reimages.

Change #1181753 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] wikikube: Add wikikube-worker2[248-330]

https://gerrit.wikimedia.org/r/1181753

AFAICT, @JMeybohm is correct, no reimages are required. We just need to proceed according to the docs, in short:

  1. set BGP = True in Netbox -- for inspiration for automating this, see e.g. serviceops-kitchensink/add_k8s_node.py#L130 and run homer (a sketch of the Netbox call follows this list)
  2. merge the above patch and run puppet on
    • Docker-registry
    • current nodes
    • new nodes
  3. check that the nodes are uncordoned
  4. pool the nodes with sudo cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-codfw -t T390859 pool wikikube-worker2[248-330].codfw.wmnet
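For item 1, a rough sketch of the equivalent call against the Netbox REST API; the "bgp" custom-field name is an assumption here, and add_k8s_node.py remains the authoritative implementation:

# Assumes a device custom field literally named "bgp" -- verify against add_k8s_node.py
for n in {2248..2330}; do
  id=$(curl -s -H "Authorization: Token ${NETBOX_TOKEN}" \
    "https://netbox.wikimedia.org/api/dcim/devices/?name=wikikube-worker${n}" | jq -r '.results[0].id')
  curl -s -X PATCH -H "Authorization: Token ${NETBOX_TOKEN}" -H "Content-Type: application/json" \
    -d '{"custom_fields": {"bgp": true}}' \
    "https://netbox.wikimedia.org/api/dcim/devices/${id}/" >/dev/null
done
# Afterwards, run homer against the affected switches, e.g. (device glob as in the SAL entry below):
# homer 'lsw1-*-codfw*' commit 'T390859'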

@jasmine_ has already prepared the patch, I suggest that the two of us pair on this, with @jasmine_ taking the lead.

In T390859#11281032, @kamila wrote:

> AFAICT, @JMeybohm is correct, no reimages are required. We just need to proceed according to the docs, in short:
>
>   1. set BGP = True in Netbox -- for inspiration for automating this, see e.g. serviceops-kitchensink/add_k8s_node.py#L130 and run homer

You can also run the script ./add_k8s_node.py --netbox-token $NETBOX_TOKEN --netbox-commit --task-id T390859 wikikube-worker2{248..330}.codfw.wmnet with a valid Netbox token, and it'll do it automatically and tell you what homer commands to run.

The first invocation from the task description can also probably be used to get the baseline puppet patch, although I'd have to test that again.

>   2. merge the above patch and run puppet on
>     • Docker-registry
>     • current nodes
>     • new nodes

Yep. Current nodes + control planes.

>   3. check that the nodes are uncordoned
>   4. pool the nodes with sudo cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-codfw -t T390859 pool wikikube-worker2[248-330].codfw.wmnet

pool-depool-node will uncordon as well.
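For the uncordon check in step 3, a quick kubectl sketch (assuming access to the cluster's kubeconfig):

# Cordoned nodes show SchedulingDisabled in their STATUS column; expect no output for the new workers
kubectl get nodes | grep SchedulingDisabled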

> @jasmine_ has already prepared the patch, I suggest that the two of us pair on this, with @jasmine_ taking the lead.

Good idea, ping me if needed. I suggest doing the pooling/depooling of the corresponding future decoms in batches, though, so we don't fluctuate the cluster capacity too much.

> You can also run the script ./add_k8s_node.py --netbox-token $NETBOX_TOKEN --netbox-commit --task-id T390859 wikikube-worker2{248..330}.codfw.wmnet with a valid Netbox token, and it'll do it automatically and tell you what homer commands to run.
>
> The first invocation from the task description can also probably be used to get the baseline puppet patch, although I'd have to test that again.

My bad, a PEBKAC made me think that that doesn't work. It does.

> Yep. Current nodes + control planes.
>
> pool-depool-node will uncordon as well.

Good points, thanks!

> Good idea, ping me if needed. I suggest doing the pooling/depooling of the corresponding future decoms in batches, though, so we don't fluctuate the cluster capacity too much.

Will do, thanks!

For decoms/node removal, I agree that smaller batches are a good idea. For adding new nodes, I was at first thinking the same, but I can't come up with a reason why adding a lot of nodes would actually break something. Let me know if I'm forgetting something, otherwise I'll find out :D

In T390859#11281084, @kamila wrote:

> For decoms/node removal, I agree that smaller batches are a good idea. For adding new nodes, I was at first thinking the same, but I can't come up with a reason why adding a lot of nodes would actually break something. Let me know if I'm forgetting something, otherwise I'll find out :D

I wasn't really thinking it would break something, but more about the general rule that added capacity gets used, and pulling it back can be difficult. I don't think we would end up in this situation, given we control most of the replica counts etc., but process-wise I think it's good to keep in mind.
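Sketched out, the interleaving this suggests matches the pattern in the cookbook runs below (batch boundaries are illustrative):

sudo cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-codfw -t T390859 pool wikikube-worker[2248-2267].codfw.wmnet
sudo cookbook sre.k8s.pool-depool-node --k8s-cluster wikikube-codfw -t T390859 depool wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet
# ...continue alternating pool/depool batches until both host lists are exhausted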

Mentioned in SAL (#wikimedia-operations) [2025-10-29T19:12:09Z] <jasmine> homer on multiple lsw1-*-codfw* 'T390859'

Change #1181753 merged by Jasmine:

[operations/puppet@production] wikikube: Add wikikube-worker2[248-330]

https://gerrit.wikimedia.org/r/1181753

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 pool for host wikikube-worker[2248-2267].codfw.wmnet completed:

  • wikikube-worker[2248-2267].codfw.wmnet (PASS)
    • Host wikikube-worker[2248-2267].codfw.wmnet pooled in wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 depool for host wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet completed:

  • wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet (PASS)
    • Host wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet depooled from wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 pool for host wikikube-worker[2268-2287].codfw.wmnet completed:

  • wikikube-worker[2268-2287].codfw.wmnet (PASS)
    • Host wikikube-worker[2268-2287].codfw.wmnet pooled in wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 depool for host wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet completed:

  • wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet (PASS)
    • Host wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet depooled from wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 pool for host wikikube-worker[2288-2299].codfw.wmnet completed:

  • wikikube-worker[2288-2299].codfw.wmnet (PASS)
    • Host wikikube-worker[2288-2299].codfw.wmnet pooled in wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 depool for host wikikube-worker[2230-2241].codfw.wmnet completed:

  • wikikube-worker[2230-2241].codfw.wmnet (PASS)
    • Host wikikube-worker[2230-2241].codfw.wmnet depooled from wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 pool for host wikikube-worker[2300-2319].codfw.wmnet completed:

  • wikikube-worker[2300-2319].codfw.wmnet (PASS)
    • Host wikikube-worker[2300-2319].codfw.wmnet pooled in wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 pool for host wikikube-worker[2320-2330].codfw.wmnet completed:

  • wikikube-worker[2320-2330].codfw.wmnet (PASS)
    • Host wikikube-worker[2320-2330].codfw.wmnet pooled in wikikube-codfw

Cookbook cookbooks.sre.k8s.pool-depool-node started by jasmine@cumin1002 depool for host wikikube-worker[2116-2123,2216-2230].codfw.wmnet completed:

  • wikikube-worker[2116-2123,2216-2230].codfw.wmnet (PASS)
    • Host wikikube-worker[2116-2123,2216-2230].codfw.wmnet depooled from wikikube-codfw

Hosts have been added to the cluster :) thanks for the reviews @Raine & @Clement_Goubert!

Decom tracking tasks:

T409102 wikikube-worker[2003-2004,2007-2010,2019-2032,2040,2043,2045,2048].codfw.wmnet
T409103 wikikube-worker[2052-2054,2063,2079-2084,2096-2101].codfw.wmnet
T409104 wikikube-worker[2116-2123,2216-2241].codfw.wmnet