
Wikikube CPU capacity issue
Closed, ResolvedPublic

Description

We're encountering issues deploying low-replica releases (canary and mw-debug) of mw-on-k8s.

0/22 nodes are available: 16 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate.

This is because the sum of CPU requests from the deployments on wikikube exceeds the cluster's available CPU capacity.
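In essence, the scheduler's fit check sums the CPU requests already placed on a node and rejects a pod whose request no longer fits, producing the "Insufficient cpu" message above. A minimal sketch of that check, with purely illustrative numbers (not actual wikikube figures):

```python
# Minimal sketch of the kube-scheduler CPU fit check that yields
# "Insufficient cpu". All numbers are illustrative, not real wikikube data.

def fits(node_allocatable_mcpu: int, already_requested_mcpu: int,
         new_pod_request_mcpu: int) -> bool:
    """True if the node still has enough unreserved CPU (millicores)."""
    return already_requested_mcpu + new_pod_request_mcpu <= node_allocatable_mcpu

# A node with 46 allocatable cores and 45.5 cores already requested
# cannot take a pod requesting 1 more core:
print(fits(46_000, 45_500, 1_000))  # → False
```

Note that this is about *requests*, not actual usage: a cluster can be mostly idle and still refuse new pods once the requested CPU is fully booked.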

This blocks T342748: mw-on-k8s app container CPU throttling at low average load, whose remediation raises requests for mw-on-k8s releases.
It has been emergency-mitigated by artificially lowering the requests for canary releases and the mw-debug deployment in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/944229

This is not a permanent solution, however. In the absence of T264625: Deploy kube-state-metrics (for more precise data) and T342533: Q1:rack/setup/install kubernetes10[27-56] (for new hardware), we should remediate by re-imaging a few servers from the appserver cluster as kubernetes workers.

Currently, the eqiad wikikube cluster runs 15 more pods than codfw, so a couple of extra nodes for eqiad compared to codfw seems justified.
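As a rough back-of-the-envelope check (per-pod request and per-node capacity below are illustrative assumptions, not actual wikikube figures), 15 extra pods do translate into about two extra worker nodes:

```python
import math

# Rough sizing sketch: how many extra nodes 15 additional pods imply.
# Both figures below are assumptions for illustration only.
extra_pods = 15
pod_request_mcpu = 5_000        # assume ~5 CPU requested per mw pod
node_allocatable_mcpu = 46_000  # assume ~46 allocatable cores per worker

extra_nodes = math.ceil(extra_pods * pod_request_mcpu / node_allocatable_mcpu)
print(extra_nodes)  # → 2
```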

Event Timeline

Appservers mw145[1-2].eqiad.wmnet are to be renamed and reimaged as kubernetes102[5-6].eqiad.wmnet.

Mentioned in SAL (#wikimedia-operations) [2023-08-02T12:09:09Z] <claime> Depool mw1451 and mw1452 for reimage as wikikube nodes - T343306

Mentioned in SAL (#wikimedia-operations) [2023-08-02T12:13:27Z] <claime> Repool mw1451 and mw1452, more recent servers will be used - T343306

mw145[1-2].eqiad.wmnet are actually an older generation than what we want for wikikube; I will take mw1497 and mw1498 instead.

Mentioned in SAL (#wikimedia-operations) [2023-08-02T12:19:12Z] <claime> Depool mw1497 and mw1498 for reimage as wikikube nodes - T343306

Change 944893 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] dsh: Remove mw1497 and mw1498 from appserver

https://gerrit.wikimedia.org/r/944893

Change 944895 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] site.pp: Rename mw149[7-8] to kubernetes102[5-6]

https://gerrit.wikimedia.org/r/944895

Change 944893 merged by Clément Goubert:

[operations/puppet@production] dsh: Remove mw1497 and mw1498 from appserver

https://gerrit.wikimedia.org/r/944893

Mentioned in SAL (#wikimedia-operations) [2023-08-02T13:56:54Z] <claime> Decomissioning mw1497 and mw1498 - T343306

cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: mw[1497-1498].eqiad.wment

  • mw1497.eqiad.wment (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: Confirmation manually aborted
  • mw1498.eqiad.wment (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: Confirmation manually aborted
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: mw[1497-1498].eqiad.wmnet

  • mw1497.eqiad.wmnet (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.2.229
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw1498.eqiad.wmnet (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.2.230
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

The failures above are due to the bad first run of the cookbook, caused by operator (me) error: the hosts were given as mw[1497-1498].eqiad.wment instead of .wmnet. The hosts will be wiped by the reimage that follows shortly.

Network netbox changes done:

kubernetes1025: 2013339101888  lsw1-f3-eqiad (WMF11409) ge-0/0/26
kubernetes1026: 2013339101893  lsw1-f3-eqiad (WMF11409) ge-0/0/27

Will proceed with reimage tomorrow once https://gerrit.wikimedia.org/r/c/operations/puppet/+/944895 is merged.

Mentioned in SAL (#wikimedia-operations) [2023-08-03T09:03:43Z] <claime> Deploying rename changes for mw149[7-8] to kubernetes102[5-6] - T343306

Change 944895 merged by Clément Goubert:

[operations/puppet@production] Rename mw149[7-8] to kubernetes102[5-6]

https://gerrit.wikimedia.org/r/944895

Change 945547 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/homer/public@master] Add kubernetes102[5,6] to its k8s_neighbors list

https://gerrit.wikimedia.org/r/945547

Change 945547 merged by jenkins-bot:

[operations/homer/public@master] Add kubernetes102[5,6] to its k8s_neighbors list

https://gerrit.wikimedia.org/r/945547

Mentioned in SAL (#wikimedia-operations) [2023-08-03T15:02:17Z] <claime> Run homer on lsw1-f3-eqiad for kubernetes102[5-6] imaging - T343306

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1025.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2023-08-03T16:13:31Z] <cgoubert@cumin1001> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Rename kubernetes10[25-26] - cgoubert@cumin1001 - T343306"

Mentioned in SAL (#wikimedia-operations) [2023-08-03T16:14:28Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Rename kubernetes10[25-26] - cgoubert@cumin1001 - T343306"

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1025.eqiad.wmnet with OS bullseye completed:

  • kubernetes1025 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308031619_cgoubert_2185170_kubernetes1025.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet,puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host kubernetes1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host kubernetes1026.eqiad.wmnet with OS bullseye completed:

  • kubernetes1026 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308040734_cgoubert_2361820_kubernetes1026.out
    • Unable to run puppet on config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet,puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

root@deploy1002:~# kubectl get nodes | grep -E '102(5|6)'
kubernetes1025.eqiad.wmnet   Ready    <none>   16h    v1.23.14
kubernetes1026.eqiad.wmnet   Ready    <none>   70m    v1.23.14

Nodes ready, I'll revert the requests trickery on Monday.

Requests reverted, resolving.