setup/install kubestage100[34]
Closed, Resolved (Public)

Description

New nodes kubestage100[34] have been handed over by DC-Ops and need to be set up and added to the cluster.

These are replacements for kubestage100[12], so those need to be decommissioned afterwards.

I don't think we have proper documentation on how to do that (in a Kubernetes context). That should be an outcome of this as well.
We actually had a bit of documentation from the "Create a new Cluster" perspective. I moved that out and extended it a bit (but I might not have caught every step/aspect): https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes

After this is done, we should remove the workaround for mediawiki images in CI: T284628

Event Timeline

Want help with this? I / we could put the OS and role on them, check for any puppet issues, then meet with you to go through the actual adding-to-the-cluster part in a shared session. We'd take notes for docs.

Change 739857 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] site: include new k8s hosts on kubestage group

https://gerrit.wikimedia.org/r/739857

Change 739879 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/homer/public@master] sites: add new kubestage nodes

https://gerrit.wikimedia.org/r/739879

Change 739857 merged by AOkoth:

[operations/puppet@production] site: include new k8s hosts on kubestage group

https://gerrit.wikimedia.org/r/739857

Change 740314 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] hieradata: add kubestage bgp peers

https://gerrit.wikimedia.org/r/740314

Change 740314 merged by AOkoth:

[operations/puppet@production] hieradata: add kubestage bgp peers

https://gerrit.wikimedia.org/r/740314

Change 739879 merged by jenkins-bot:

[operations/homer/public@master] sites: add new kubestage nodes

https://gerrit.wikimedia.org/r/739879

Mentioned in SAL (#wikimedia-operations) [2021-11-23T09:57:34Z] <jayme> cordoned kubestage1001.eqiad.wmnet kubestage1002.eqiad.wmnet - T293729

Mentioned in SAL (#wikimedia-operations) [2021-11-23T11:05:42Z] <jayme> uncordoned kubestage1001.eqiad.wmnet kubestage1002.eqiad.wmnet (we have issues with POD IP prefix allocation) - T293729

Mentioned in SAL (#wikimedia-operations) [2021-11-23T11:05:55Z] <jayme> cordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet (we have issues with POD IP prefix allocation) - T293729

Mentioned in SAL (#wikimedia-operations) [2021-11-25T14:25:08Z] <jayme> uncordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet - T293729
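
For the docs, the cordon/uncordon steps logged above can be sketched roughly as follows. This is only an illustration, not the procedure that was actually run: `expand_hosts` and `kubectl_commands` are hypothetical helper names, the default domain is taken from the SAL entries, and the bracket expansion only handles a single set of one-character alternatives such as `kubestage100[34]` (no ranges like `[1-4]`). The sketch only builds the `kubectl cordon` / `kubectl uncordon` command lines and does not execute anything.

```python
import re


def expand_hosts(pattern: str, domain: str = "eqiad.wmnet") -> list[str]:
    """Expand shorthand like 'kubestage100[34]' into fully qualified host names.

    Hypothetical helper: only one bracket group of single characters is
    supported; a pattern without brackets is returned as a single host.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [f"{pattern}.{domain}"]
    prefix, choices, suffix = match.groups()
    return [f"{prefix}{choice}{suffix}.{domain}" for choice in choices]


def kubectl_commands(action: str, hosts: list[str]) -> list[list[str]]:
    """Build one 'kubectl cordon' or 'kubectl uncordon' command per node."""
    if action not in ("cordon", "uncordon"):
        raise ValueError(f"unsupported action: {action}")
    return [["kubectl", action, host] for host in hosts]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for command in kubectl_commands("cordon", expand_hosts("kubestage100[12]")):
        print(" ".join(command))
```

Printing instead of executing keeps the sketch harmless; anyone adapting it would need to check it against the cluster's actual kubectl context and the Add_or_remove_nodes page.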

@Arnoldokoth the new nodes now have an IPAM block assigned (I moved some test workload there to verify). From my POV you can continue with this when you have time (decom kubestage100[12]).

Change 748781 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/deployment-charts@master] changeprop: increase memory limit for staging

https://gerrit.wikimedia.org/r/748781

Change 748781 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: increase memory limit for staging

https://gerrit.wikimedia.org/r/748781

cookbooks.sre.hosts.decommission executed by aokoth@cumin1001 for hosts: kubestage1001.eqiad.wmnet

  • kubestage1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by aokoth@cumin1001 for hosts: kubestage1002.eqiad.wmnet

  • kubestage1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Change 751752 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] kubernetes: remove kubestage1001 & kubestage1002

https://gerrit.wikimedia.org/r/751752

Change 751754 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/homer/public@master] kubernetes: remove kubestage1001 & kubestage1002

https://gerrit.wikimedia.org/r/751754

Change 751976 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/dns@master] kubernetes: point to new kubestage node

https://gerrit.wikimedia.org/r/751976

Change 751976 merged by AOkoth:

[operations/dns@master] kubernetes: point to new kubestage node

https://gerrit.wikimedia.org/r/751976

Change 751752 merged by AOkoth:

[operations/puppet@production] kubernetes: remove kubestage1001 & kubestage1002

https://gerrit.wikimedia.org/r/751752

Change 751754 merged by AOkoth:

[operations/homer/public@master] kubernetes: remove kubestage1001 & kubestage1002

https://gerrit.wikimedia.org/r/751754

akosiaris added a subscriber: akosiaris.

This has been done, resolving!