
setup/install kubestage100[34]
Open, Needs Triage, Public

Description

New nodes kubestage100[34] have been handed over by DC-Ops and need to be set up and added to the cluster.

These are replacements for kubestage100[12], so those need to be decommissioned afterwards.

I don't think we have proper documentation on how to do that (in a Kubernetes context). That should be an outcome of this task as well.
We actually had a bit of documentation from the "Create a new Cluster" perspective. I moved that out and extended it a bit (but I might not have caught every step/aspect): https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/Add_or_remove_nodes

After this is done, we should remove the workaround for mediawiki images in CI: T284628

Event Timeline

Want help with this? I / we could put the OS and role on it, see if there are any Puppet issues, then meet with you to go through the actual adding-to-the-cluster part in a shared session. We'd also take notes for docs.

Change 739857 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] site: include new k8s hosts on kubestage group

https://gerrit.wikimedia.org/r/739857

Change 739879 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/homer/public@master] sites: add new kubestage nodes

https://gerrit.wikimedia.org/r/739879

Change 739857 merged by AOkoth:

[operations/puppet@production] site: include new k8s hosts on kubestage group

https://gerrit.wikimedia.org/r/739857

Change 740314 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] hieradata: add kubestage bgp peers

https://gerrit.wikimedia.org/r/740314

Change 740314 merged by AOkoth:

[operations/puppet@production] hieradata: add kubestage bgp peers

https://gerrit.wikimedia.org/r/740314

Change 739879 merged by jenkins-bot:

[operations/homer/public@master] sites: add new kubestage nodes

https://gerrit.wikimedia.org/r/739879

Mentioned in SAL (#wikimedia-operations) [2021-11-23T09:57:34Z] <jayme> cordoned kubestage1001.eqiad.wmnet kubestage1002.eqiad.wmnet - T293729

Mentioned in SAL (#wikimedia-operations) [2021-11-23T11:05:42Z] <jayme> uncordoned kubestage1001.eqiad.wmnet kubestage1002.eqiad.wmnet (we have issues with POD IP prefix allocation) - T293729

Mentioned in SAL (#wikimedia-operations) [2021-11-23T11:05:55Z] <jayme> cordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet (we have issues with POD IP prefix allocation) - T293729

Mentioned in SAL (#wikimedia-operations) [2021-11-25T14:25:08Z] <jayme> uncordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet - T293729

@Arnoldokoth the new nodes now have an IPAM block assigned (I moved some test workload there to verify). From my POV you can continue with this when you have time (decom kubestage100[12]).
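
For the remaining decom step, here is a rough sketch of the Kubernetes side only. It is not the documented procedure (the wiki page linked in the description is the reference). It expands the `kubestage100[12]` shorthand into full hostnames and prints the usual cordon / drain / delete sequence for review. The helper names `expand_hosts` and `decommission_commands` are made up for illustration, the `eqiad.wmnet` domain is taken from the SAL entries above, and the exact `kubectl drain` flags needed on this cluster are an assumption. Nothing here covers the host-level decommissioning that follows.

```python
import re
from typing import List


def expand_hosts(pattern: str, domain: str = "eqiad.wmnet") -> List[str]:
    """Expand a one-character bracket class, e.g. 'kubestage100[12]'.

    Hypothetical helper: only handles a single [..] group of single characters.
    """
    match = re.fullmatch(r"(.*)\[(\w+)\](.*)", pattern)
    if match is None:
        # No bracket group: treat the pattern as a single short hostname.
        return [f"{pattern}.{domain}"]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{char}{suffix}.{domain}" for char in chars]


def decommission_commands(hosts: List[str]) -> List[str]:
    """Build (but do not run) the kubectl commands for each node."""
    commands = []
    for host in hosts:
        # Stop new pods from being scheduled on the node.
        commands.append(f"kubectl cordon {host}")
        # Evict running pods; the flag choice here is an assumption.
        commands.append(f"kubectl drain --ignore-daemonsets {host}")
        # Remove the node object from the cluster.
        commands.append(f"kubectl delete node {host}")
    return commands


if __name__ == "__main__":
    for command in decommission_commands(expand_hosts("kubestage100[12]")):
        print(command)
```

The script only prints the commands, so they can be checked against the wiki page before anything is run by hand.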