Create dse-k8s-etcd cluster in codfw
Closed, Resolved (Public)

Description

We are building a dse-k8s-codfw Kubernetes cluster, so we will need an etcd cluster for it.

This will comprise three Ganeti VMs, just like the cluster in eqiad.

Follow the guidelines here: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#etcd

Note that DRBD should be disabled on the VMs.

Event Timeline

Please use A, B and D for the new nodes. I recently added a feature to the sre.ganeti.makevm cookbook to directly request plain storage by passing "--storage_type plain". Previously it always used DRBD, and the disk type then had to be changed retroactively.

Change #1167209 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add the new dse-k8s hosts to site.pp so that we can create the VMs

https://gerrit.wikimedia.org/r/1167209

Change #1167209 merged by Btullis:

[operations/puppet@production] Add the new dse-k8s hosts to site.pp so that we can create the VMs

https://gerrit.wikimedia.org/r/1167209

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-etcd2001.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-etcd2001.codfw.wmnet with OS bookworm completed:

  • dse-k8s-etcd2001 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507081421_btullis_816983_dse-k8s-etcd2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-etcd2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-etcd2002.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-etcd2003.codfw.wmnet with OS bookworm completed:

  • dse-k8s-etcd2003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507081521_btullis_836567_dse-k8s-etcd2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-etcd2002.codfw.wmnet with OS bookworm completed:

  • dse-k8s-etcd2002 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507081531_btullis_834139_dse-k8s-etcd2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@Stevemunene - You'll see that I provisioned these three VMs for you, to save a bit of time. I was able to use the --storage_type plain option when creating them, so they are ready to go.
I think that you can follow the guidelines at https://wikitech.wikimedia.org/wiki/Etcd to make the cluster itself, with reference to https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/etcd/v3/dse_k8s_etcd.yaml

Change #1170364 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/dns@master] dns: Add dse-k8s codfw urls

https://gerrit.wikimedia.org/r/1170364

We already have https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/etcd/v3/dse_k8s_etcd.yaml#L4 in the shared role hiera, and for the bootstrap this needs to be profile::etcd::v3::cluster_bootstrap: true

Could we set the cluster up as a new one, dse_k8s_etcd_codfw? And could we also have a Kubernetes pod IP delegation?

I think that the simplest way to handle this is to override the profile::etcd::v3::cluster_bootstrap value at the host level in hiera, while you are bootstrapping the cluster.

If we look at one of the other similar clusters, e.g. aux-k8s-etcd, we can see that this was the approach taken:

This patch set the three hosts into bootstrap mode: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117219
Then it was reverted in: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117235
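
As a rough illustration of the host-level override described above, the hiera for one of the new hosts might look like the sketch below. The file path is an assumption based on the usual hieradata/hosts layout in operations/puppet and is not taken from the linked patches; only the key name comes from this task.

```yaml
# Hypothetical file: hieradata/hosts/dse-k8s-etcd2001.yaml (the path is an assumption)
# Temporary override, intended only while the cluster is being bootstrapped.
profile::etcd::v3::cluster_bootstrap: true
```

The same override would presumably be needed for each of the three hosts, and removed again once the cluster has formed, as in the revert linked above.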

Hope that helps.

Change #1170514 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: bootstrap dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1170514

Change #1170364 merged by Stevemunene:

[operations/dns@master] dns: Add dse-k8s codfw SRV records

https://gerrit.wikimedia.org/r/1170364

Change #1170514 merged by Stevemunene:

[operations/puppet@production] dse-k8s: bootstrap dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1170514

Change #1171584 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: deploy etcd service

https://gerrit.wikimedia.org/r/1171584

Change #1171592 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/dns@master] dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/1171592

Change #1171592 merged by Stevemunene:

[operations/dns@master] dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/1171592

Change #1171584 merged by Stevemunene:

[operations/puppet@production] dse-k8s: deploy etcd service

https://gerrit.wikimedia.org/r/1171584

Change #1172617 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add dse-k8s-codfw site

https://gerrit.wikimedia.org/r/1172617

Change #1172617 merged by Stevemunene:

[operations/puppet@production] Add dse-k8s-codfw site definition

https://gerrit.wikimedia.org/r/1172617

Change #1172619 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add dse-k8s-codfw etcd configuration

https://gerrit.wikimedia.org/r/1172619

Change #1172619 merged by Bking:

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw k8s configuration

https://gerrit.wikimedia.org/r/1172619

Change #1173914 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw etcd cluster configuration

https://gerrit.wikimedia.org/r/1173914

Change #1173914 merged by Stevemunene:

[operations/puppet@production] dse-k8s: Add dse-k8s-codfw etcd cluster configuration

https://gerrit.wikimedia.org/r/1173914

Change #1178526 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: bootstrap dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1178526

Change #1178526 merged by Stevemunene:

[operations/puppet@production] dse-k8s: bootstrap dse-k8s-codfw cluster

https://gerrit.wikimedia.org/r/1178526

Change #1178534 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] dse-k8s: disable dse-k8s-codfw bootstrap

https://gerrit.wikimedia.org/r/1178534

Change #1178534 merged by Stevemunene:

[operations/puppet@production] dse-k8s: disable dse-k8s-codfw bootstrap

https://gerrit.wikimedia.org/r/1178534

The etcd cluster is bootstrapped. Moving on to creating the helmfile.d/admin_ng structure required to bootstrap the dse-k8s-codfw cluster, in T397297.

The following steps were followed to bootstrap the etcd cluster.

  1. Deploying the SRV DNS records so that the nodes can discover each other (1170364: dns: Add dse-k8s codfw SRV records), and adding a VIP for dse-k8s-ctrl (1171592: dns: Add a VIP for dse-k8s-ctrl.svc.codfw.wmnet)
  2. Adding the hosts to the role(etcd::v3::dse_k8s_etcd)
  3. 1172617: Add dse-k8s-codfw site definition | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172617
  4. 1172619: dse-k8s: Add dse-k8s-codfw k8s configuration | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1172619
  5. Bootstrapping the cluster by assigning the profile profile::etcd::v3 to the servers' role. The cluster_bootstrap value must be set to true for the initial run (profile::etcd::v3::cluster_bootstrap: true), which was done in 1178526: dse-k8s: bootstrap dse-k8s-codfw cluster | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178526

Once merged, run puppet on the new hosts, then set the bootstrap flag back to false by reverting the change, which was done in 1178534: dse-k8s: disable dse-k8s-codfw bootstrap | https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178534