
Decommission druid100[4-6]
Closed, Resolved, Public

Assigned To
Authored By
BTullis
May 5 2023, 10:03 AM

Description

The druid100[4-6] servers are due to be decommissioned.

T336042: Bring druid10[09-11] into service has been completed, so we shall now proceed with decommissioning the druid100[4-6] hosts. Steps are as per Removing hosts / taking hosts out of service from a cluster, as below:

  • Change the druid-public endpoint used by services
  • Depool the hosts.
    • druid1004
    • druid1005
    • druid1006
  • Using the coordinator web interface, set nodes into decommissioningNodes mode. Once the historical disk cache is drained, the middlemanager is not running any jobs, and the overlord is not targeted by any scheduled jobs, it is safe to stop the services.
    • druid1004
    • druid1005
    • druid1006
  • Remove the hosts from LVS
    • druid1004
    • druid1005
    • druid1006
  • Remove the hosts from druid_public_hosts:
    • druid1004
    • druid1005
    • druid1006
  • Decommission hosts
    • druid1004
    • druid1005
    • druid1006
  • Remove mention of hosts from site.pp
    • druid1004
    • druid1005
    • druid1006
  • Remove keytabs and dummy-keytabs
    • druid1004
    • druid1005
    • druid1006
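
For the decommissioningNodes step, the same change can be made through the coordinator API rather than the web interface. A minimal sketch, assuming the coordinator listens on port 8081 and the historicals on 8083 (both ports are assumptions to verify against the actual cluster config):

```shell
# Druid drains historicals listed under the coordinator dynamic-config key
# "decommissioningNodes". Build the payload for all three hosts:
PAYLOAD='{"decommissioningNodes":["druid1004.eqiad.wmnet:8083","druid1005.eqiad.wmnet:8083","druid1006.eqiad.wmnet:8083"]}'
echo "$PAYLOAD"
# POST it to the coordinator dynamic-config endpoint (not run here):
# curl -X POST -H 'Content-Type: application/json' -d "$PAYLOAD" \
#     http://<coordinator>:8081/druid/coordinator/v1/config
```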

Event Timeline

The steps look good to me, Steve. There's a little duplication, because you have included both:

Drain the middlemanagers

and

Set nodes into decommissioningNodes mode

You can see from here that these are in fact two options for achieving the same thing.

What I would do is use the web interface and set all three nodes into decommissioning mode at once like this.

I would also look at preparing the patch to remove the three hosts from LVS ahead of time, so that we have plenty of time to review it. Once again, this will involve a manual restart of pybal, so I would seek a review from the traffic team and check when is a good time to apply it.


Ack, thanks @BTullis

The hosts druid100[4-6] have been depooled and set into decommissioningNodes mode.

stevemunene@druid1004:~$ sudo decommission
Decommissioning all services on druid1004.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1004.eqiad.wmnet: pooled changed yes => inactive

image.png (840×1 px, 109 KB)

We encountered some challenges with the decommissioning. Due to an issue with their RAID config (which has since been resolved), the newer druid10[09-11] servers have a smaller /srv partition: 1.3T as opposed to 2.7T. The configured segmentCache max is 2.5T, which has resulted in the newer servers filling up the /srv partition, since the allocated maximum exceeds the available space.
Comparison between druid1007 and druid1009

stevemunene@druid1007:~$ lsblk -i
NAME           MAJ:MIN RM  SIZE RO TYPE   MOUNTPOINT
sda              8:0    0  1.8T  0 disk   
|-sda1           8:1    0  285M  0 part   
`-sda2           8:2    0  1.8T  0 part   
  `-md0          9:0    0  3.5T  0 raid10 
    |-vg0-root 253:0    0 74.5G  0 lvm    /
    |-vg0-swap 253:1    0  976M  0 lvm    [SWAP]
    `-vg0-srv  253:2    0  2.7T  0 lvm    /srv
sdb              8:16   0  1.8T  0 disk   
|-sdb1           8:17   0  285M  0 part   
`-sdb2           8:18   0  1.8T  0 part   
  `-md0          9:0    0  3.5T  0 raid10 
    |-vg0-root 253:0    0 74.5G  0 lvm    /
    |-vg0-swap 253:1    0  976M  0 lvm    [SWAP]
    `-vg0-srv  253:2    0  2.7T  0 lvm    /srv
sdc              8:32   0  1.8T  0 disk   
|-sdc1           8:33   0  285M  0 part   
`-sdc2           8:34   0  1.8T  0 part   
  `-md0          9:0    0  3.5T  0 raid10 
    |-vg0-root 253:0    0 74.5G  0 lvm    /
    |-vg0-swap 253:1    0  976M  0 lvm    [SWAP]
    `-vg0-srv  253:2    0  2.7T  0 lvm    /srv
sdd              8:48   0  1.8T  0 disk   
|-sdd1           8:49   0  285M  0 part   
`-sdd2           8:50   0  1.8T  0 part   
  `-md0          9:0    0  3.5T  0 raid10 
    |-vg0-root 253:0    0 74.5G  0 lvm    /
    |-vg0-swap 253:1    0  976M  0 lvm    [SWAP]
    `-vg0-srv  253:2    0  2.7T  0 lvm    /srv
stevemunene@druid1009:~$ lsblk -i
NAME           MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda              8:0    0 894.3G  0 disk   
|-sda1           8:1    0   285M  0 part   
`-sda2           8:2    0   894G  0 part   
  `-md0          9:0    0   1.7T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   1.3T  0 lvm    /srv
sdb              8:16   0 894.3G  0 disk   
|-sdb1           8:17   0   285M  0 part   
`-sdb2           8:18   0   894G  0 part   
  `-md0          9:0    0   1.7T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   1.3T  0 lvm    /srv
sdc              8:32   0 894.3G  0 disk   
|-sdc1           8:33   0   285M  0 part   
`-sdc2           8:34   0   894G  0 part   
  `-md0          9:0    0   1.7T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   1.3T  0 lvm    /srv
sdd              8:48   0 894.3G  0 disk   
|-sdd1           8:49   0   285M  0 part   
`-sdd2           8:50   0   894G  0 part   
  `-md0          9:0    0   1.7T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   1.3T  0 lvm    /srv
sde              8:64   0 894.3G  0 disk   
sdf              8:80   0 894.3G  0 disk   
sdg              8:96   0 894.3G  0 disk   
sdh              8:112  0 894.3G  0 disk
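
The mismatch is easy to quantify from the two listings above (GiB figures rounded from the lsblk output; the 2.5T cache maximum comes from the Druid config):

```shell
SRV_OLD=2700    # /srv on druid1007 (2.7T)
SRV_NEW=1300    # /srv on druid1009 (1.3T)
CACHE_MAX=2500  # configured segmentCache maximum (2.5T)
# The cache limit fits the old hosts but overshoots the new ones:
echo "shortfall on the new hosts: $((CACHE_MAX - SRV_NEW)) GiB"
```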

This means that we shall need to reimage the druid10[09-11] hosts so that they pick up the right config, alongside the hosts that require an OS upgrade. This will begin once the rebalance of the Druid hosts following the removal of druid100[4-6] is done.
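
For reference, the cache limit in question lives in the historical's runtime properties. The path and sizes below are illustrative assumptions, not the actual production values:

```properties
# historical runtime.properties (illustrative values only)
# maxSize per cache location, in bytes (here ~1.2 TiB):
druid.segmentCache.locations=[{"path":"/srv/druid/segments","maxSize":1288490188800}]
# upper bound on total segment bytes the historical will serve:
druid.server.maxSize=1288490188800
```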

Change 975248 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] switch druid host to run data_purge job

https://gerrit.wikimedia.org/r/975248

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye completed:

  • druid1009 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"druid1009.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Add puppet_version metadata to Debian installer
  • Checked BIOS boot parameters are back to normal
  • Host up (new fresh bullseye OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga/Alertmanager
  • Removed previous downtime on Alertmanager (old OS)
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311201126_stevemunene_28358_druid1009.out
  • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1009\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

  • Updated Netbox data from PuppetDB
  • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Reimaged druid1009; the host is up with the right partitions and is rebalancing.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye

Change 975248 merged by Stevemunene:

[operations/puppet@production] switch druid host to run data_purge job

https://gerrit.wikimedia.org/r/975248

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1011.eqiad.wmnet with OS bullseye completed:

  • druid1011 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"druid1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • Host up (Debian installer)
  • Add puppet_version metadata to Debian installer
  • Checked BIOS boot parameters are back to normal
  • Host up (new fresh bullseye OS)
  • Generated Puppet certificate
  • Signed new Puppet certificate
  • Run Puppet in NOOP mode to populate exported resources in PuppetDB
  • Found Nagios_host resource for this host in PuppetDB
  • Downtimed the new host on Icinga/Alertmanager
  • Removed previous downtime on Alertmanager (old OS)
  • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311210752_stevemunene_585193_druid1011.out
  • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
  • Rebooted
  • Automatic Puppet run was successful
  • Forced a re-check of all Icinga services for the host
  • Icinga status is not optimal, downtime not removed
  • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1011\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

  • Updated Netbox data from PuppetDB
  • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

druid1011 is also up with the right partitions, rebalanced, and repooled. Moving on to the OS upgrade of druid100[7-8], now that "switch druid host to index to the druid-public cluster and datahub injestion" is deployed with today's weekly train.

Mentioned in SAL (#wikimedia-analytics) [2023-11-21T15:04:28Z] <stevemunene> pool druid1011 after reimage T336043

stevemunene merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/544

switch druid host to index to the druid-public cluster and datahub injestion.

Mentioned in SAL (#wikimedia-analytics) [2023-11-30T08:28:21Z] <stevemunene> reimage druid1010 to pick up the right raid config and corresponding partman recipe T336043

druid10[09-11] have now all been reimaged with the right RAID config, and we can proceed with the decommission of druid100[4-6] once druid1010 is fully back in the cluster and balanced.

stevemunene@druid1010:~$ lsblk -i
NAME           MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda              8:0    0 894.3G  0 disk   
|-sda1           8:1    0   285M  0 part   
`-sda2           8:2    0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdb              8:16   0 894.3G  0 disk   
|-sdb1           8:17   0   285M  0 part   
`-sdb2           8:18   0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdc              8:32   0 894.3G  0 disk   
|-sdc1           8:33   0   285M  0 part   
`-sdc2           8:34   0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdd              8:48   0 894.3G  0 disk   
|-sdd1           8:49   0   285M  0 part   
`-sdd2           8:50   0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sde              8:64   0 894.3G  0 disk   
|-sde1           8:65   0   285M  0 part   
`-sde2           8:66   0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdf              8:80   0 894.3G  0 disk   
|-sdf1           8:81   0   285M  0 part   
`-sdf2           8:82   0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdg              8:96   0 894.3G  0 disk   
|-sdg1           8:97   0   285M  0 part   
`-sdg2           8:98   0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv
sdh              8:112  0 894.3G  0 disk   
|-sdh1           8:113  0   285M  0 part   
`-sdh2           8:114  0   894G  0 part   
  `-md0          9:0    0   3.5T  0 raid10 
    |-vg0-swap 253:0    0   976M  0 lvm    [SWAP]
    |-vg0-root 253:1    0  74.5G  0 lvm    /
    `-vg0-srv  253:2    0   2.7T  0 lvm    /srv

image.png (726×3 px, 339 KB)

Mentioned in SAL (#wikimedia-analytics) [2023-12-01T11:13:23Z] <stevemunene> pool druid1010 after reimage T336043

Mentioned in SAL (#wikimedia-analytics) [2023-12-05T08:29:40Z] <stevemunene> depool druid10[04-06] T336043

The hosts druid100[4-6] have been depooled and set into decommissioning mode.

image.png (726×1 px, 125 KB)

Druid100[4-6] are now fully drained, and we can proceed with the next steps of the decommissioning process.

image.png (434×2 px, 152 KB)

Mentioned in SAL (#wikimedia-operations) [2024-01-08T19:04:19Z] <sukhe> running authdns-update for CR 988684: T336043

[EDITED, wrong task ID for authdns-update].

Change 974120 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] druid: remove druid100[4-6] from druid_public_broker VIP

https://gerrit.wikimedia.org/r/974120

Change 974120 merged by Ssingh:

[operations/puppet@production] druid: remove druid100[4-6] from druid_public_broker VIP

https://gerrit.wikimedia.org/r/974120

Change 989461 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Remove puppet references for druid1004_6

https://gerrit.wikimedia.org/r/989461

Change 989461 merged by Stevemunene:

[operations/puppet@production] Remove puppet references for druid1004_6

https://gerrit.wikimedia.org/r/989461

Gehel triaged this task as Medium priority. Jan 22 2024, 2:57 PM

Change 992968 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Remove dummy-keytabs for decommissioned druid hosts

https://gerrit.wikimedia.org/r/992968

Change 992968 merged by Stevemunene:

[labs/private@master] Remove dummy-keytabs for decommissioned druid hosts

https://gerrit.wikimedia.org/r/992968

All SRE steps have been completed; the hosts have been decommissioned and handed over to DC Ops for the final step.

Thanks for the decommissions @Stevemunene - I just added the ops-eqiad tag to the subtasks to help make sure that they are seen by the right team.