
Decommission druid100[7-8].eqiad.wmnet
Closed, Resolved · Public

Description

New hosts druid101[2-3] have been onboarded to the cluster, so we can now proceed with the decommissioning. T397441
The steps for this, as per [[ https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Druid#Removing_hosts/_taking_hosts_out_of_service_from_cluster | Removing hosts/ taking hosts out of service from cluster ]], are:

  • Change the Superset Druid public (AQS) connection string.
  • Depool the hosts.
    • druid1007
    • druid1008
  • Using the coordinator web interface, set the nodes into decommissioningNodes mode. Once the historical disk cache is drained, the middlemanager is not running any jobs, and the overlord is not targeted by any scheduled jobs, it is safe to stop the services.
    • druid1007
    • druid1008
  • Remove the hosts from LVS
    • druid1007
    • druid1008
  • Remove the hosts from druid_public_hosts:
    • druid1007
    • druid1008
  • Decommission hosts (hand over to DC Ops, T403801)
    • druid1007
    • druid1008
  • Remove mention of hosts from site.pp
    • druid1007
    • druid1008
  • Remove keytabs and dummy-keytabs
    • druid1007
    • druid1008
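
The decommissioningNodes step in the list above is done through the coordinator web interface. As a rough sketch of what that change amounts to, the snippet below adds hosts to the `decommissioningNodes` list of a coordinator dynamic configuration document. The function name, the sample starting configuration, and port 8083 for the historical service are assumptions for illustration and are not taken from this task.

```python
import copy


def add_decommissioning_nodes(config, nodes):
    """Return a copy of a coordinator dynamic config with `nodes` added
    to its decommissioningNodes list (order kept, no duplicates)."""
    updated = copy.deepcopy(config)
    current = list(updated.get("decommissioningNodes", []))
    for node in nodes:
        if node not in current:
            current.append(node)
    updated["decommissioningNodes"] = current
    return updated


# Hypothetical starting config; a real one would be read from the coordinator.
current_config = {"maxSegmentsToMove": 100, "decommissioningNodes": []}

# Assumption: the historical service listens on port 8083 (the Druid default).
hosts = ["druid1007.eqiad.wmnet:8083", "druid1008.eqiad.wmnet:8083"]

new_config = add_decommissioning_nodes(current_config, hosts)
print(new_config["decommissioningNodes"])
```

The function leaves the input untouched and is idempotent, so running it again with the same hosts does not add duplicates.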

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-analytics) [2025-09-24T07:21:16Z] <stevemunene> change the Druid public (AQS) connection string to druid1011 as we decommission druid1007 T405446

The Druid hosts are done decommissioning; next is removing them from LVS.


Depooled and removed the hosts from LVS

stevemunene@puppetserver1001:~$ sudo confctl select 'service=(druid-public-broker),name=druid1007.eqiad.wmnet' set/pooled=no
eqiad/druid-public/druid-public-broker/druid1007.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: service=(druid-public-broker),name=druid1007.eqiad.wmnet
stevemunene@puppetserver1001:~$ sudo confctl select 'service=(druid-public-broker),name=druid1008.eqiad.wmnet' set/pooled=no
eqiad/druid-public/druid-public-broker/druid1008.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: service=(druid-public-broker),name=druid1008.eqiad.wmnet
stevemunene@puppetserver1001:~$ sudo confctl select dc=eqiad,cluster=druid-public,service=druid-public-broker get
{"druid1009.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1010.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1012.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1013.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1007.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1008.eqiad.wmnet": {"weight": 10, "pooled": "no"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

stevemunene@puppetserver1001:~$ sudo confctl select dc=eqiad,cluster=druid-public,service=druid-public-broker get
{"druid1009.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1010.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1011.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1012.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}
{"druid1013.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

Icinga downtime and Alertmanager silence (ID=ef39a3c3-8f18-4c7a-a254-bfa590b603bc) set by stevemunene@cumin1003 for 2 days, 0:00:00 on 2 host(s) and their services with reason: Decommissioning druid_public hosts

druid[1007-1008].eqiad.wmnet

@Stevemunene is blocked on access keys; someone else has to run the decom cookbook.

Icinga downtime and Alertmanager silence (ID=42df5935-dfe8-456f-a594-adc05d19a266) set by stevemunene@cumin1003 for 2 days, 0:00:00 on 2 host(s) and their services with reason: Decommissioning druid_public hosts

druid[1007-1008].eqiad.wmnet

Removed the keytabs:

stevemunene@krb1002:~$ sudo manage_principals.py list *druid1007*
druid/an-druid1007.eqiad.wmnet@WIKIMEDIA
druid/druid1007.eqiad.wmnet@WIKIMEDIA
stevemunene@krb1002:~$ sudo manage_principals.py list *druid1008*
druid/druid1008.eqiad.wmnet@WIKIMEDIA
stevemunene@krb1002:~$ sudo manage_principals.py delete druid/druid1007.eqiad.wmnet@WIKIMEDIA
Principal successfully deleted.
stevemunene@krb1002:~$ sudo manage_principals.py delete druid/druid1008.eqiad.wmnet@WIKIMEDIA
Principal successfully deleted.
stevemunene@krb1002:~$ sudo manage_principals.py list *druid1008*
stevemunene@krb1002:~$ sudo manage_principals.py list *druid1007*
druid/an-druid1007.eqiad.wmnet@WIKIMEDIA
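
The `*druid1007*` glob in the session above also matches `an-druid1007`, which appears to belong to a different host and is left in place. A small sketch of an exact-host check over such a listing, assuming principals have the form `service/host@REALM` as shown; `principals_for_host` is a hypothetical helper and not part of `manage_principals.py`.

```python
def principals_for_host(listing, fqdn):
    """Return principals whose host part equals `fqdn` exactly.

    Principals are expected in the form service/host@REALM."""
    matches = []
    for line in listing.splitlines():
        principal = line.strip()
        if "/" not in principal or "@" not in principal:
            continue  # skip blank lines and anything not shaped like service/host@REALM
        _service, rest = principal.split("/", 1)
        host, _realm = rest.rsplit("@", 1)
        if host == fqdn:
            matches.append(principal)
    return matches


# Listing for *druid1007* before the delete, copied from the session above.
before = """druid/an-druid1007.eqiad.wmnet@WIKIMEDIA
druid/druid1007.eqiad.wmnet@WIKIMEDIA"""

# Listing for *druid1007* after the delete.
after = "druid/an-druid1007.eqiad.wmnet@WIKIMEDIA"

print(principals_for_host(before, "druid1007.eqiad.wmnet"))
print(principals_for_host(after, "druid1007.eqiad.wmnet"))
```

An empty result for the second call is the condition of interest: no principal remains for the decommissioned host, even though the glob listing is not empty.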