
Refresh Druid nodes (druid100[1-3])
Closed, Resolved · Public

Description

This task is blocked until the related rack/setup/deploy is completed.

Details

Project                       Branch      Lines +/-  Subject
operations/container/miscweb  master      +4 -228
operations/container/miscweb  master      +12 -12
labs/private                  master      +0 -0
operations/deployment-charts  master      +1 -0
operations/deployment-charts  master      +3 -1
operations/puppet             production  +0 -5
operations/deployment-charts  master      +3 -0
operations/deployment-charts  master      +6 -3
operations/puppet             production  +0 -3
operations/puppet             production  +1 -4
analytics/refinery            master      +23 -23
operations/puppet             production  +2 -12
operations/puppet             production  +5 -2
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +8 -1
operations/puppet             production  +10 -6
labs/private                  master      +0 -0
operations/puppet             production  +2 -2
operations/puppet             production  +5 -5
operations/puppet             production  +2 -2

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 710963 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring an-druid1004.eqiad.wmnet into service

https://gerrit.wikimedia.org/r/710963

Change 710961 merged by Btullis:

[operations/puppet@production] Create the druid user and group before installing druid-common

https://gerrit.wikimedia.org/r/710961

Change 710963 merged by Btullis:

[operations/puppet@production] Bring an-druid1004.eqiad.wmnet into service

https://gerrit.wikimedia.org/r/710963

The patch to create the user seems to have worked on the first pass.

Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/Group[druid]/ensure: created
Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/User[druid]/ensure: created
Notice: /Stage[main]/Druid/Package[druid-common]/ensure: created

Ah, no. It didn't work. The directories are still owned by root, so the postinst script didn't set the ownership.

btullis@an-druid1004:~$ ls -ld /var/log/druid
drwxr-xr-x 2 root root 4096 Jul 23  2020 /var/log/druid
btullis@an-druid1004:~$ ls -ld /srv/druid
drwxr-xr-x 7 root root 4096 Aug  9 12:25 /srv/druid

Changed the ownership and restarted the services manually.

btullis@an-druid1004:~$ sudo chown -R druid:druid /var/log/druid /srv/druid
btullis@an-druid1004:~$ sudo systemctl restart druid-broker.service 
btullis@an-druid1004:~$ sudo systemctl restart druid-coordinator.service 
btullis@an-druid1004:~$ sudo systemctl restart druid-historical.service 
btullis@an-druid1004:~$ sudo systemctl restart druid-middlemanager.service 
btullis@an-druid1004:~$ sudo systemctl restart druid-overlord.service

The dashboard now shows an-druid1004 and it is successfully loading segments.

image.png (118 KB)

Change 710979 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Bring an-druid1005.eqiad.wmnet into service

https://gerrit.wikimedia.org/r/710979

Change 710980 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Add dummy keytabs for new druid nodes

https://gerrit.wikimedia.org/r/710980

Change 710980 merged by Btullis:

[labs/private@master] Add dummy keytabs for new druid nodes

https://gerrit.wikimedia.org/r/710980

Change 710979 merged by Btullis:

[operations/puppet@production] Bring an-druid1005.eqiad.wmnet into service

https://gerrit.wikimedia.org/r/710979

Bringing an-druid1005 into service now, with the latest change to the installation of druid.

Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/Group[druid]/ensure: created
Notice: /Stage[main]/Druid::Bigtop::Hadoop::User/User[druid]/ensure: created
Notice: /Stage[main]/Druid/File[/var/log/druid]/ensure: created
Notice: /Stage[main]/Druid/File[/srv/druid]/ensure: created
Notice: /Stage[main]/Druid/Package[druid-common]/ensure: created

All looks to be OK with the segment rebalance.

image.png (122 KB)

I will leave this overnight, then continue tomorrow with the removal of the druid100[1-3] nodes and the switchover of the zookeeper servers.

Hey Ben, great work! A couple of things to remember for the decom:

  1. I am not sure if there is a way to force the overlord/middlemanager to stop accepting indexation jobs, or if we simply have to wait until every host is idle before decomming, but if a regular indexation is running when a decom happens some alerts will fire (and we'll need to re-run jobs etc.; not a big deal, but I am just adding it in here :)). See the sketch after this list for one way to check for running tasks.
  2. Turnilo and Superset are configured to target a specific host's broker (manually set in their configs). For Turnilo the host is listed in puppet; for Superset it is listed in the database config of Druid Analytics. IIRC we should have used an-druid100[1,2] nodes, but let's double-check, otherwise we'll break dashboards when decomming.
  3. We have a special indexation job that pulls continuously from Kafka for Netflow data, indexing on the fly to provide more realtime data in Turnilo. The indexation can be killed and restarted (it follows the lambda architecture, so batch jobs always override the dataset in case of holes etc.). The specs to kick off the job are in refinery, under the druid directory IIRC. At the moment the indexation job may be running on the 3 nodes to decom (replicated 3 times, so it would completely fail only if we decommed all the nodes without restarting the job afterwards). Netengs rely on this realtime data, so I'd suggest pinging them in advance just in case. We also have an alarm for it, so be aware of it. The graph for realtime indexation is in the Druid dashboard if you want to check.
  4. We probably hardcode some druid nodes in both puppet and refinery, so before decomming nodes we'd need to double-check what is used. If nodes to decom are listed in refinery we'd need to patch the jobs with a deployment plus a restart of the oozie coordinators beforehand (otherwise indexations will fail etc.).
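
A minimal sketch of the running-task check mentioned in point 1, assuming the standard Overlord API on port 8090 (endpoint per the upstream Druid docs; run on, or tunnelled to, the active overlord):

curl -s http://localhost:8090/druid/indexer/v1/runningTasks | jq '.[].id'
# An empty result suggests no indexation job is in flight cluster-wide.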

While checking the indexation failures I noticed:

2021-08-10T06:01:39,961 INFO org.apache.druid.indexing.overlord.ForkingTaskRunner: Exception caught during execution
java.io.FileNotFoundException: /srv/druid/indexing-logs/index_hadoop_event_navigationtiming_mggfonkc_2021-08-10T06:00:40.105Z.log (Permission denied)
        at java.io.FileOutputStream.open0(Native Method) ~[?:?]

elukey@an-druid1005:~$ ls -l /srv/druid/
total 20
drwxr-xr-x  2 root  root  4096 Jul 23  2020 deep-storage
drwxr-xr-x  2 root  root  4096 Jul 23  2020 indexing-logs
drwxr-xr-x 24 druid druid 4096 Aug  9 18:15 segment-cache
drwxr-xr-x  5 druid druid 4096 Aug 10 06:26 task
drwxr-xr-x  2 root  root  4096 Jul 23  2020 tmp

Some dirs are still owned by root, probably to be fixed as you did before :(

Change 711103 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the ownership of more druid directories

https://gerrit.wikimedia.org/r/711103

Thanks @elukey. I've addressed the ownership problem on an-druid1005 with:

btullis@an-druid1005:~$ sudo chown druid:druid /srv/druid/deep-storage /srv/druid/indexing-logs /srv/druid/tmp

I've also checked to make sure that these are the only paths I missed, with the following check of the package's file list.

btullis@an-druid1005:~$ egrep "srv/druid|var/log/druid" /var/lib/dpkg/info/druid-common.list
/srv/druid
/srv/druid/deep-storage
/srv/druid/indexing-logs
/srv/druid/tmp
/var/log/druid

I've now made a patch to fix this for future deployments: https://gerrit.wikimedia.org/r/c/operations/puppet/+/711103

Change 711103 merged by Btullis:

[operations/puppet@production] Fix the ownership of more druid directories

https://gerrit.wikimedia.org/r/711103

2. Turnilo and Superset are configured to target a specific host's broker (manually set in their configs). For Turnilo the host is listed in puppet; for Superset it is listed in the database config of Druid Analytics. IIRC we should have used an-druid100[1,2] nodes, but let's double-check, otherwise we'll break dashboards when decomming.

Interesting, thanks. I'll check this out.

4. We probably hardcode some druid nodes in both puppet and refinery, so before decomming nodes we'd need to double-check what is used. If nodes to decom are listed in refinery we'd need to patch the jobs with a deployment plus a restart of the oozie coordinators beforehand (otherwise indexations will fail etc.).

Am I right in thinking that we hardcode the brokers' addresses because we haven't got access to a load-balancer system within the analytics VLAN?

Whereas the druid-public servers use a load-balancer with LVS and DNS service discovery, because the LVS servers can reach them directly. Is that right?

btullis@cumin1001:~$ host druid-public-broker.svc.eqiad.wmnet
druid-public-broker.svc.eqiad.wmnet has address 10.2.2.38

Am I right in thinking that we hardcode the brokers' addresses because we haven't got access to a load-balancer system within the analytics VLAN?

Exactly, yes. The LVS hosts don't have a leg into the analytics VLANs (we always talk about a single VLAN, but in reality there are 4, one for each row), so we cannot create LVS IPs that point to analytics backend hosts.

Whereas the druid-public servers use a load-balancer with LVS and DNS service discovery, because the LVS servers can reach them directly. Is that right?

btullis@cumin1001:~$ host druid-public-broker.svc.eqiad.wmnet
druid-public-broker.svc.eqiad.wmnet has address 10.2.2.38

The Druid public cluster is not in the Analytics VLAN; we chose that road because we needed LVS in front of that cluster (it is one of the backends of AQS), so we used the standard LVS config for it.

This looks like the best way to prevent new jobs from being added to a middlemanager prior to decommissioning: https://druid.apache.org/docs/latest/operations/rolling-updates.html#rolling-restart-graceful-termination-based

This is also referenced here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Safe_restart_of_MiddleManagers_when_running_Real_time_Indexing_jobs
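
The graceful-termination approach boils down to the middlemanager disable/enable API; a sketch of the endpoints (the same disable call is used later in this task, and the enable/enabled endpoints are from the upstream docs):

# Stop the worker accepting new tasks; running tasks are allowed to finish.
curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
# Check the worker's current state.
curl -s http://druid1003.eqiad.wmnet:8091/druid/worker/v1/enabled
# Re-enable the worker afterwards, if the host is staying in service.
curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/enable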

At the moment, an-druid1001 is the active coordinator and overlord. This was ascertained by running SSH tunnels as shown:

coordinator: ssh -N an-druid1001.eqiad.wmnet -L 8081:localhost:8081
overlord: ssh -N an-druid1001.eqiad.wmnet -L 8090:localhost:8090

...then opening http://localhost:8081 and http://localhost:8090 in a browser.
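
Alternatively, the active hosts can be checked without tunnels; the coordinator and overlord expose leader-election endpoints (a sketch based on the upstream Druid API docs):

# Returns the URL of the active coordinator:
curl -s http://an-druid1001.eqiad.wmnet:8081/druid/coordinator/v1/leader
# Returns the URL of the active overlord:
curl -s http://an-druid1001.eqiad.wmnet:8090/druid/indexer/v1/leader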

The wmf_netflow job is running on two of the refreshed nodes and one of the nodes to be decommissioned.

image.png (54 KB)

So in theory it should be safe to do this decommissioning now, but my preferred approach will be to begin the shuffle of zookeeper servers before decommissioning the druid100[1-3] servers.
I think that will likely be the safer option, rather than trying to do the two jobs at once.

Change 711120 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch one zookeeper node in the druid cluster

https://gerrit.wikimedia.org/r/711120

I had put the following process down for the zookeeper switch.

When deploying this change, we will need to:

  • Manually stop zookeeper on druid1001
  • Manually disable zookeeper on druid1001
  • Run puppet with this change on an-druid1001 to start zookeeper
  • Run puppet with this change on cumin1001
  • Roll restart the zookeepers in the druid cluster, so that they pick up the change
  • Verify that the three-node zookeeper ensemble is functioning correctly
  • Repeat with a similar CR for each of the other two zookeeper servers

However, won't we also need to do a rolling restart of all druid processes after each switch? Otherwise they will keep trying to talk to zookeeper servers that will eventually go away.

So I could use a cookbook like this:

sudo cookbook sre.druid.roll-restart-workers --daemons historical,overlord,broker,coordinator analytics

This will exclude the middlemanagers; I could then use the disable API to drain each middlemanager, restart it, and re-enable it, or re-run any jobs that get cancelled. A drain loop could look like the sketch below.
What's the impact of restarting the broker that has been hardcoded into the config for Turnilo and Superset? How quickly does it
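
The drain-restart-re-enable cycle mentioned above could look something like this sketch (the host list is illustrative, and the tasks endpoint is from the upstream MiddleManager API docs):

for host in an-druid100{1..5}.eqiad.wmnet; do
  # Drain: stop accepting new tasks, then wait for the running ones to finish.
  curl -X POST "http://${host}:8091/druid/worker/v1/disable"
  while [ "$(curl -s "http://${host}:8091/druid/worker/v1/tasks" | jq 'length')" -gt 0 ]; do
    sleep 60
  done
  ssh "$host" sudo systemctl restart druid-middlemanager.service
  # Re-enable the worker once it is back up.
  curl -X POST "http://${host}:8091/druid/worker/v1/enable"
done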

As per comments from @elukey on the change request, I'll update the procedure.

  • Disable puppet on druid100[1-3] and an-druid100[1-3]
  • Manually stop zookeeper on druid1001
  • Manually disable zookeeper on druid1001
  • Enable and run puppet with this change on an-druid1001 to start zookeeper
  • Check zookeeper state by looking at logs to make sure that the ensemble of three nodes is up and running.
  • Enable and run puppet on druid1002 to obtain the configuration change
  • Restart the zookeeper service on druid1002 to apply the configuration change
  • Check zookeeper state by looking at logs to make sure that the ensemble of three nodes is up and running.
  • Enable and run puppet on druid1003 to obtain the configuration change
  • Restart the zookeeper service on druid1003 to apply the configuration change
  • Verify that the zookeeper ensemble of three nodes is functioning correctly.
  • Roll restart the druid cluster, optionally disabling each middlemanager component before restarting

Repeat the procedure above twice in order to replace druid1002 and then druid1003.

I can use echo mntr | nc localhost 2181 and check for zk_synced_followers 2 to verify that the ensemble of three servers is healthy each time.

Other four-letter words may be useful too: https://zookeeper.apache.org/doc/r3.4.8/zookeeperAdmin.html#sc_zkCommands
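
For example, to confirm a healthy ensemble from the leader (field names per the linked docs; a sketch):

echo mntr | nc localhost 2181 | grep -E 'zk_server_state|zk_synced_followers'
# Expected on the leader: zk_server_state leader and zk_synced_followers 2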

Updating the deployment plan again.

  • Disable puppet on druid100[1-3] and an-druid100[1-3] and an-launcher1002
  • Disable the following four timers on an-launcher1002
    • eventlogging_to_druid_editattemptstep_hourly.timer
    • eventlogging_to_druid_navigationtiming_hourly.timer
    • eventlogging_to_druid_netflow_hourly.timer
    • eventlogging_to_druid_prefupdate_hourly.timer
  • Disable the following three schedules in Hue
    • webrequest-druid-hourly-coord
    • pageview-druid-hourly-coord
    • edit-hourly-druid-coord
  • Manually stop zookeeper on druid1001
  • Manually disable zookeeper on druid1001
  • Enable and run puppet with this change on an-druid1001 to start zookeeper
  • Check zookeeper state with echo mntr | nc localhost 2181 to make sure that the ensemble of three nodes is up and running.
  • Enable and run puppet on druid1002 to obtain the configuration change
  • Restart the zookeeper service on druid1002 to apply the configuration change
  • Check zookeeper state with echo mntr | nc localhost 2181 to make sure that the ensemble of three nodes is up and running.
  • Enable and run puppet on druid1003 to obtain the configuration change
  • Restart the zookeeper service on druid1003 to apply the configuration change
  • Verify that the zookeeper ensemble of three nodes is functioning correctly.
  • Wait until any in-progress druid ingestions have finished (except wmf_netflow)
  • Roll restart the druid cluster

Repeat the procedure above twice in order to replace druid1002 and then druid1003.

Once complete, re-enable systemd timers and hue jobs that were previously disabled.
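
The disable steps at the top of this plan are roughly the inverse of the re-enable commands recorded later in this task; a sketch using the standard puppet and systemd tooling (the disable message is illustrative):

# On each of druid100[1-3], an-druid100[1-3] and an-launcher1002:
sudo puppet agent --disable 'druid zookeeper switchover'
# On an-launcher1002, stop the four ingestion timers:
for t in editattemptstep navigationtiming netflow prefupdate; do
  sudo systemctl stop "eventlogging_to_druid_${t}_hourly.timer"
  sudo systemctl disable "eventlogging_to_druid_${t}_hourly.timer"
done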

Puppet disabled on all affected hosts.
Systemd timers disabled on an-launcher1002
Schedules disabled in Hue
Zookeeper stopped and disabled on druid1001

btullis@druid1001:~$ sudo systemctl stop zookeeper 
btullis@druid1001:~$ sudo systemctl disable zookeeper
zookeeper.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable zookeeper

Proceeding to merge and deploy the patch.

Change 711120 merged by Btullis:

[operations/puppet@production] Switch one zookeeper node in the druid cluster

https://gerrit.wikimedia.org/r/711120

We had some issues with an-druid1001 joining the existing ensemble. It might be OK as it is, but the advice we have found is that restarting the leader fixes the problem: https://issues.apache.org/jira/browse/ZOOKEEPER-2938

Proceeding with druid1003 next, because the leader is druid1002 and we would like to do that one last.

Change 711155 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch the second zookeeper server in the druid cluster

https://gerrit.wikimedia.org/r/711155

Change 711155 merged by Btullis:

[operations/puppet@production] Switch the second zookeeper server in the druid cluster

https://gerrit.wikimedia.org/r/711155

btullis@druid1003:~$ sudo systemctl stop zookeeper
btullis@druid1003:~$ sudo systemctl disable zookeeper
zookeeper.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable zookeeper
btullis@an-druid1003:~$ sudo puppet agent --enable
btullis@an-druid1003:~$ sudo puppet agent -tv

Change 711161 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Roll back recent change to zookeeper

https://gerrit.wikimedia.org/r/711161

Change 711161 merged by Btullis:

[operations/puppet@production] Roll back recent change to zookeeper

https://gerrit.wikimedia.org/r/711161

We were blocked from running the sre.druid.roll-restart-workers cookbook by a bug, so we went ahead and re-enabled the timers and Hue jobs, given that Druid looks stable.

btullis@an-launcher1002:~$ sudo systemctl enable eventlogging_to_druid_editattemptstep_hourly.timer
Created symlink /etc/systemd/system/multi-user.target.wants/eventlogging_to_druid_editattemptstep_hourly.timer → /lib/systemd/system/eventlogging_to_druid_editattemptstep_hourly.timer.
btullis@an-launcher1002:~$ sudo systemctl enable eventlogging_to_druid_navigationtiming_hourly.timer
Created symlink /etc/systemd/system/multi-user.target.wants/eventlogging_to_druid_navigationtiming_hourly.timer → /lib/systemd/system/eventlogging_to_druid_navigationtiming_hourly.timer.
btullis@an-launcher1002:~$ sudo systemctl enable eventlogging_to_druid_netflow_hourly.timer
Created symlink /etc/systemd/system/multi-user.target.wants/eventlogging_to_druid_netflow_hourly.timer → /lib/systemd/system/eventlogging_to_druid_netflow_hourly.timer.
btullis@an-launcher1002:~$ sudo systemctl enable eventlogging_to_druid_prefupdate_hourly.timer
Created symlink /etc/systemd/system/multi-user.target.wants/eventlogging_to_druid_prefupdate_hourly.timer → /lib/systemd/system/eventlogging_to_druid_prefupdate_hourly.timer.

Continuing the work to complete this zookeeper migration. Currently druid1003 is the leader.
The plan is:

  • Stop relevant systemd timers and suspend relevant hue jobs again.
  • Disable puppet on an-launcher1002 to prevent the systemd timers being re-launched.
  • Carry out a rolling restart of druid, without the use of the cookbook, so that it picks up the intermediate zookeeper configuration.
  • Stop puppet again on all druid nodes
  • Prepare a patch to switch druid1002 to an-druid1002
  • Stop zookeeper on druid1002 and disable it
  • Enable and run puppet on an-druid1002 to start zookeeper and attempt to join the ensemble; check logs, ruok, etc.
  • Enable and run puppet on an-druid1001 to get the new config change. Restart zookeeper so that it picks up the change and recognizes an-druid1002 as an ensemble member; check logs, ruok, etc.
  • Enable and run puppet on druid1003 to get the new config change. Restart zookeeper so that it picks up the change and recognizes an-druid1002 as an ensemble member; check logs, ruok, etc.
  • This will force a re-election at this point.
  • Check for quorum and znodes presence etc with Grafana.

If all is well at this point, perform another rolling restart of druid to pick up the second intermediate zookeeper ensemble.

Then repeat with the third node.

Change 711458 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch the second of the zookeeper nodes

https://gerrit.wikimedia.org/r/711458

Change 711458 merged by Btullis:

[operations/puppet@production] Switch the second of the zookeeper nodes

https://gerrit.wikimedia.org/r/711458

Change 711497 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Migrate the third zookeeper server in the druid cluster

https://gerrit.wikimedia.org/r/711497

Change 711497 merged by Btullis:

[operations/puppet@production] Migrate the third zookeeper server in the druid cluster

https://gerrit.wikimedia.org/r/711497

All three zookeeper servers have been migrated to an-druid100[1-3].
I have re-enabled the systemd timers and resumed the jobs in hue.
Now I can think about decommissioning the druid100[1-3] servers.

Change 711661 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Begin decommission of druid1003.eqiad.wmnet

https://gerrit.wikimedia.org/r/711661

I have disabled the middlemanager on druid1003 with the following command.

btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
{"druid1003.eqiad.wmnet:8091":"disabled"}

Verified that realtime jobs have moved to other middlemanagers.
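
One way to confirm that nothing is still running on the disabled worker (a sketch; the tasks endpoint is from the upstream MiddleManager API docs):

curl -s http://druid1003.eqiad.wmnet:8091/druid/worker/v1/tasks
# Expect an empty list once the realtime tasks have migrated elsewhere.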

Stopped and disabled druid services on druid1003.

sudo systemctl stop druid-broker && sudo systemctl disable druid-broker
sudo systemctl stop druid-coordinator && sudo systemctl disable druid-coordinator
sudo systemctl stop druid-historical && sudo systemctl disable druid-historical
sudo systemctl stop druid-middlemanager && sudo systemctl disable druid-middlemanager
sudo systemctl stop druid-overlord && sudo systemctl disable druid-overlord

Checking load queues on the other servers to see how long it takes to reallocate segments.

Interesting. It is clearly moving segments away from druid1003, but slowly, and I can't tell when it will finish.
I have discovered that I can use the "dynamic configuration API" to tell the coordinator to move segments away from the historical nodes before shutting them down.
https://druid.apache.org/docs/latest/configuration/index.html#dynamic-configuration

btullis@an-druid1001:~$ curl -s http://localhost:8081/druid/coordinator/v1/config | jq .
{
  "millisToWaitBeforeDeleting": 900000,
  "mergeBytesLimit": 524288000,
  "mergeSegmentsLimit": 100,
  "maxSegmentsToMove": 5,
  "replicantLifetime": 15,
  "replicationThrottleLimit": 10,
  "balancerComputeThreads": 1,
  "emitBalancingStats": false,
  "killDataSourceWhitelist": [],
  "killAllDataSources": false,
  "killPendingSegmentsSkipList": [],
  "maxSegmentsInNodeLoadingQueue": 0,
  "decommissioningNodes": [],
  "decommissioningMaxPercentOfMaxSegmentsToMove": 70,
  "pauseCoordination": false
}

I could use a curl command to configure decommissioningNodes but from their own docs...

It is recommended that you use the Coordinator Console to configure these parameters.

Therefore I think I will decommission druid1001.eqiad.wmnet and druid1002.eqiad.wmnet from the GUI.
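
For the record, the curl version would presumably look like this untested sketch, which fetches the current dynamic config, adds the nodes, and POSTs the whole object back:

curl -s http://localhost:8081/druid/coordinator/v1/config \
  | jq '.decommissioningNodes = ["druid1001.eqiad.wmnet:8083","druid1002.eqiad.wmnet:8083"]' \
  | curl -s -X POST -H 'Content-Type: application/json' -d @- \
      http://localhost:8081/druid/coordinator/v1/config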

image.png (132 KB)

btullis@an-druid1001:~$ curl -s http://localhost:8081/druid/coordinator/v1/config|jq .
{
  "millisToWaitBeforeDeleting": 900000,
  "mergeBytesLimit": 524288000,
  "mergeSegmentsLimit": 100,
  "maxSegmentsToMove": 5,
  "replicantLifetime": 15,
  "replicationThrottleLimit": 10,
  "balancerComputeThreads": 1,
  "emitBalancingStats": false,
  "killDataSourceWhitelist": [],
  "killAllDataSources": false,
  "killPendingSegmentsSkipList": [],
  "maxSegmentsInNodeLoadingQueue": 0,
  "decommissioningNodes": [
    "druid1001.eqiad.wmnet:8083",
    "druid1002.eqiad.wmnet:8083"
  ],
  "decommissioningMaxPercentOfMaxSegmentsToMove": 70,
  "pauseCoordination": false
}

Seems to work a treat.

btullis@an-druid1001:/var/log/druid$ tail -f coordinator.log | grep BalanceSegments
2021-08-11T20:06:56,807 INFO org.apache.druid.server.coordinator.duty.BalanceSegments: Found 5 active servers, 2 decommissioning servers
2021-08-11T20:06:56,807 INFO org.apache.druid.server.coordinator.duty.BalanceSegments: Processing 4 segments for moving from decommissioning servers
2021-08-11T20:06:56,827 INFO org.apache.druid.server.coordinator.duty.BalanceSegments: Processing 1 segments for balancing between active servers
2021-08-11T20:06:56,832 INFO org.apache.druid.server.coordinator.duty.BalanceSegments: [_default_tier]: Segments Moved: [4] Segments Let Alone: [1]

Gracefully terminated the two remaining middlemanagers.

btullis@druid1001:~$ curl -X POST http://druid1001.eqiad.wmnet:8091/druid/worker/v1/disable && curl -X POST http://druid1002.eqiad.wmnet:8091/druid/worker/v1/disable

Seems to work a treat.

Really nice! Before closing, can you update https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Removing_hosts/_taking_hosts_out_of_service_from_cluster ?

Please also don't decommission the Overlord on druid1001; it is referenced by a lot of jobs in Refinery, and we need a patch + deployment + restart of the oozie coordinators first :)

Change 712209 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Change preferred Druid coordinator URL

https://gerrit.wikimedia.org/r/712209

Change 711661 merged by Btullis:

[operations/puppet@production] Begin decommission of druid1003.eqiad.wmnet

https://gerrit.wikimedia.org/r/711661

Change 712315 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Decommission druid1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/712315

I notice that the three new hosts are still showing as Staged in Netbox. Can I just set them to Active manually, or is there another step for this?

image.png (585×330 px, 35 KB)

https://netbox.wikimedia.org/search/?q=druid&obj_type=

Manual change of state is ok!

Thanks. I've set them all to active now.

Change 712315 merged by Btullis:

[operations/puppet@production] Decommission druid1002.eqiad.wmnet

https://gerrit.wikimedia.org/r/712315

Change 712209 merged by Joal:

[analytics/refinery@master] Change preferred Druid coordinator URL

https://gerrit.wikimedia.org/r/712209

Change 714013 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Begin decommission of druid1001

https://gerrit.wikimedia.org/r/714013

Change 714013 merged by Btullis:

[operations/puppet@production] Begin decommission of druid1001

https://gerrit.wikimedia.org/r/714013

I have shut down the druid services on druid1001.

btullis@druid1001:~$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord

Checking to see if everything is OK for a while before decommissioning the host.

Change 714015 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove references to old druid servers from site.pp

https://gerrit.wikimedia.org/r/714015

Change 714015 merged by Btullis:

[operations/puppet@production] Remove references to old druid servers from site.pp

https://gerrit.wikimedia.org/r/714015

Change 714022 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/deployment-charts@master] miscweb: override image, version name and set some CPU/RAM limits

https://gerrit.wikimedia.org/r/714022

Change 714023 had a related patch set uploaded (by Btullis; author: Btullis):

[labs/private@master] Remove dummy keytabs for decommissioned druid servers

https://gerrit.wikimedia.org/r/714023

Change 714024 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove a reference to druid1001 from DHCP

https://gerrit.wikimedia.org/r/714024

Change 714022 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: override image, version name and set some CPU/RAM limits

https://gerrit.wikimedia.org/r/714022

Change 714034 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/deployment-charts@master] miscweb: set service.deployment to production, not minikube, and port

https://gerrit.wikimedia.org/r/714034

Change 714053 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/deployment-charts@master] miscweb: define a dedicated nodePort, 4111

https://gerrit.wikimedia.org/r/714053

Change 714053 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: define a dedicated nodePort, 4111

https://gerrit.wikimedia.org/r/714053

Change 714024 merged by Btullis:

[operations/puppet@production] Remove a reference to druid1001 from DHCP

https://gerrit.wikimedia.org/r/714024

Change 714368 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/deployment-charts@master] miscweb: define a specific version tag for prod and staging

https://gerrit.wikimedia.org/r/714368

Change 714368 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: define a specific version tag for prod and staging

https://gerrit.wikimedia.org/r/714368

Change 714458 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/container/miscweb@master] replace separate httpd configs for staging/test with links to prod

https://gerrit.wikimedia.org/r/714458

Change 714459 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/container/miscweb@master] static-bugzilla: uncomment rewrite config line

https://gerrit.wikimedia.org/r/714459

Change 714458 merged by jenkins-bot:

[operations/container/miscweb@master] replace separate httpd configs for staging/test with links to prod

https://gerrit.wikimedia.org/r/714458

Change 714459 merged by jenkins-bot:

[operations/container/miscweb@master] static-bugzilla: uncomment rewrite config line

https://gerrit.wikimedia.org/r/714459

Change 714034 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: set service.deployment to production, not minikube

https://gerrit.wikimedia.org/r/714034

Change 714460 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/container/miscweb@master] static-bugzilla: add uncompressed HTML for the first 1000 bugs

https://gerrit.wikimedia.org/r/714460

Sorry, I uploaded the changes above to this task by accident. They belong to T281538.

Change 714023 merged by Btullis:

[labs/private@master] Remove dummy keytabs for decommissioned druid servers

https://gerrit.wikimedia.org/r/714023