Add new Druid nodes to analytics and public clusters
Closed, Resolved, Public

Description

In T245569 we get 4 new druid nodes:

  • an-druid100[1,2] - analytics VLAN, will be added to the Analytics cluster
  • druid100[7,8] - private VLAN, will be added to the Public cluster

Caveat: https://phabricator.wikimedia.org/T245810

/var/lib/druid is now moved under /srv/druid, so we should modify puppet accordingly. Hosts may also need to co-exist with two configs (maybe with a symlink to keep things sane and compatible).

Event Timeline

On druid1001:

/dev/mapper/druid1001--vg-druid  2.9T  1.2T  1.6T  43% /var/lib/druid

A clean solution could be this - for each node:

  • disable puppet
  • stop all daemons (could be coupled with the restarts needed for the new openjdk-8)
  • unmount /var/lib/druid and mount it under /srv (currently empty)
  • run puppet with the new hiera config (using /srv/)
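The per-node steps above can be sketched as an ordered command plan. This is only an illustration, not the procedure that was actually run: the function name, the service names passed in and the disable message are hypothetical, and the sketch only builds the command list without executing anything. It also does not cover updating /etc/fstab, which the new puppet config or a manual edit would have to handle.

```python
from typing import List


def migration_commands(device: str, old_mount: str, new_mount: str,
                       services: List[str]) -> List[List[str]]:
    """Build (but do not run) the ordered commands for moving the partition.

    All arguments are supplied by the caller; nothing here is read from the host.
    """
    # 1. Disable puppet so it does not restart daemons mid-migration.
    cmds = [["puppet", "agent", "--disable", "moving druid partition"]]
    # 2. Stop every daemon that keeps files open on the old mount point.
    cmds += [["systemctl", "stop", svc] for svc in services]
    # 3. Move the partition from the old mount point to the new one.
    cmds.append(["umount", old_mount])
    cmds.append(["mount", device, new_mount])
    # 4. Re-enable puppet and run it with the new hiera config.
    cmds.append(["puppet", "agent", "--enable"])
    cmds.append(["puppet", "agent", "--test"])
    return cmds


if __name__ == "__main__":
    # Service names below are placeholders, not taken from the task.
    plan = migration_commands(
        device="/dev/mapper/druid1001--vg-druid",
        old_mount="/var/lib/druid",
        new_mount="/srv/druid",
        services=["druid-historical", "druid-broker"],
    )
    for cmd in plan:
        print(" ".join(cmd))
```

Keeping the plan as data makes it easy to review the order (daemons stopped before the unmount, puppet re-enabled only at the end) before running anything on a host.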

Change 596633 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: move config to /srv

https://gerrit.wikimedia.org/r/596633

Change 596633 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: move config to /srv

https://gerrit.wikimedia.org/r/596633

druid100[1-3] have been ported to the new scheme. All good except for one thing I didn't know: all druid daemons are executed with -Djava.io.tmpdir=/var/lib/druid/tmp, and I don't know where it gets set. I added a symlink to /srv/druid/tmp as a temporary workaround, but it needs to be fixed properly, otherwise indexations fail :(

Change 596662 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: set java.io.tmpdir=/srv/druid/tmp

https://gerrit.wikimedia.org/r/596662

Change 596662 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: set java.io.tmpdir=/srv/druid/tmp

https://gerrit.wikimedia.org/r/596662

Change 596678 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add role::druid::analytics::worker to an-druid100[1,2]

https://gerrit.wikimedia.org/r/596678

Change 596678 merged by Elukey:
[operations/puppet@production] Add role::druid::analytics::worker to an-druid100[1,2]

https://gerrit.wikimedia.org/r/596678

The new hosts have been added to the Analytics cluster, some notes:

  • after running puppet on an-druid*, both openjdk 11 and 8 were deployed, with 11 selected by alternatives, so Druid was not able to run. I removed openjdk 11 manually, but then I wondered about zookeeper.
  • zookeeper is not running on the new nodes, but it will be once druid100[1-3] are reimaged. In that case we'll need both openjdk 11 and 8, and the zookeeper cluster will also migrate to a new version. It shouldn't be super hard, but we should do it carefully.
  • druid100[1-3] mount the druid partition on /srv/druid, while an-druid100[1,2] mount it on /srv. This is a minor difference, mostly due to T252771#6136887. Not a big deal in my opinion, but worth raising for awareness.
  • the new nodes are beefier, so we should probably have some hiera parametrization related to cpu/memory.

Change 597573 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::analytics::worker: tune historical daemon settings

https://gerrit.wikimedia.org/r/597573

Change 597573 merged by Elukey:
[operations/puppet@production] profile::druid::analytics::worker: tune historical daemon settings

https://gerrit.wikimedia.org/r/597573

Had to revert, this error popped up:

May 20 17:12:36 an-druid1001 druid[19141]: 12) Not enough direct memory.  Please adjust -XX:MaxDirectMemorySize, druid.processing.buffer.sizeBytes, druid.processing.numThreads, or druid.processing.numMergeBuffers: maxDirectMemory[12,884,901,888], memoryNeeded[20,132,659,200] = druid.processing.buffer.sizeBytes[268,435,456] * (druid.processing.numMergeBuffers[10] + druid.processing.numThreads[64] + 1)

So the idea is now to pre-calculate this as well in the historical profile puppet code, if druid.processing.numThreads is not provided via hiera.
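The arithmetic in the error message can be reproduced with a short calculation. This is a minimal sketch of the formula Druid prints in the log line above, not the puppet code from the patch that followed; the function names are made up for illustration.

```python
def direct_memory_needed(buffer_size_bytes: int, num_merge_buffers: int,
                         num_threads: int) -> int:
    """Formula from the log line: sizeBytes * (numMergeBuffers + numThreads + 1)."""
    return buffer_size_bytes * (num_merge_buffers + num_threads + 1)


def max_processing_threads(max_direct_memory: int, buffer_size_bytes: int,
                           num_merge_buffers: int) -> int:
    """Largest numThreads that still fits in the given direct memory."""
    total_buffers = max_direct_memory // buffer_size_bytes
    # Reserve the merge buffers plus the one extra buffer from the formula.
    return max(total_buffers - num_merge_buffers - 1, 0)


if __name__ == "__main__":
    # Values taken from the an-druid1001 error message.
    print(direct_memory_needed(268_435_456, 10, 64))               # 20132659200
    print(max_processing_threads(12_884_901_888, 268_435_456, 10))  # 37
```

With 12 GiB of direct memory and 256 MiB buffers there is room for 48 buffers in total, so with 10 merge buffers the historical daemon can run at most 37 processing threads unless -XX:MaxDirectMemorySize is raised.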

Change 597737 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: add autoconfig for historical

https://gerrit.wikimedia.org/r/597737

Change 597737 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: add autoconfig for historical

https://gerrit.wikimedia.org/r/597737

Change 597747 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] turnilo: move broker config to an-druid1001

https://gerrit.wikimedia.org/r/597747

Change 597747 merged by Elukey:
[operations/puppet@production] turnilo: move broker config to an-druid1001

https://gerrit.wikimedia.org/r/597747

Change 597764 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: update historical's settings

https://gerrit.wikimedia.org/r/597764

Change 597764 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: update historical's settings

https://gerrit.wikimedia.org/r/597764

Mentioned in SAL (#wikimedia-operations) [2020-05-21T12:29:04Z] <elukey> roll restart druid-public cluster (druid100[4-6], backend for the AQS API) to apply new settings + openjdk upgrade - T252771

Mentioned in SAL (#wikimedia-analytics) [2020-05-21T13:13:14Z] <elukey> stop druid-daemons on druid100[1-3] (one at the time) to move the druid partition from /srv/druid to /srv (didn't think about it before) - T252771

Change 597786 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: increase historical thread pool

https://gerrit.wikimedia.org/r/597786

Change 597786 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: increase historical thread pool

https://gerrit.wikimedia.org/r/597786

Change 597801 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add AAAA records for an-druid100[1,2] and stat1008

https://gerrit.wikimedia.org/r/597801

Change 597801 merged by Elukey:
[operations/dns@master] Add AAAA records for an-druid100[1,2] and stat1008

https://gerrit.wikimedia.org/r/597801

Change 597817 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add PTR/AAAA records for druid100[7,8]

https://gerrit.wikimedia.org/r/597817

Change 597817 merged by Elukey:
[operations/dns@master] Add PTR/AAAA records for druid100[7,8]

https://gerrit.wikimedia.org/r/597817

Change 597821 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add druid100[7,8] to the druid_public_hosts network/firewall range

https://gerrit.wikimedia.org/r/597821

Change 597821 merged by Elukey:
[operations/puppet@production] Add druid100[7,8] to the druid_public_hosts network/firewall range

https://gerrit.wikimedia.org/r/597821

Mentioned in SAL (#wikimedia-analytics) [2020-05-21T16:44:34Z] <elukey> roll restart druid historical nodes on druid100[4-6] (public cluster) to pick up new settings - T252771

Change 597833 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign role::druid::public::worker to druid100[7,8]

https://gerrit.wikimedia.org/r/597833

Change 597833 merged by Elukey:
[operations/puppet@production] Assign role::druid::public::worker to druid100[7,8]

https://gerrit.wikimedia.org/r/597833

Mentioned in SAL (#wikimedia-analytics) [2020-05-21T17:24:30Z] <elukey> add druid100[7,8] to the druid public cluster (not serving load balancer traffic for the moment, only joining the cluster) - T252771

Change 597918 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add druid100[7,8] to the druid_public_broker VIP

https://gerrit.wikimedia.org/r/597918

Change 597919 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: update druid analytics monitor

https://gerrit.wikimedia.org/r/597919

Change 597919 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: update druid analytics monitor

https://gerrit.wikimedia.org/r/597919

Change 597918 merged by Elukey:
[operations/puppet@production] Add druid100[7,8] to the druid_public_broker VIP

https://gerrit.wikimedia.org/r/597918

Change 597990 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: improve druid alerts

https://gerrit.wikimedia.org/r/597990

Change 597990 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: improve druid alerts

https://gerrit.wikimedia.org/r/597990

elukey set Final Story Points to 13.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
elukey triaged this task as High priority. May 26 2020, 7:35 AM
elukey moved this task from Incoming to Operational Excellence on the Analytics board.