
Increase the Druid broker cache size from 2GB to 4GB
Closed, Resolved, Public

Description

We have had a request from @CDanis to increase the size of the Druid broker's cache.

The current value of 2GB was determined in 2017 (T176223), when the original servers had 128 GB of RAM (cf. T128807, I believe).

Now, we have two clusters, each with 5 servers with 128 GB of RAM each.

We can see from these graphs that the cache is full most of the time and reaches capacity within a few days of being cleared, presumably when each broker process is restarted.

@JAllemandou commented to say:

I assume we could grow the cache size, but this would reduce the amount of memory available for the page cache, which Druid uses heavily.
I'm not sure what a good value should be here. We could try bumping the cache size to 4GB and see (the total amount of memory available on the host is 128GB, so using 2GB more shouldn't be that bad).
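
To put the trade-off in the comment above into numbers, here is a minimal sketch of the arithmetic. It uses only the figures quoted in this task (128 GB of RAM per host, a 2 GB cache today, 4 GB proposed); the helper name `share_of_ram` is made up for illustration, and the sketch ignores the memory used by the other Druid daemons on the same host.

```python
# Rough memory-budget check for the proposed broker cache increase.
# All figures are the ones quoted in the task description.
HOST_RAM_GB = 128
CURRENT_CACHE_GB = 2
PROPOSED_CACHE_GB = 4


def share_of_ram(cache_gb: float, ram_gb: float = HOST_RAM_GB) -> float:
    """Return the cache size as a percentage of host RAM."""
    return 100.0 * cache_gb / ram_gb


# Memory that would no longer be available to the OS page cache.
extra_gb = PROPOSED_CACHE_GB - CURRENT_CACHE_GB

print(f"current cache:  {share_of_ram(CURRENT_CACHE_GB):.2f}% of RAM")
print(f"proposed cache: {share_of_ram(PROPOSED_CACHE_GB):.2f}% of RAM")
print(f"extra memory taken from page cache: {extra_gb} GB "
      f"({share_of_ram(extra_gb):.2f}% of RAM)")
```

Under these assumptions the increase costs well under 2% of host memory, which is consistent with the expectation that the page cache should not suffer much.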

image.png (965×1 px, 127 KB)

image.png (954×1 px, 126 KB)

Event Timeline

Change #1199280 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] druid: Increase the size of the Druid broker cache size to 4GB

https://gerrit.wikimedia.org/r/1199280

Change #1199280 merged by Stevemunene:

[operations/puppet@production] druid: Increase the size of the Druid broker cache size to 4GB

https://gerrit.wikimedia.org/r/1199280
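
The patch itself is not reproduced in this task. As a hedged illustration of what such a change typically amounts to, the sketch below renders the cache size as a Druid runtime property. `druid.cache.sizeInBytes` is the property the Druid documentation describes for the local and Caffeine caches; whether the Puppet change sets exactly this property, and whether it counts in binary gigabytes (GiB), are assumptions. The helper `cache_size_property` is hypothetical.

```python
# Hypothetical helper: render the broker cache size as a runtime property.
# Assumption: sizes are binary gigabytes (GiB); the real patch may differ.
GIB = 1024 ** 3


def cache_size_property(size_gib: int) -> str:
    """Format a cache size in GiB as a druid.cache.sizeInBytes line."""
    return f"druid.cache.sizeInBytes={size_gib * GIB}"


print(cache_size_property(2))  # previous value
print(cache_size_property(4))  # new value
```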

Mentioned in SAL (#wikimedia-analytics) [2025-10-29T12:14:55Z] <stevemunene> roll restart druid worker hosts for T408189

Ran Puppet on the hosts, then restarted the Druid daemons.

stevemunene@cumin1003:~$ sudo cumin 'an-druid1*.eqiad.wmnet' 'run-puppet-agent -q'
5 hosts will be targeted:
an-druid[1003-1007].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NO OUTPUT =====
PASS |████████████████████| 100% (5/5) [01:20<00:00, 16.02s/hosts]
FAIL |                    |   0% (0/5) [01:20<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'run-puppet-agent -q'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1003:~$ sudo cumin 'druid1*.eqiad.wmnet' 'run-puppet-agent -q'
5 hosts will be targeted:
druid[1009-1013].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NO OUTPUT =====
PASS |████████████████████| 100% (5/5) [01:12<00:00, 14.56s/hosts]
FAIL |                    |   0% (0/5) [01:12<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'run-puppet-agent -q'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

analytics

stevemunene@cumin1003:~$ sudo cookbook sre.druid.roll-restart-workers analytics
Acquired lock for key /spicerack/locks/cookbooks/sre.druid.roll-restart-workers: {'concurrency': 20, 'created': '2025-10-29 12:17:20.920308', 'owner': 'stevemunene@cumin1003 [754588]', 'ttl': 1800}
START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: an-druid[1003-1007]
[1/12, retrying in 10.00s] Unable to verify all hosts got downtimed: Some hosts are not yet downtimed: ['an-druid1007']
Created silence ID 3f3c7aa7-fbc9-4593-883f-6ffbcd90fdec
Restarting daemons (historical,overlord,middlemanager,broker,coordinator), one host at the time.
===== NO OUTPUT =====
PASS |████████████████████| 100% (5/5) [45:59<00:00, 551.94s/hosts]
FAIL |                    |   0% (0/5) [45:59<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restarting the Prometheus Druid exporters
===== NO OUTPUT =====
PASS |████████████████████| 100% (5/5) [00:00<00:00,  6.75hosts/s]
FAIL |                    |   0% (0/5) [00:00<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'systemctl restar...s-druid-exporter'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deleted silence ID 3f3c7aa7-fbc9-4593-883f-6ffbcd90fdec
All Druid jvm restarts completed!
Released lock for key /spicerack/locks/cookbooks/sre.druid.roll-restart-workers: {'concurrency': 20, 'created': '2025-10-29 12:17:20.920308', 'owner': 'stevemunene@cumin1003 [754588]', 'ttl': 1800}
END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.

public

stevemunene@cumin1003:~$ sudo cookbook sre.druid.roll-restart-workers public
Acquired lock for key /spicerack/locks/cookbooks/sre.druid.roll-restart-workers: {'concurrency': 20, 'created': '2025-10-29 13:05:33.307919', 'owner': 'stevemunene@cumin1003 [760226]', 'ttl': 1800}
START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons.
Scheduling downtime on Icinga server alert1002.wikimedia.org for hosts: druid[1009-1013]
Created silence ID c1751a61-26ad-4cd7-9606-1e3da5a4e12a
Restarting daemons (historical,overlord,middlemanager,broker,coordinator), one host at the time.
PASS |                    |   0% (0/5) [00:00<?, ?hosts/s]
===== NODE GROUP =====
(1) druid1013.eqiad.wmnet
----- OUTPUT -----
Pooling all services on druid1013.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1013.eqiad.wmnet: pooled changed no => yes
===== NODE GROUP =====
(1) druid1013.eqiad.wmnet
----- OUTPUT -----
Depooling all services on druid1013.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1013.eqiad.wmnet: pooled changed yes => no
===== NODE GROUP =====
(1) druid1012.eqiad.wmnet
----- OUTPUT -----
Pooling all services on druid1012.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1012.eqiad.wmnet: pooled changed no => yes
===== NODE GROUP =====
(1) druid1012.eqiad.wmnet
----- OUTPUT -----
Depooling all services on druid1012.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1012.eqiad.wmnet: pooled changed yes => no
===== NODE GROUP =====
(1) druid1011.eqiad.wmnet
----- OUTPUT -----
Pooling all services on druid1011.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1011.eqiad.wmnet: pooled changed no => yes
===== NODE GROUP =====
(1) druid1011.eqiad.wmnet
----- OUTPUT -----
Depooling all services on druid1011.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1011.eqiad.wmnet: pooled changed yes => no
===== NODE GROUP =====
(1) druid1010.eqiad.wmnet
----- OUTPUT -----
Pooling all services on druid1010.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1010.eqiad.wmnet: pooled changed no => yes
===== NODE GROUP =====
(1) druid1010.eqiad.wmnet
----- OUTPUT -----
Depooling all services on druid1010.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1010.eqiad.wmnet: pooled changed yes => no
===== NODE GROUP =====
(1) druid1009.eqiad.wmnet
----- OUTPUT -----
Pooling all services on druid1009.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1009.eqiad.wmnet: pooled changed no => yes
===== NODE GROUP =====
(1) druid1009.eqiad.wmnet
----- OUTPUT -----
Depooling all services on druid1009.eqiad.wmnet
eqiad/druid-public/druid-public-broker/druid1009.eqiad.wmnet: pooled changed yes => no
================
PASS |████████████████████| 100% (5/5) [50:52<00:00, 610.41s/hosts]
FAIL |                    |   0% (0/5) [50:52<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restarting the Prometheus Druid exporters
===== NO OUTPUT =====
PASS |████████████████████| 100% (5/5) [00:00<00:00,  8.17hosts/s]
FAIL |                    |   0% (0/5) [00:00<?, ?hosts/s]
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'systemctl restar...s-druid-exporter'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Deleted silence ID c1751a61-26ad-4cd7-9606-1e3da5a4e12a
All Druid jvm restarts completed!
Released lock for key /spicerack/locks/cookbooks/sre.druid.roll-restart-workers: {'concurrency': 20, 'created': '2025-10-29 13:05:33.307919', 'owner': 'stevemunene@cumin1003 [760226]', 'ttl': 1800}
END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons.
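
For planning future roll restarts, the elapsed times reported by the two cookbook runs (45:59 for analytics, 50:52 for public, 5 hosts each) can be turned into a rough per-host figure. This is only a sketch over the numbers printed in the logs above; the helper `elapsed_seconds` and the `runs` mapping are made up for illustration.

```python
def elapsed_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' progress-bar timestamp to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Elapsed time and host count taken from the cookbook output in this task.
runs = {
    "analytics": ("45:59", 5),
    "public": ("50:52", 5),
}

for cluster, (elapsed, hosts) in runs.items():
    total = elapsed_seconds(elapsed)
    per_host_minutes = total / hosts / 60
    print(f"{cluster}: {total} s total, "
          f"about {per_host_minutes:.1f} min per host")
```

Both runs work out to roughly nine to ten minutes per host, in line with the per-host rates shown in the progress bars (551.94 s and 610.41 s).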