
Upgrade Druid to version 26.0.0
Open, HighPublic

Description

Update Jan 2026

We have settled on version 28.0.1 as our target version, for the time being.

This is because it is the last version that officially supports Hadoop 2.

Update: version 27.0.0 is actually the last version that can be built with Hadoop 2 support; the required build option was removed in 28.0.0.

Update: version 27.0.0 exhibited some issues with Hadoop 2, even with the custom build profile. Therefore, we have now chosen version 26.0.0 as the target version, for now.

Original description follows.

https://github.com/apache/druid/issues/10462#20-result-caching describes a bug, fixed in 0.20, that we are seeing in our version (0.19): the whole-query cache on brokers is never populated. I was puzzled that the Brokers were only registering cache misses, but it seems to be due to a problem in the code rather than in the settings.

Having the Broker query cache working would probably benefit our latency, and would give us the option of using memcached as a common broker caching layer (even one instance per host would be fine).
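As a rough sketch of what that could look like (the property names are from the Druid configuration docs; the memcached hosts and port are hypothetical), the broker's runtime.properties would gain something like:

```ini
# Enable the whole-query result cache on the broker...
druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
# ...backed by a shared memcached layer instead of the default local cache.
# Hostnames/port below are illustrative only.
druid.cache.type=memcached
druid.cache.hosts=an-druid1003.eqiad.wmnet:11211,an-druid1004.eqiad.wmnet:11211
```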

Event Timeline


I'll start work on this now.

Following the build guidelines here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/druid/+/refs/heads/debian/debian/README.Debian

Ah, but I have immediately run into an issue: we chose to use version 28.0.1 (T409411#11362422) based on its support for Hadoop 2.
However, it looks like version 27.0 was the last version to include the dist-hadoop2 option at all.

Support for building Druid against Hadoop 2 was formally dropped in 28.0.0: https://github.com/apache/druid/pull/14763
Even if we want to use 27.0.0, we can't use our existing packaging method, because it relied on the released binaries with gbp buildpackage.

So we would either have to:

  1. Build version 27.0.0 from source and then package it

or

  2. Use version 26.0.0 and continue to use our existing packaging mechanism

From T409411#11348854 I think that 26.0 should be good enough, at least as first easy-step. It is a bummer that openjdk 17 support is added in 27.0, but we can use 11 on Bookworm. So yeah I'd be in favor of going to 26 as first step, and then think about 27 if really needed (and possibly a higher version with the new hardware + ceph). What do you think Ben?

It is a bummer that openjdk 17 support is added in 27.0, but we can use 11 on Bookworm

We haven't even successfully used our Bigtop/Hadoop packages on Bookworm yet, so an-druid100[3-7] are still running Bullseye, for now.
There is some progress on this, as we have created bigtop15 packages for bookworm, but there are still some lingering dependency problems around hive and python2.7 that we haven't yet solved.

So yeah I'd be in favor of going to 26 as first step, and then think about 27 if really needed (and possibly a higher version with the new hardware + ceph). What do you think Ben?

I'm trying a quick mvn clean install -Pdist-hadoop2 on version 27.0.0 and I'll see if this works.
This should generate a tarball that would be compatible with our gbp buildpackage approach for packaging, if we're lucky.

But yes, it's looking increasingly like a new cluster with the latest version and the use of S3 for deep storage is going to be the better long-term option.

BTullis renamed this task from Upgrade Druid to version 28.0.1 to Upgrade Druid to version 27.0.0.Jan 9 2026, 4:48 PM
BTullis updated the task description. (Show Details)

My test build of 27.0.0 has succeeded. The only modification that I needed to make was to update the following plugin requirement in the pom.xml file from 2.7.5 to a more recent version. I chose 2.9.1 as the most recent release.

<plugin>
    <groupId>org.cyclonedx</groupId>
    <artifactId>cyclonedx-maven-plugin</artifactId>
    <version>2.9.1</version>
</plugin>
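That version bump can be scripted; a hedged sketch (a simple sed range keyed on the artifactId is enough here because the plugin block is small and unique; for heavier surgery an XML-aware tool would be safer). It prints the patched pom to stdout:

```shell
# Print pom.xml with the cyclonedx-maven-plugin version bumped 2.7.5 -> 2.9.1.
bump_cyclonedx() {
  sed -e '/cyclonedx-maven-plugin/,/<\/plugin>/ s|<version>2\.7\.5</version>|<version>2.9.1</version>|' "$1"
}
```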

Here are the distribution files.

btullis@barracuda:~/wmf/druid-new/apache-druid-27.0.0-src$ ls -lh distribution/target/apache-druid-27.0.0-bin.tar.gz*
-rw-rw-r-- 1 btullis btullis 403M Jan  9 16:36 distribution/target/apache-druid-27.0.0-bin.tar.gz
-rw-rw-r-- 1 btullis btullis  128 Jan  9 16:38 distribution/target/apache-druid-27.0.0-bin.tar.gz.sha512
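Before packaging, the tarball can be checked against the .sha512 file that the build writes alongside it. A small sketch (the checksum file holds only the bare digest, judging by its 128-byte size, so we compare digests directly rather than using `sha512sum -c`):

```shell
# Verify a file against a sibling <file>.sha512 containing only the bare digest.
verify_sha512() {
  expected=$(tr -d ' \n' < "$1.sha512")
  actual=$(sha512sum "$1" | awk '{print $1}')
  [ "${expected}" = "${actual}" ]
}
# Usage: verify_sha512 distribution/target/apache-druid-27.0.0-bin.tar.gz && echo OK
```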

I will try creating a debian package from this tarball, using the existing gbp buildpackage method.

Icinga downtime and Alertmanager silence (ID=5a6ff076-1f79-480f-aeb9-c1989a8c389e) set by btullis@cumin1003 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Testing druid upgrade

an-test-druid1001.eqiad.wmnet

I have built the druid packages from the distribution tarball that I created of version 27.0.0

I did this on my workstation.

Added the new tarball to the repository

(base) btullis@barracuda:~/wmf/druid$ git checkout debian

(base) btullis@barracuda:~/wmf/druid$ export DRUID_VERSION=27.0.0

(base) btullis@barracuda:~/wmf/druid$ gbp import-orig -u $DRUID_VERSION --upstream-branch=master --debian-branch=debian ../druid-new/apache-druid-27.0.0-src/distribution/target/apache-druid-27.0.0-bin.tar.gz
gbp:info: Importing '../druid-new/apache-druid-27.0.0-src/distribution/target/apache-druid-27.0.0-bin.tar.gz' to branch 'master'...
gbp:info: Source package is druid
gbp:info: Upstream version is 27.0.0
gbp:info: Replacing upstream source on 'debian'
gbp:info: Successfully imported version 27.0.0 of ../druid-new/apache-druid-27.0.0-src/distribution/target/apache-druid-27.0.0-bin.tar.gz

Updated the version of the mysql-connector for Java from 5.1.48 to 5.1.49

(base) btullis@barracuda:~/wmf/druid$ cd debian/extensions/mysql-metadata-storage/

(base) btullis@barracuda:~/wmf/druid/debian/extensions/mysql-metadata-storage$ wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
--2026-01-12 10:39:42--  https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.49/mysql-connector-java-5.1.49.jar
Resolving repo1.maven.org (repo1.maven.org)... 2606:4700::6812:120c, 2606:4700::6812:130c, 104.18.19.12, ...
Connecting to repo1.maven.org (repo1.maven.org)|2606:4700::6812:120c|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1006904 (983K) [application/java-archive]
Saving to: ‘mysql-connector-java-5.1.49.jar’

mysql-connector-java-5.1.49.jar                      100%[=====================================================================================================================>] 983.30K  --.-KB/s    in 0.1s    

2026-01-12 10:39:42 (8.20 MB/s) - ‘mysql-connector-java-5.1.49.jar’ saved [1006904/1006904]

(base) btullis@barracuda:~/wmf/druid/debian/extensions/mysql-metadata-storage$ rm mysql-connector-java-5.1.48.jar

Updated the list of included binaries.

(base) btullis@barracuda:~/wmf/druid$ find {debian,extensions,hadoop-dependencies,lib} -name "*.jar" | sort > debian/source/include-binaries

Updated the debian/changelog with the new version details.

(base) btullis@barracuda:~/wmf/druid$ dch -i

Committed my changes, then pushed both the master and debian branches to gerrit.

Then on build2002 I did the following.
Checked out the repository:

btullis@build2002:~$ git clone "https://gerrit.wikimedia.org/r/operations/debs/druid" && (cd "druid" && mkdir -p `git rev-parse --git-dir`/hooks/ && curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://gerrit.wikimedia.org/r/tools/hooks/commit-msg && chmod +x `git rev-parse --git-dir`/hooks/commit-msg)

Added a tag for the current upstream version.

btullis@build2002:~/druid$ git tag upstream/27.0.0-wmf0

Then built the packages:

btullis@build2002:~/druid$ GIT_PBUILDER_AUTOCONF=no WIKIMEDIA=yes gbp buildpackage -sa -us -uc --git-pbuilder --git-no-pbuilder-autoconf --source-option="--include-removal" --git-arch=amd64 --git-dist=bullseye --git-color=on

The resulting packages were generated in /var/cache/pbuilder/result/bullseye-amd64

root@build2002:/var/cache/pbuilder/result/bullseye-amd64# ls -l
total 1138812
-rw-r--r-- 1 btullis wikidev      6577 Jan 12 11:12 druid_27.0.0-wmf0-1_amd64.buildinfo
-rw-r--r-- 1 btullis wikidev      3622 Jan 12 11:12 druid_27.0.0-wmf0-1_amd64.changes
-rw-r--r-- 1 btullis wikidev    967604 Jan 12 11:11 druid_27.0.0-wmf0-1.debian.tar.xz
-rw-r--r-- 1 btullis wikidev      1214 Jan 12 11:11 druid_27.0.0-wmf0-1.dsc
-rw-r--r-- 1 btullis wikidev      1531 Jan 12 11:12 druid_27.0.0-wmf0-1_source.changes
-rw-r--r-- 1 btullis wikidev 421686609 Jan 12 11:10 druid_27.0.0-wmf0.orig.tar.gz
-rw-r--r-- 1 btullis wikidev      5236 Jan 12 11:11 druid-broker_27.0.0-wmf0-1_all.deb
-rw-r--r-- 1 btullis wikidev 398231556 Jan 12 11:12 druid-common_27.0.0-wmf0-1_all.deb
-rw-r--r-- 1 btullis wikidev      5108 Jan 12 11:11 druid-coordinator_27.0.0-wmf0-1_all.deb
-rw-r--r-- 1 btullis wikidev      5452 Jan 12 11:11 druid-historical_27.0.0-wmf0-1_all.deb
-rw-r--r-- 1 btullis wikidev      5576 Jan 12 11:11 druid-middlemanager_27.0.0-wmf0-1_all.deb
-rw-r--r-- 1 btullis wikidev      5108 Jan 12 11:11 druid-overlord_27.0.0-wmf0-1_all.deb
<snip>

I copied these to the apt server:

btullis@apt1002:~$ rsync -av build2002.codfw.wmnet::pbuilder-result/bullseye-amd64/druid* .
receiving incremental file list
druid-broker_27.0.0-wmf0-1_all.deb
druid-common_27.0.0-wmf0-1_all.deb
druid-coordinator_27.0.0-wmf0-1_all.deb
druid-historical_27.0.0-wmf0-1_all.deb
druid-middlemanager_27.0.0-wmf0-1_all.deb
druid-overlord_27.0.0-wmf0-1_all.deb
druid_27.0.0-wmf0-1.debian.tar.xz
druid_27.0.0-wmf0-1.dsc
druid_27.0.0-wmf0-1_amd64.buildinfo
druid_27.0.0-wmf0-1_amd64.changes
druid_27.0.0-wmf0-1_source.changes
druid_27.0.0-wmf0.orig.tar.gz

sent 256 bytes  received 821,126,644 bytes  96,603,164.71 bytes/sec
total size is 820,925,193  speedup is 1.00

Then I installed them to the apt repository with reprepro:

btullis@apt1002:~$ sudo -i reprepro include bullseye-wikimedia `pwd`/druid_27.0.0-wmf0-1_amd64.changes
Exporting indices...
Deleting files no longer referenced...
btullis@apt1002:~$ sudo -i reprepro ls druid
druid |   0.19.wmf0-1 |   buster-wikimedia | source
druid | 27.0.0-wmf0-1 | bullseye-wikimedia | source

Now I can install them on an-test-druid1001 and see how they behave.

Ah, I failed to take account of something from the changelogs that @elukey had already mentioned.

Version 0.21 deprecates Zookeeper 3.4: https://github.com/apache/druid/releases/tag/druid-0.21.0#21-deprecate-zk-3.4

Zookeeper version 3.4 is the version of zookeeper included with Debian bullseye, and as such it is the version of zookeeper that is colocated with all of our druid clusters.

btullis@an-test-druid1001:/var/log/druid$ apt-cache policy zookeeper
zookeeper:
  Installed: 3.4.13-6+deb11u1
  Candidate: 3.4.13-6+deb11u1
  Version table:
 *** 3.4.13-6+deb11u1 500
        500 http://mirrors.wikimedia.org/debian bullseye/main amd64 Packages
        500 http://security.debian.org/debian-security bullseye-security/main amd64 Packages
        100 /var/lib/dpkg/status

There would seem to be two possible ways forward:

  1. Upgrade the druid hosts to bookworm.
  2. Switch from using a colocated zookeeper cluster to a different cluster, which uses a more recent zookeeper version.

I am tempted to try the first option.

However, we know that there are some incompatibility issues between our Hadoop packages and Bookworm, but I believe that these are limited to the hive and hive-hcatalog packages.
Neither of these are installed on an-test-druid1001 - so it might be fine to use the bigtop 1.5 packages for bookworm on this host.

If that doesn't work, then we can switch the zookeeper URL to zookeeper-test1002, instead of localhost.
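If we did go that route, the change would amount to a one-line sketch in Druid's common runtime.properties (the property name is from the Druid docs; the fully-qualified hostname and port are assumptions):

```ini
# Point Druid at the remote test quorum instead of the colocated one.
# Hostname suffix and port are assumed, based on the host named above.
druid.zk.service.host=zookeeper-test1002.eqiad.wmnet:2181
```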

I should also mention that the services aren't working with the current zookeeper version.

I get various errors from the different services:

From overlord.log file:

2026-01-12T11:58:56,700 ERROR org.apache.druid.curator.CuratorModule: Unhandled error in Curator, stopping server.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.5.10.jar:3.5.10]

From coordinator.log

2026-01-12T11:54:38,442 ERROR org.apache.curator.framework.imps.CuratorFrameworkImpl: Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.5.10.jar:3.5.10]

From broker.log

2026-01-12T11:38:19,094 ERROR org.apache.curator.x.discovery.details.ServiceDiscoveryImpl: Could not re-register instances after reconnection
org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /druid/analytics-test-eqiad/discovery/druid:broker/d98a7643-501a-4c86-be39-37ba26793a8e
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:106) ~[zookeeper-3.5.10.jar:3.5.10]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.5.10.jar:3.5.10]
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637) ~[zookeeper-3.5.10.jar:3.5.10]

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-test-druid1001 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-test-druid1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-test-druid1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-test-druid1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-test-druid1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-test-druid1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-test-druid1001 (FAIL)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-test-druid1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1003 for host an-test-druid1001.eqiad.wmnet with OS bookworm executed with errors:

  • an-test-druid1001 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601130753_brouberol_1200081_an-test-druid1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-test-druid1001.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

We have at least two problems with the puppet run for a bookworm host running Druid 27.0.0:

  • We need the prometheus-druid-exporter to be made available for bookworm.
  • The zookeeper quorum that is colocated with Druid has been using OpenJDK-11, which is not available for bookworm.

For the first issue, we can simply copy the prometheus exporter from bullseye-wikimedia to bookworm-wikimedia, as it's plain python.

For the second, we should probably try to standardize on OpenJDK 17, which is what the upstream project recommends.

Druid fully supports Java 8u92+, Java 11, and Java 17. The project team recommends Java 17.

For the second, we should probably try to standardize on OpenJDK 17, which is what the upstream project recommends.

FWIW, Java 17 is also available for Bullseye, so this allows setting up a J17 cluster across Debian 11/12

FWIW, Java 17 is also available for Bullseye, so this allows setting up a J17 cluster across Debian 11/12

Nice, thanks. That could be very useful.


I copied the package like this:

btullis@apt1002:~/bookworm$ sudo -i reprepro ls prometheus-druid-exporter
prometheus-druid-exporter | 0.8-1 |   buster-wikimedia | amd64, i386, source
prometheus-druid-exporter | 0.8-2 | bullseye-wikimedia | amd64, i386, source
btullis@apt1002:~/bookworm$ sudo -i reprepro copy bookworm-wikimedia bullseye-wikimedia prometheus-druid-exporter
Exporting indices...
btullis@apt1002:~/bookworm$ sudo -i reprepro ls prometheus-druid-exporter
prometheus-druid-exporter | 0.8-1 |   buster-wikimedia | amd64, i386, source
prometheus-druid-exporter | 0.8-2 | bullseye-wikimedia | amd64, i386, source
prometheus-druid-exporter | 0.8-2 | bookworm-wikimedia | amd64, i386, source

Now I'll work on a puppet patch to select only Java 17 when running on bookworm, for both druid and its colocated zookeeper cluster.

Change #1226219 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the Java version and other settings for the druid test cluster

https://gerrit.wikimedia.org/r/1226219

Change #1226219 merged by Btullis:

[operations/puppet@production] Update the Java version and other settings for the druid test cluster

https://gerrit.wikimedia.org/r/1226219

Change #1226251 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Only install the JRE instead of the JDK on druid-test hosts.

https://gerrit.wikimedia.org/r/1226251

Change #1226270 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] druid_exporter: duplicate config from druid 0.19.0

https://gerrit.wikimedia.org/r/1226270

Change #1226251 merged by Btullis:

[operations/puppet@production] Only install the JRE instead of the JDK on druid-test hosts.

https://gerrit.wikimedia.org/r/1226251

Change #1226270 merged by Brouberol:

[operations/puppet@production] druid_exporter: duplicate config from druid 0.19.0

https://gerrit.wikimedia.org/r/1226270

Change #1226302 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove GC logging options that are incompatible with Java 17 on druid-test

https://gerrit.wikimedia.org/r/1226302

Change #1226302 merged by Btullis:

[operations/puppet@production] Remove GC logging options that are incompatible with Java 17 on druid-test

https://gerrit.wikimedia.org/r/1226302

Change #1226766 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] druid_exporter: Fixup metric definition

https://gerrit.wikimedia.org/r/1226766

Change #1226844 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] druid: inject flags allowing druid to access protected classes in java > 8

https://gerrit.wikimedia.org/r/1226844

Change #1226844 merged by Brouberol:

[operations/puppet@production] druid: inject flags allowing druid to access protected classes in java > 8

https://gerrit.wikimedia.org/r/1226844
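The flags in question are of the kind upstream documents for running Druid on Java 11 and later; a sketch of the jvm.config additions follows (the authoritative list for our setup is in the patch itself):

```ini
--add-exports=java.base/jdk.internal.misc=ALL-UNNAMED
--add-exports=jdk.management/com.sun.management.internal=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
```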

Change #1227754 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Tweak druid configuration to enable druid 27.0 to run on jvm8

https://gerrit.wikimedia.org/r/1227754

Change #1227754 merged by Brouberol:

[operations/puppet@production] Tweak druid configuration to enable druid 27.0 to run on jvm8

https://gerrit.wikimedia.org/r/1227754

Change #1227771 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] an-test-druid: run zookeeper on the default system java version

https://gerrit.wikimedia.org/r/1227771

Change #1227786 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] an-test-druid: disable noisy GC stat logging

https://gerrit.wikimedia.org/r/1227786

Change #1228498 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Remove test druid cluster noisy jvm GC params

https://gerrit.wikimedia.org/r/1228498

Change #1228498 merged by Btullis:

[operations/puppet@production] Remove test druid cluster noisy jvm GC params

https://gerrit.wikimedia.org/r/1228498

We had to revert to version 26.0.0 because the Hadoop 2 support in 27.0.0 seems inconsistent, even when we build it with the -P dist-hadoop2 flag.

However, we are now happy with this version on an-test-druid1001 and @JAllemandou has successfully tested both batch ingestion and real-time ingestion jobs.

Now we have to think about how and when to carry out the upgrade.

Druid does support rolling upgrades, but this will be a bit complicated for us because of the need to upgrade the base OS at the same time.

For us to be able to do a rolling upgrade, we would need to do the following:

  • Build a druid version 0.19.0 package for bookworm
  • Upgrade one host at a time to bookworm, running Druid 0.19.0
  • Run a zookeeper quorum with mixed versions 3.4 and 3.8 of zookeeper
  • When all hosts have been upgraded to bookworm and the zookeeper quorum is stable on version 3.8:
    • Upgrade the druid packages to 26.0.0 across the cluster (automatic restarts do not occur)
    • Deploy the puppet change to update the JAVA_OPTS and similar settings.
    • Restart all of the daemons in the following order:
      • Historical
      • Overlord
      • Middle Manager
      • Broker
      • Coordinator

So it's doable, but a bit complex.
It would be a lot simpler if we could schedule some downtime for the service and do a bulk reimage of all five an-druid servers to bookworm.
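The per-host restart sequence from the plan above can be sketched as follows (systemd unit names are assumed to match the package names; the actual restart is left commented out, since it only applies on a druid host):

```shell
# Restart the druid daemons on one host, in the order from the upgrade plan.
restart_order="druid-historical druid-overlord druid-middlemanager druid-broker druid-coordinator"
for svc in ${restart_order}; do
  echo "restarting ${svc}"
  # sudo systemctl restart "${svc}"   # commented: only meaningful on a druid host
done
```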

I'll check with the Data-Engineering team whether they're happy for an hour's downtime, and also whether this is realistic for the SRE team.

Totally fine from SRE's side, just check in with the current oncallers before you begin the disruptive part of the maintenance.

BTullis renamed this task from Upgrade Druid to version 27.0.0 to Upgrade Druid to version 26.0.0.Tue, Jan 20, 3:50 PM
BTullis updated the task description. (Show Details)

Change #1229539 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Upgrade druid to version 26.0.0 on the analytics cluster

https://gerrit.wikimedia.org/r/1229539

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-druid1003.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-druid1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-druid1005.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-druid1006.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-druid1007.eqiad.wmnet with OS bookworm

Change #1229539 merged by Btullis:

[operations/puppet@production] Upgrade druid to version 26.0.0 on the analytics cluster

https://gerrit.wikimedia.org/r/1229539

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-druid1007.eqiad.wmnet with OS bookworm executed with errors:

  • an-druid1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console an-druid1007.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host an-druid1007.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-druid1006.eqiad.wmnet with OS bookworm completed:

  • an-druid1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601211144_btullis_2887727_an-druid1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-druid1005.eqiad.wmnet with OS bookworm completed:

  • an-druid1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601211149_btullis_2887704_an-druid1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-druid1003.eqiad.wmnet with OS bookworm completed:

  • an-druid1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601211155_btullis_2887617_an-druid1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-druid1004.eqiad.wmnet with OS bookworm completed:

  • an-druid1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601211159_btullis_2887678_an-druid1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host an-druid1007.eqiad.wmnet with OS bookworm completed:

  • an-druid1007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202601211203_btullis_2889512_an-druid1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

We have now upgraded the druid_analytics cluster and all seems well, although there is still a little back-filling to be completed on the webrequest_sampled_live and wmf_netflow datasets. I believe that @JAllemandou is working on this.

There were a few points of note, during the upgrade.

  • an-druid100[3-4] did not have their vg0/srv LVM logical volume wiped during the upgrade, whereas an-druid100[5-7] did.

This turns out to have been beneficial, in terms of the time it took to reload data from deep storage. We should try to make sure that this volume is also retained when we reimage the druid_public cluster.

Checking the configuration, we can see that this shouldn't have been a surprise, since this is how it is configured.

Similarly, an-druid100[3-4] waited for input at the partman stage of the installer, because they are configured with the reuse-parts-test.cfg recipe.

  • There was a race condition in puppet, which meant that all druid services needed to be restarted manually after installation. The druid-broker and druid-historical services failed to start, blocking the cookbook from exiting cleanly, but we also saw that all of the other services were behaving strangely, so it's possible that they were started before puppet had finished configuring them.

Change #1230321 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure druid clusters to reuse their /srv volume

https://gerrit.wikimedia.org/r/1230321

Change #1230321 merged by Btullis:

[operations/puppet@production] Configure druid clusters to reuse their /srv volume

https://gerrit.wikimedia.org/r/1230321

Data has been backfilled, so there is no longer a hole in the webrequest_sampled_live data. I have created T415359: Batch index webrequest_sampled data in Druid to automate batch-indexing of the webrequest_sampled data.

There still is an issue with realtime tasks not being able to properly launch replicas:

  • Realtime jobs are configured so that their realtime tasks have 3 replicas (for redundancy and query parallelization)
  • Currently only one of them actually runs, the other two always waiting in PENDING status.
  • It's also worth noting that all of the running tasks have run on an-druid1003, as if tasks couldn't be launched remotely

We should investigate this when folks are back from the SRE summit.
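
When we pick this up, a first step could be to ask the Overlord API where the pending replicas are queued and whether the middlemanagers have free task slots. A sketch (the Overlord host/port is an assumption and would need to match our actual Overlord; the endpoints are standard Druid Overlord APIs):

```shell
# Hypothetical sketch: interrogate the Overlord about the stuck replicas.
# The host/port is an assumption; the endpoints are standard Overlord APIs.
#   pendingTasks: tasks waiting for a worker slot
#   runningTasks: should show them all on an-druid1003
#   workers:      per-middlemanager capacity vs. currCapacityUsed
OVERLORD="http://an-druid1003.eqiad.wmnet:8090"

for endpoint in pendingTasks runningTasks workers; do
  echo "== ${endpoint}"
  curl -s --max-time 5 "${OVERLORD}/druid/indexer/v1/${endpoint}" \
    || echo "(Overlord unreachable)"
done
```

If `workers` shows `currCapacityUsed` equal to `capacity` on all but one middlemanager, the replicas simply have nowhere to run, which would also explain everything landing on a single host.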

Hey folks, I was using turnilo today with queries like https://w.wiki/Ha94, plus also some regexes, and I noticed a general slowdown (but eventually it worked). I then checked the Historical's GC metrics and I found this:

Screenshot From 2026-01-23 17-51-22.png (2×4 px, 467 KB)

The heap size didn't seem to be under pressure, so I am wondering if it is just something related to queries (maybe with regexes) that allocate a ton of objects with short lifetimes, hitting some constraint like a small young gen etc. Since we use G1 now, we could try to increase something like -XX:G1MaxNewSizePercent and see how it goes.

The heap size didn't seem to be under pressure, so I am wondering if it is just something related to queries (maybe with regexes) that allocate a ton of objects with short lifetimes, hitting some constraint like a small young gen etc. Since we use G1 now, we could try to increase something like -XX:G1MaxNewSizePercent and see how it goes.

Apologies, I'm not up to speed on anything JRE since about ten years, so here's a naive question -- is there a straightforward way to sample some call stacks that triggered allocations?

I'm not strong either at GC and Java internals (stack dumps etc.), but I'm willing to help.
It seems that the default -XX:G1MaxNewSizePercent is 60% - what value do you think we should move it to?

My high-level feeling is that the query itself is expensive for Druid, which performs some kind of "table scan", itself creating objects, which then get GC-ed. This is not backed by any kind of Druid knowledge though, just an intuition.

One thing we could try would be to have @CDanis run a query that we know is expensive in Druid, and perform thread dumps of the druid services, to get a better perspective of what's happening there.
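
For the thread-dump idea, a rough sketch of what that could look like on a historical host (the process pattern is an assumption; `jstack` and `jcmd` ship with the JDK):

```shell
# Hypothetical sketch: capture thread dumps from the druid-historical
# process while the expensive query is running.
# The process pattern is an assumption and would need checking.
PID=$(pgrep -f 'org.apache.druid.cli.Main server historical' | head -1)

if [ -n "$PID" ]; then
  # A few dumps a few seconds apart make hot/busy stacks stand out
  for i in 1 2 3; do
    jstack "$PID" > "/tmp/historical-threads-${i}.txt"
    sleep 5
  done
  # Young/old gen occupancy at the same moment, for correlation
  jcmd "$PID" GC.heap_info > /tmp/historical-heap.txt
fi
```

For the original allocation-sampling question, `jcmd <pid> JFR.start` (Java Flight Recorder) or async-profiler in alloc mode would give actual allocation stacks, but both need a bit more setup than plain thread dumps.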

Something else to take into account: When using turnilo, multiple druid queries are issued to provide the UI. For the example Luca gave above, I counted 12 distinct queries.
In this example there is one query for the main visualization, and 11 for the right-side panel (pinboard). Each of those 11 queries takes the same filters as the main one, so if it's a complex regex, it'll be recomputed many times with sub-level filtering.
New versions of turnilo seem to provide the capability to add/remove that panel on demand, which should be handy! Let's maybe plan on upgrading turnilo?

I'm not strong either at GC and Java internals (stack dumps etc.), but I'm willing to help.
It seems that the default -XX:G1MaxNewSizePercent is 60% - what value do you think we should move it to?

Ah, I thought it was less, so maybe tuning it won't lead to much improvement (worth testing though). Maybe we could turn on GC logging, replicate the 12-distinct-queries scenario that you outlined, and see what the GC does with the young gen? We'll know for sure why it triggers, and maybe we'll have a better idea of what to tune. What do you think?
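
If we go down the GC logging route, something like the following extra JVM options would do it (a sketch; the log path and the 70% value are assumptions, and note that G1MaxNewSizePercent is an experimental flag, so it has to be unlocked first):

```shell
# Hypothetical sketch of extra JVM options for the historicals.
# Java 8 GC logging flags (Java 9+ would use -Xlog:gc* instead):
GC_LOG_OPTS="-Xloggc:/var/log/druid/historical-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime"

# G1MaxNewSizePercent is experimental, so it must be unlocked;
# the default is 60, so we'd try something higher, e.g. 70.
G1_TUNING="-XX:+UnlockExperimentalVMOptions -XX:G1MaxNewSizePercent=70"

echo "$GC_LOG_OPTS $G1_TUNING"
```

With logging on, frequent young-gen collections with short inter-GC intervals during the 12-query scenario would back up the short-lived-allocation theory.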

Change #1227771 abandoned by Brouberol:

[operations/puppet@production] an-test-druid: run zookeeper on the default system java version

Reason:

We've decided to keep druid and zookeeper on the same JVM to keep things simpler

https://gerrit.wikimedia.org/r/1227771

Change #1227786 abandoned by Brouberol:

[operations/puppet@production] an-test-druid: disable noisy GC stat logging

Reason:

Already done

https://gerrit.wikimedia.org/r/1227786

We're still running Druid with Java 8, while https://druid.apache.org/docs/latest/tutorials/ mentions Java 17 as the required version to _build_ Druid (which supposedly also means that Java 17 would be an adequate version to _run_ Druid I guess?)

How about we try moving to Java 17 on an-test-druid (and then prod), maybe this fully obsoletes the GC issues we've seen?

In fact druid-broker and druid-common already depend on openjdk-17, but Hiera still defaults to Java 8 in hieradata/role/common/druid/analytics/worker.yaml:

profile::java::java_packages:
  - version: '8'  # for druid
    variant: 'jre'
  - version: '17' # for zookeeper
    variant: 'jre'

@MoritzMuehlenhoff we are not using Druid's latest version, because of hadoop dependencies.
We are on Druid 26.0 currently, for which java 8 is still recommended: https://druid.apache.org/docs/26.0.0/tutorials/.
We have plans to try to get away from Hadoop for ingestion when we receive new hardware for a new cluster {T413446}.

I think we still need to upgrade druid public before closing this task. @BTullis / @JAllemandou : can you confirm?

I think we still need to upgrade druid public before closing this task. @BTullis / @JAllemandou : can you confirm?

Absolutely yes! This needs to stay open until we have upgraded the druid-public cluster.

We haven't yet set a date for the upgrade.

I have asked the Data-Engineering team to guide us on this and manage any required comms, since there will be an unavoidable period of downtime that will affect public-facing features.

These features include edit and editor metrics, which are served via AQS.

cc: @Ahoelzl and @GGoncalves-WMF

We will need a maintenance window of around an hour, although the downtime might well be less than that, since we have learnt some lessons after upgrading the druid_analytics cluster.

Change #1226766 abandoned by Brouberol:

[operations/puppet@production] druid_exporter: Fixup metric definition

Reason:

Already managed by joal

https://gerrit.wikimedia.org/r/1226766