Update druid to 0.10
Closed, Resolved | Public | 13 Estimated Story Points

Event Timeline

Nuria renamed this task from Update druid to latest source to Update druid to latest release. Apr 27 2017, 3:51 PM
Nuria created this task.

Some extra work is needed to debianize the new Apache Calcite SQL server. Puppet changes may also be needed for the overlord/coordinator setup.

Nuria set the point value for this task to 13. Apr 27 2017, 4:00 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.

Change 351691 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Changes needed for upgrading to Druid 0.10

https://gerrit.wikimedia.org/r/351691

Change 351691 merged by Ottomata:
[operations/puppet@production] Changes needed for upgrading to Druid 0.10

https://gerrit.wikimedia.org/r/351691

Mentioned in SAL (#wikimedia-operations) [2017-05-24T13:54:52Z] <elukey> upgrade Druid daemons on druid100[123] to 0.10 - T164008

Change 355430 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/druid@master] Release 0.10.0-2

https://gerrit.wikimedia.org/r/355430

Change 355469 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade

https://gerrit.wikimedia.org/r/355469

Change 355471 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Changes needed for upgrading to Druid 0.10

https://gerrit.wikimedia.org/r/355471

We did this upgrade today, but ended up having to roll back to 0.9.0. Druid 0.10 requires Java 8, which is fine. But the Analytics Hadoop cluster runs Java 7, and Hadoop indexing tasks were failing. See: https://groups.google.com/forum/#!topic/druid-user/aTGQlnF1KLk

We have to upgrade Hadoop to Java 8 before we can upgrade Druid to 0.10.

When we do finally try again with this, we'll need to merge https://gerrit.wikimedia.org/r/355469 and https://gerrit.wikimedia.org/r/355471.

Change 355430 merged by Ottomata:
[operations/debs/druid@master] Release 0.10.0-2

https://gerrit.wikimedia.org/r/355430

Oof, Hadoop druid loading jobs are still failing, even after rolling back:

2017-05-24T20:59:07,294 ERROR [main] org.apache.hadoop.mapred.YarnChild - Error running child : java.lang.UnsupportedClassVersionError: io/druid/storage/hdfs/HdfsStorageDruidModule : Unsupported major.minor version 52.0
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:442)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:64)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:354)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:348)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:347)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:312)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:278)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:363)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
	at io.druid.initialization.Initialization.getFromExtensions(Initialization.java:133)
	at io.druid.initialization.Initialization.makeInjectorWithModules(Initialization.java:320)
	at io.druid.indexer.HadoopDruidIndexerConfig.<clinit>(HadoopDruidIndexerConfig.java:98)
	at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.setup(IndexGeneratorJob.java:529)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
	at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:421)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
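
For context on the error above: "Unsupported major.minor version 52.0" means the class file was compiled for Java 8 (class-file major version 52), while a Java 7 JVM accepts at most major version 51. The following is a minimal sketch, assuming Node.js is available, of how the version can be read from the first 8 bytes of a .class file. The function name `classFileVersion` is hypothetical and not part of any Druid or Hadoop tooling.

```javascript
// Sketch: read the class-file version from the first 8 bytes of a .class file.
// Layout (JVM spec): u4 magic 0xCAFEBABE, u2 minor_version, u2 major_version.
function classFileVersion(buf) {
  if (buf.length < 8 || buf.readUInt32BE(0) !== 0xCAFEBABE) {
    throw new Error('not a Java class file');
  }
  const minor = buf.readUInt16BE(4);
  const major = buf.readUInt16BE(6);
  // For Java 5 and later (major >= 49), the Java release is major - 44.
  return { major, minor, java: major - 44 };
}

// Header bytes of a class compiled for Java 8 (major 52, minor 0).
const header = Buffer.from([0xca, 0xfe, 0xba, 0xbe, 0x00, 0x00, 0x00, 0x34]);
console.log(classFileVersion(header)); // { major: 52, minor: 0, java: 8 }
```

In practice the buffer would come from a class extracted from the jar that the failing task loaded, which shows whether a 0.10 (Java 8) jar is still on the classpath.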

I've suspended all druid hadoop loading oozie coordinators for now.

What's weird is, @JAllemandou and I actually saw a job complete, even though some of its mappers failed in the above way. Something is not right. It looks like there is some java version and/or druid jar version problem in some places but not others. I don't know how anything I did today would have affected stuff deployed to Hadoop nodes.

Will have to look more into this tomorrow. :/

This was

Oof, Hadoop druid loading jobs are still failing, even after rolling back:

Problem found: There were broken links to version 0.10.0 of the HDFS extension on every druid machine.
Removing them solved the problem of unpredictable failures.
Jobs restarted.
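
A check for leftover dangling links like these could be sketched as follows, assuming Node.js is available on the host. The function name `findBrokenLinks` and the directory passed to it are hypothetical; the actual extension directory is not stated in this task.

```javascript
const fs = require('fs');
const path = require('path');

// List symlinks directly inside `dir` whose target no longer exists.
function findBrokenLinks(dir) {
  return fs.readdirSync(dir)
    .map((name) => path.join(dir, name))
    // lstatSync inspects the link itself; existsSync follows it to the target.
    .filter((p) => fs.lstatSync(p).isSymbolicLink() && !fs.existsSync(p));
}

// Example (placeholder path): console.log(findBrokenLinks('/path/to/extension/dir'));
```

The sketch only looks at one directory level, which is enough for a flat directory of jar links; a recursive walk would be needed for nested layouts.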

Totalllly weeiiiiiird! Thanks for finding that!

Ottomata moved this task from Incoming to Blocked on the Analytics board.

Change 355469 merged by Ottomata:
[operations/puppet@production] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade

https://gerrit.wikimedia.org/r/355469

Nuria renamed this task from Update druid to latest release to Update druid to latest release (0.10). Dec 19 2017, 10:34 PM
Milimetric renamed this task from Update druid to latest release (0.10) to Update druid to latest release (0.11). Mar 1 2018, 6:14 PM
Milimetric moved this task from Blocked to Operational Excellence Future on the Analytics board.
mforns raised the priority of this task from High to Needs Triage. Apr 16 2018, 4:25 PM
mforns triaged this task as High priority.

Resuming work on this now that the Hadoop cluster has been migrated to Java 8. The latest stable release is currently 0.12, while we are running 0.9.2. The previous attempt was targeting 0.10.

Some thoughts from Luca the paranoid opsen, reading release notes for 0.11 and 0.12:

0.12: https://github.com/druid-io/druid/releases/tag/druid-0.12.0

Rollback restrictions
Please note that after upgrading to 0.12.0, it is no longer possible to downgrade to a version older than 0.11.0, due to changes made in #4762. It is still possible to roll back to version 0.11.0.

Given what happened last time, we might need to roll back for reasons we don't foresee if the new version turns out to be unstable, and rolling back to a version that we haven't battle-tested in production is a no-go for me, but of course we can discuss it :)

0.11: https://github.com/druid-io/druid/releases/tag/druid-0.11.0

Upgrading coordinators and overlords
The following patch changes the way coordinator->overlord redirects are handled:
#5037

The overlord leader election algorithm has changed in 0.11.0: #4699.

As a result of the two patches above, special care is needed when upgrading Coordinator or Overlord to 0.11.0. All coordinators and overlords must be shut down and upgraded together.

For example, to upgrade Coordinators, you would shutdown all coordinators, upgrade them to 0.11.0 and then start them. Overlords should be upgraded in a similar way.

During the upgrade process, there must not be any time period where a non-0.11.0 coordinator or overlord is running simultaneously with an 0.11.0 coordinator or overlord.

Note that at least one overlord should be brought up as quickly as possible after shutting them all down so that peons, tranquility etc continue to work after some retries.

This is slightly less problematic since rolling back is still possible, but it might be better to test the upgrade in labs first, since I feel that getting the procedure right is more subtle than what is written above. No veto this time :)

0.10: https://github.com/druid-io/druid/releases/tag/druid-0.10.0

This one seems to require only a simple rolling upgrade, as described in http://druid.io/docs/0.10.0/operations/rolling-updates.html. Pros: @Ottomata already prepared the Debian package and the puppet patches, and it seems that this version was already tested in labs.

So, given that we are now running two important clusters, I'd prefer to upgrade to 0.10 first and unblock our use cases immediately. Then I could start working on 0.11 straight away, and then on 0.12 if needed.

K sounds good. I'd go for 0.11 (after labs testing), but if you prefer to do 0.10 first, that sounds fine too.

Change 427657 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/druid@debian] Release 0.10.1-3~jessie

https://gerrit.wikimedia.org/r/427657

After a chat with the team we decided to proceed with Druid 0.10 for the moment, since we have basically everything that we need ready to go.

I added a couple of minor improvements (hopefully) to the Debian package for 0.10 (https://gerrit.wikimedia.org/r/427657), and the package builds fine on boron. There was already one host in the analytics labs project running the druid analytics worker profile; I added two more:

  • d-1.analytics.eqiad.wmflabs
  • d-2.analytics.eqiad.wmflabs
  • d-3.analytics.eqiad.wmflabs

I had a chat with @JAllemandou and we decided to run some tests in labs for basic indexation and real time ingestion before rolling out the new version.

Change 427657 merged by Elukey:
[operations/debs/druid@debian] Release 0.10.0-3~jessie

https://gerrit.wikimedia.org/r/427657

Let's make sure to test whether pivot works with this release

First step of testing confirmed on labs with druid 0.9.2:

  • Indexation from hadoop
  • Realtime indexation with tranquility

The only issue we should fix before upgrading is similar to the one I had on hadoop-coordinator-1: no space left on device, but this time on the workers (hadoop-worker-[1|2|3]).

Apart from that I think we're ready to upgrade and test :)

First step of testing confirmed on labs with druid 0.9.2:

  • Indexation from hadoop
  • Realtime indexation with tranquility

Great!

The only issue we should fix before upgrading is similar to the one I had on hadoop-coordinator-1: no space left on device, but this time on the workers (hadoop-worker-[1|2|3]).

So now /var/lib/hadoop on every hadoop-worker is held on a 60G partition. Weirdly the hdfs df shows a lot more than expected:

Filesystem                       Size    Used  Available  Use%
hdfs://analytics-hadoop-labs  353.6 G  15.3 G    296.3 G    4%

But I think we observed this issue in labs already right?

Upgraded d[1-3] in labs to druid 0.10, adding manual hiera config as replacement for https://gerrit.wikimedia.org/r/#/c/355471.

Also added prometheus monitoring, since the druid agent has not been tested yet with druid 0.10.

After memory tricks from @elukey , both hadoop indexation and realtime indexation went fine (without any change - Incredible).
Let's plan on an update next week for the druid-analytics cluster.

Pivot deployed on d-1, usable via:

ssh -L 9090:localhost:9090 d-1.analytics.eqiad.wmflabs -N

Out of the box I am seeing these in the logs:

Apr 23 08:10:41 d-1 pivot[23961]: Scanning cluster 'druid' for new sources
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' has never seen 'webrequest' and will introspect 'webrequest'
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' making external for 'webrequest_live'
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' encountered and error during SourceListRefresh: only druid versions >= 0.8.0 are supported
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' could not introspect 'webrequest' because: only druid versions >= 0.8.0 are supported

The JS master @fdans and I checked Pivot's code this morning, and after a lot of tests we identified what raises the error:

node_modules/plywood/build/plywood.js

3962         this._ensureMinVersion("0.8.0");

Commenting out that line makes Pivot work again. The function is this one:

3148     External.prototype._ensureMinVersion = function (minVersion) {
3149         if (this.version && External.versionLessThan(this.version, minVersion)) {
3150             throw new Error("only " + this.engine + " versions >= " + minVersion + " are supported. You are using " + this.version);
3151         }
3152     };

That leads to:

2877     External.versionLessThan = function (va, vb) {
2878         var pa = va.split('-')[0].split('.');
2879         var pb = vb.split('-')[0].split('.');
2880         if (pa[0] !== pb[0])
2881             return pa[0] < pb[0];
2882         if (pa[1] !== pb[1])
2883             return pa[1] < pb[1];
2884         return pa[2] < pb[2];
2885     };

We added a log at line 3150 to emit this.version, which is correctly set to 0.10.0. versionLessThan seems to be called with the wrong order of parameters, since at some point line 2883 is reached (10 < 8), which returns false of course. Flipping the arguments of versionLessThan might be an option, but another question would be: how does it work now with 0.9.2? Shouldn't it hit the same issue?

Anyhow, I'd propose to just comment out line 3962 in node_modules/plywood/build/plywood.js and be done with it :)

+1 for commenting the global check :)

console.log("10" < "8");
console.log(parseInt("10",10) < parseInt("8",10));

console.log("9" < "8");
console.log(parseInt("9",10) < parseInt("8",10));

true
false
false
false
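
The output above shows why the check trips on 0.10.0 but not on 0.9.2: versionLessThan compares the split components as strings, so "10" < "8" is true lexicographically while "9" < "8" is false. A minimal sketch of a numeric comparison follows. It only illustrates the idea and is not the patch that was merged for Pivot; the name `versionLessThanNumeric` is hypothetical.

```javascript
// Hypothetical numeric variant of External.versionLessThan (illustration only).
function versionLessThanNumeric(va, vb) {
  // Drop any "-suffix" and parse each dotted component as a base-10 integer.
  const pa = va.split('-')[0].split('.').map((s) => parseInt(s, 10));
  const pb = vb.split('-')[0].split('.').map((s) => parseInt(s, 10));
  for (let i = 0; i < 3; i++) {
    const a = pa[i] || 0; // missing or unparsable components count as 0
    const b = pb[i] || 0;
    if (a !== b) return a < b;
  }
  return false; // equal versions are not "less than"
}

console.log(versionLessThanNumeric('0.10.0', '0.8.0')); // false
console.log(versionLessThanNumeric('0.9.2', '0.8.0'));  // false
console.log(versionLessThanNumeric('0.7.3', '0.8.0'));  // true
```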

Change 428331 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/pivot/deploy@master] Comment Druid version check not compatible with 0.10.0+

https://gerrit.wikimedia.org/r/428331

elukey renamed this task from Update druid to latest release (0.11) to Update druid to 0.10. Apr 23 2018, 1:03 PM
elukey claimed this task.
elukey added a project: Analytics-Kanban.
elukey moved this task from Paused to In Code Review on the Analytics-Kanban board.

Change 428331 merged by Milimetric:
[analytics/pivot/deploy@master] Fix Druid version check not compatible with 0.10.0+

https://gerrit.wikimedia.org/r/428331

Mentioned in SAL (#wikimedia-operations) [2018-04-23T19:34:13Z] <elukey@tin> Started deploy [analytics/pivot/deploy@cb9ddee]: Fix 0.10.0 compatibility - T164008

Mentioned in SAL (#wikimedia-operations) [2018-04-23T19:34:29Z] <elukey@tin> Finished deploy [analytics/pivot/deploy@cb9ddee]: Fix 0.10.0 compatibility - T164008 (duration: 00m 17s)

Mentioned in SAL (#wikimedia-operations) [2018-04-24T08:14:02Z] <elukey> upload druid_0.10.0-3~jessie1 (collection of druid packages) to jessie-wikimedia - T164008

Change 355471 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: upgrade Druid to 0.10

https://gerrit.wikimedia.org/r/355471

Change 430296 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: prep work before upgrade to Druid 0.10

https://gerrit.wikimedia.org/r/430296

Change 430298 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: upgrade zookeeper to 3.4.9

https://gerrit.wikimedia.org/r/430298

Mentioned in SAL (#wikimedia-operations) [2018-05-02T07:31:48Z] <elukey> upgrade zookeeper on druid100[1-3] to 3.4.9 - T164008

Change 430298 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: upgrade zookeeper to 3.4.9

https://gerrit.wikimedia.org/r/430298

Change 430296 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: prep work before upgrade to Druid 0.10

https://gerrit.wikimedia.org/r/430296

Mentioned in SAL (#wikimedia-operations) [2018-05-02T08:11:25Z] <elukey> upgrading Druid to 0.10 on druid100[4-6] (wikistats 2 backend) - T164008

Change 430312 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: add druid alerts for available segments

https://gerrit.wikimedia.org/r/430312

Change 430312 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: add druid alerts for available segments

https://gerrit.wikimedia.org/r/430312

Change 430318 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: enable new Druid SQL feature

https://gerrit.wikimedia.org/r/430318

Change 430318 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: enable new Druid SQL feature

https://gerrit.wikimedia.org/r/430318

Change 430362 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: upgrade zookeeper to 3.4.9

https://gerrit.wikimedia.org/r/430362

Change 430362 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: upgrade zookeeper to 3.4.9

https://gerrit.wikimedia.org/r/430362

Mentioned in SAL (#wikimedia-operations) [2018-05-02T13:29:03Z] <elukey> upgrade zookeeper to 3.4.9 on druid100[4-6] (wikistats 2 backend) - T164008