## Details
Status | Subtype | Assigned | Task
---|---|---|---
Duplicate | | Ottomata | T157977 Upgrade druid
Declined | | Ottomata | T164007 Update pivot to latest source
Resolved | | elukey | T164008 Update druid to 0.10
Resolved | | elukey | T166248 Upgrade Analytics Cluster to Java 8
## Event Timeline
Some extra work is needed to debianize the new SQL Apache Calcite server. Maybe also puppet changes for the overlord/coordinator setup.
Change 351691 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Changes needed for upgrading to Druid 0.10
Change 351691 merged by Ottomata:
[operations/puppet@production] Changes needed for upgrading to Druid 0.10
Mentioned in SAL (#wikimedia-operations) [2017-05-24T13:54:52Z] <elukey> upgrade Druid daemons on druid100[123] to 0.10 - T164008
Change 355430 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/druid@master] Release 0.10.0-2
Change 355469 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade
Change 355471 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Changes needed for upgrading to Druid 0.10
We did this upgrade today, but ended up having to roll back to 0.9.0. Druid 0.10 requires Java 8, which is fine, but the Analytics Hadoop cluster runs Java 7, and Hadoop indexing tasks were failing. See: https://groups.google.com/forum/#!topic/druid-user/aTGQlnF1KLk
We have to upgrade Hadoop to Java 8 before we can upgrade Druid to 0.10.
When we do finally try again with this, we'll need to merge https://gerrit.wikimedia.org/r/355469 and https://gerrit.wikimedia.org/r/355471.
Oof, Hadoop druid loading jobs are still failing, even after rolling back:
```
2017-05-24T20:59:07,294 ERROR [main] org.apache.hadoop.mapred.YarnChild - Error running child : java.lang.UnsupportedClassVersionError: io/druid/storage/hdfs/HdfsStorageDruidModule : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:442)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:64)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:354)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:347)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:312)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:278)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:363)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
    at io.druid.initialization.Initialization.getFromExtensions(Initialization.java:133)
    at io.druid.initialization.Initialization.makeInjectorWithModules(Initialization.java:320)
    at io.druid.indexer.HadoopDruidIndexerConfig.<clinit>(HadoopDruidIndexerConfig.java:98)
    at io.druid.indexer.IndexGeneratorJob$IndexGeneratorReducer.setup(IndexGeneratorJob.java:529)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:421)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
```
I've suspended all druid hadoop loading oozie coordinators for now.
What's weird is that @JAllemandou and I actually saw a job complete, even though some of its mappers failed in the above way. Something is not right; it looks like there is some Java version and/or Druid jar version problem in some places but not others. I don't know how anything I did today would have affected stuff deployed to the Hadoop nodes.
Will have to look more into this tomorrow. :/
Problem found: there were broken links to version 0.10.0 of the hdfs extension on every druid machine. Removing them solved the problem of unpredictable failures. Jobs restarted.
Change 355469 merged by Ottomata:
[operations/puppet@production] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade
Restarting work on this now that the Hadoop cluster has been migrated to Java 8. The latest stable release is currently 0.12, while we are running 0.9.2. The previous attempt targeted 0.10.
Some thoughts from Luca the paranoid opsen, reading release notes for 0.11 and 0.12:
0.12: https://github.com/druid-io/druid/releases/tag/druid-0.12.0
> Rollback restrictions: Please note that after upgrading to 0.12.0, it is no longer possible to downgrade to a version older than 0.11.0, due to changes made in #4762. It is still possible to roll back to version 0.11.0.
Given what happened last time, there could be many unforeseen reasons to roll back if the new version turns out to be unstable, and rolling back to a version that we haven't battle tested in production is a no go for me, but of course we can discuss it :)
0.11: https://github.com/druid-io/druid/releases/tag/druid-0.11.0
> Upgrading coordinators and overlords: The following patch changes the way coordinator->overlord redirects are handled: #5037. The overlord leader election algorithm has changed in 0.11.0: #4699. As a result of the two patches above, special care is needed when upgrading Coordinators or Overlords to 0.11.0. All coordinators and overlords must be shut down and upgraded together. For example, to upgrade Coordinators, you would shut down all coordinators, upgrade them to 0.11.0 and then start them. Overlords should be upgraded in a similar way. During the upgrade process, there must not be any time period where a non-0.11.0 coordinator or overlord is running simultaneously with a 0.11.0 coordinator or overlord. Note that at least one overlord should be brought up as quickly as possible after shutting them all down so that peons, tranquility etc. continue to work after some retries.
This is slightly less problematic since rolling back is still possible, but it might be better to test the upgrade in labs first, since I feel that getting the procedure right is more subtle than what is written above. No veto this time :)
0.10: https://github.com/druid-io/druid/releases/tag/druid-0.10.0
This one seems to require a simple rolling upgrade as described in http://druid.io/docs/0.10.0/operations/rolling-updates.html. Pros: @Ottomata already prepared the Debian package and the puppet patches, and it seems that this version was already tested in labs.
So, given that we are now running two important clusters, I'd prefer to work on upgrading to 0.10 and immediately unblock our use cases. Then I could start straight away on 0.11, and then 0.12 if needed.
K, sounds good. I'd go for 0.11 (after labs testing), but if you prefer to do 0.10 first, that sounds fine too.
Change 427657 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/debs/druid@debian] Release 0.10.1-3~jessie
After a chat with the team we decided to proceed with Druid 0.10 for the moment, since we have basically everything that we need ready to go.
I added a couple of small/minor improvements (hopefully) to the Debian package for 0.10 (https://gerrit.wikimedia.org/r/427657); the package builds fine on boron. There was already one host in the analytics labs project running the druid analytics worker profile, and I added two more:
- d-1.analytics.eqiad.wmflabs
- d-2.analytics.eqiad.wmflabs
- d-3.analytics.eqiad.wmflabs
I had a chat with @JAllemandou and we decided to run some tests in labs for basic indexation and real time ingestion before rolling out the new version.
Change 427657 merged by Elukey:
[operations/debs/druid@debian] Release 0.10.0-3~jessie
First step of testing confirmed on labs with druid 0.9.2:
- Indexation from hadoop
- Realtime indexation with tranquility
The only issue we should fix before upgrading is similar to the one I had on hadoop-coordinator-1: no space left on device, but for the worker hosts (hadoop-worker-[1|2|3]).
Apart from that, I think we're ready to upgrade and test :)
Great!
> The only issue we should fix before upgrading is similar to the one I had on hadoop-coordinator-1: no space left on device, but for the worker hosts (hadoop-worker-[1|2|3]).
So now /var/lib/hadoop on every hadoop-worker is held on a 60G partition. Weirdly, the HDFS df output shows a lot more than expected:
```
Filesystem                    Size     Used    Available  Use%
hdfs://analytics-hadoop-labs  353.6 G  15.3 G  296.3 G    4%
```
But I think we observed this issue in labs already, right?
Upgraded d-[1-3] in labs to Druid 0.10, adding manual hiera config as a replacement for https://gerrit.wikimedia.org/r/#/c/355471.
Also added prometheus monitoring, since the druid agent had not yet been tested with Druid 0.10.
After memory tricks from @elukey , both hadoop indexation and realtime indexation went fine (without any change - Incredible).
Let's plan on an update next week for the druid-analytics cluster.
Pivot deployed on d-1, usable via:
```
ssh -L 9090:localhost:9090 d-1.analytics.eqiad.wmflabs -N
```
Out of the box I am seeing these in the logs:
```
Apr 23 08:10:41 d-1 pivot[23961]: Scanning cluster 'druid' for new sources
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' has never seen 'webrequest' and will introspect 'webrequest'
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' making external for 'webrequest_live'
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' encountered and error during SourceListRefresh: only druid versions >= 0.8.0 are supported
Apr 23 08:10:41 d-1 pivot[23961]: Cluster 'druid' could not introspect 'webrequest' because: only druid versions >= 0.8.0 are supported
```
The JS master @fdans and I checked Pivot's code this morning, and after a lot of tests we identified what triggers the error:
In node_modules/plywood/build/plywood.js:

```
3962     this._ensureMinVersion("0.8.0");
```
Commenting out that line makes Pivot work again. The function is this one:
```
3148 External.prototype._ensureMinVersion = function (minVersion) {
3149     if (this.version && External.versionLessThan(this.version, minVersion)) {
3150         throw new Error("only " + this.engine + " versions >= " + minVersion + " are supported. You are using " + this.version);
3151     }
3152 };
```
That leads to:
```
2877 External.versionLessThan = function (va, vb) {
2878     var pa = va.split('-')[0].split('.');
2879     var pb = vb.split('-')[0].split('.');
2880     if (pa[0] !== pb[0])
2881         return pa[0] < pb[0];
2882     if (pa[1] !== pb[1])
2883         return pa[1] < pb[1];
2884     return pa[2] < pb[2];
2885 };
We added a log at line 3150 to emit this.version, which is correctly set to 0.10.0. At first versionLessThan seemed to be called with its parameters in the wrong order, since line 2883 is reached and returns true even though numerically 10 < 8 is false. But flipping the arguments raises another question: how does it work now with 0.9.2? Shouldn't it hit the same issue?
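To double-check the behaviour in isolation, the function can be copied standalone (a test copy of the plywood code above, not the deployed file) and called the way `_ensureMinVersion("0.8.0")` calls it:

```javascript
// Standalone copy of plywood's External.versionLessThan (lines 2877-2885 above),
// extracted here only for testing.
function versionLessThan(va, vb) {
    var pa = va.split('-')[0].split('.');
    var pb = vb.split('-')[0].split('.');
    if (pa[0] !== pb[0])
        return pa[0] < pb[0];
    if (pa[1] !== pb[1])
        return pa[1] < pb[1];
    return pa[2] < pb[2];
}

// _ensureMinVersion("0.8.0") effectively calls versionLessThan(this.version, "0.8.0"):
console.log(versionLessThan("0.10.0", "0.8.0")); // true  -> 0.10.0 is wrongly rejected
console.log(versionLessThan("0.9.2", "0.8.0"));  // false -> 0.9.2 passes the check
```

So with this.version = "0.10.0" the check throws, while "0.9.2" sails through.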
Anyhow, I'd propose to just comment out line 3962 in node_modules/plywood/build/plywood.js and be done with it :)
```
console.log("10" < "8");                              // true
console.log(parseInt("10", 10) < parseInt("8", 10));  // false
console.log("9" < "8");                               // false
console.log(parseInt("9", 10) < parseInt("8", 10));   // false
```
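So the components are compared as strings, and lexicographically "10" < "8" is true, which is exactly why 0.10.0 trips the check while 0.9.2 does not. If patching the module were preferable to commenting out the check, a minimal sketch of a numeric comparison would look like this (an illustration only, not what we deployed; `versionLessThanNumeric` is a hypothetical name):

```javascript
// Sketch: parse each version component as a number before comparing,
// so "10" vs "8" compares as 10 > 8 instead of lexicographically.
function versionLessThanNumeric(va, vb) {
    var pa = va.split('-')[0].split('.').map(Number);
    var pb = vb.split('-')[0].split('.').map(Number);
    if (pa[0] !== pb[0])
        return pa[0] < pb[0];
    if (pa[1] !== pb[1])
        return pa[1] < pb[1];
    return pa[2] < pb[2];
}

console.log(versionLessThanNumeric("0.10.0", "0.8.0")); // false -> 0.10.0 now passes
console.log(versionLessThanNumeric("0.9.2", "0.10.0")); // true
```

Commenting the check out is still the smaller change for the deploy repo, though.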
Change 428331 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/pivot/deploy@master] Comment Druid version check not compatible with 0.10.0+
Change 428331 merged by Milimetric:
[analytics/pivot/deploy@master] Fix Druid version check not compatible with 0.10.0+
Mentioned in SAL (#wikimedia-operations) [2018-04-23T19:34:13Z] <elukey@tin> Started deploy [analytics/pivot/deploy@cb9ddee]: Fix 0.10.0 compatibility - T164008
Mentioned in SAL (#wikimedia-operations) [2018-04-23T19:34:29Z] <elukey@tin> Finished deploy [analytics/pivot/deploy@cb9ddee]: Fix 0.10.0 compatibility - T164008 (duration: 00m 17s)
Mentioned in SAL (#wikimedia-operations) [2018-04-24T08:14:02Z] <elukey> upload druid_0.10.0-3~jessie1 (collection of druid packages) to jessie-wikimedia - T164008
Change 355471 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: upgrade Druid to 0.10
Change 430296 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: prep work before upgrade to Druid 0.10
Change 430298 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: upgrade zookeeper to 3.4.9
Mentioned in SAL (#wikimedia-operations) [2018-05-02T07:31:48Z] <elukey> upgrade zookeeper on druid100[1-3] to 3.4.9 - T164008
Change 430298 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: upgrade zookeeper to 3.4.9
Change 430296 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: prep work before upgrade to Druid 0.10
Mentioned in SAL (#wikimedia-operations) [2018-05-02T08:11:25Z] <elukey> upgrading Druid to 0.10 on druid100[4-6] (wikistats 2 backend) - T164008
Change 430312 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: add druid alerts for available segments
Change 430312 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: add druid alerts for available segments
Change 430318 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::analytics::worker: enable new Druid SQL feature
Change 430318 merged by Elukey:
[operations/puppet@production] role::druid::analytics::worker: enable new Druid SQL feature
Change 430362 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: upgrade zookeeper to 3.4.9
Change 430362 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: upgrade zookeeper to 3.4.9
Mentioned in SAL (#wikimedia-operations) [2018-05-02T13:29:03Z] <elukey> upgrade zookeeper to 3.4.9 on druid100[4-6] (wikistats 2 backend) - T164008