Replace Camus by Gobblin
Closed, Resolved (Public)

Description

After having tested that Gobblin works with Kafka using TLS encryption (Yay!), we decided to move forward with this tool.
This task is the parent task for the various smaller bits we'll have to do.

Things to do:

We have ideas for improvements that Gobblin could help us with (webrequest stats, for instance), but we're going to limit the work to just replacing Camus for now.
We'll create new tasks for improvements as our knowledge of Gobblin levels up.

Details

Repo | Branch | Lines +/-
analytics/refinery/source | master | +20 -21
analytics/refinery/source | master | +5 -1 K
analytics/refinery | master | +1 -129
operations/puppet | production | +0 -52
operations/puppet | production | +2 -385
analytics/refinery | master | +1 -4
operations/puppet | production | +0 -161
operations/puppet | production | +3 -0
operations/puppet | production | +5 -3
operations/puppet | production | +4 -4
analytics/refinery/source | master | +7 -1
operations/puppet | production | +28 -12
operations/puppet | production | +3 -1
operations/puppet | production | +1 -0
operations/puppet | production | +55 -46
operations/puppet | production | +7 -106
analytics/refinery | master | +22 -0
operations/puppet | production | +1 -1
operations/puppet | production | +27 -23
analytics/refinery | master | +1 -1
analytics/refinery | master | +1 -1
analytics/refinery | master | +7 -3
analytics/refinery | master | +26 -0
operations/puppet | production | +9 -1
operations/puppet | production | +2 -166
operations/puppet | production | +1 -1
operations/puppet | production | +32 -17
analytics/refinery | master | +1 -3
operations/puppet | production | +7 -3
operations/puppet | production | +1 -1
operations/puppet | production | +34 -20
analytics/refinery | master | +12 -11
analytics/refinery | master | +2 -5
operations/puppet | production | +5 -1
analytics/refinery | master | +27 -0
analytics/refinery | master | +8 -2
analytics/refinery | master | +5 -3
operations/puppet | production | +3 -1
analytics/refinery | master | +14 -17
analytics/refinery | master | +22 -0
operations/puppet | production | +16 -2
operations/puppet | production | +19 -14
operations/puppet | production | +15 -32
analytics/refinery | master | +1 -1
analytics/refinery/source | master | +57 -14
operations/puppet | production | +3 -3
operations/puppet | production | +2 -17
analytics/refinery | master | +1 -1
analytics/refinery | master | +5 -0
operations/puppet | production | +28 -0
analytics/refinery | master | +34 -0
analytics/refinery | master | +6 -6
operations/puppet | production | +2 -0
analytics/refinery | master | +9 -13
analytics/refinery | master | +1 -1
operations/puppet | production | +6 -2
operations/puppet | production | +1 -1
operations/puppet | production | +92 -0
analytics/refinery | master | +17 -0
analytics/refinery | master | +1 -0
analytics/refinery | master | +240 -0
analytics/refinery | master | +1 -0
analytics/gobblin | wmf | +44 -53
wikimedia-event-utilities | master | +147 -45

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 703857 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove already absented camus jobs

https://gerrit.wikimedia.org/r/703857

Change 703857 merged by Ottomata:

[operations/puppet@production] Remove already absented camus jobs

https://gerrit.wikimedia.org/r/703857

Change 703866 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add event_default gobblin job

https://gerrit.wikimedia.org/r/703866

Change 703867 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add gobblin job event_default

https://gerrit.wikimedia.org/r/703867

Change 703869 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Finalze gobblin event migration

https://gerrit.wikimedia.org/r/703869

Change 703866 merged by Ottomata:

[analytics/refinery@master] Add event_default gobblin job

https://gerrit.wikimedia.org/r/703866

Mentioned in SAL (#wikimedia-operations) [2021-07-12T13:32:55Z] <otto@deploy1002> Started deploy [analytics/refinery@1cb9e12]: Add event_default gobblin job - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-12T13:36:32Z] <otto@deploy1002> Finished deploy [analytics/refinery@1cb9e12]: Add event_default gobblin job - T271232 (duration: 03m 37s)

Change 703867 merged by Ottomata:

[operations/puppet@production] Add gobblin job event_default

https://gerrit.wikimedia.org/r/703867

Change 704117 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Set number of max mappers for gobblin event_default to 128

https://gerrit.wikimedia.org/r/704117

Change 704117 merged by Ottomata:

[analytics/refinery@master] Set number of max mappers for gobblin event_default to 128

https://gerrit.wikimedia.org/r/704117

Mentioned in SAL (#wikimedia-operations) [2021-07-12T13:49:28Z] <otto@deploy1002> Started deploy [analytics/refinery@0149c81]: Set event_default gobblin job max mappers=128 - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-12T13:52:44Z] <otto@deploy1002> Finished deploy [analytics/refinery@0149c81]: Set event_default gobblin job max mappers=128 - T271232 (duration: 03m 16s)

Change 704118 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] gobbin event_default - Fix typo

https://gerrit.wikimedia.org/r/704118

Change 704118 merged by Ottomata:

[analytics/refinery@master] gobbin event_default - Fix typo

https://gerrit.wikimedia.org/r/704118

Mentioned in SAL (#wikimedia-operations) [2021-07-12T13:56:01Z] <otto@deploy1002> Started deploy [analytics/refinery@dd65f38]: event_default gobblin job - fix typo - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-12T13:59:31Z] <otto@deploy1002> Finished deploy [analytics/refinery@dd65f38]: event_default gobblin job - fix typo - T271232 (duration: 03m 30s)

Change 704152 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Finalize gobblin migration for event_default job

https://gerrit.wikimedia.org/r/704152

Change 704152 merged by Ottomata:

[analytics/refinery@master] Finalize gobblin migration for event_default job

https://gerrit.wikimedia.org/r/704152

Mentioned in SAL (#wikimedia-operations) [2021-07-12T18:37:27Z] <otto@deploy1002> Started deploy [analytics/refinery@200b502]: Finalize event_default gobblin job - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-12T18:41:06Z] <otto@deploy1002> Finished deploy [analytics/refinery@200b502]: Finalize event_default gobblin job - T271232 (duration: 03m 39s)

Change 703869 merged by Ottomata:

[operations/puppet@production] Finalze gobblin event migration

https://gerrit.wikimedia.org/r/703869

Change 704154 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Refine event - fix input_path_regex_capture_groups param

https://gerrit.wikimedia.org/r/704154

Change 704154 merged by Ottomata:

[operations/puppet@production] Refine event - fix input_path_regex_capture_groups param

https://gerrit.wikimedia.org/r/704154

Change 704157 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add gobblin job eventlogging_legacy

https://gerrit.wikimedia.org/r/704157

Change 704159 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add gobbln job eventlogging_legacy

https://gerrit.wikimedia.org/r/704159

Change 704161 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Finalize eventlogging_legacy gobblin job migration

https://gerrit.wikimedia.org/r/704161

Change 704157 merged by Ottomata:

[analytics/refinery@master] Add gobblin job eventlogging_legacy

https://gerrit.wikimedia.org/r/704157

Mentioned in SAL (#wikimedia-operations) [2021-07-13T13:53:35Z] <otto@deploy1002> Started deploy [analytics/refinery@a3bc8bc]: Add eventlogging_legacy gobblin job - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-13T13:57:04Z] <otto@deploy1002> Finished deploy [analytics/refinery@a3bc8bc]: Add eventlogging_legacy gobblin job - T271232 (duration: 03m 28s)

Change 704159 merged by Ottomata:

[operations/puppet@production] Add gobbln job eventlogging_legacy

https://gerrit.wikimedia.org/r/704159

Change 704161 merged by Ottomata:

[operations/puppet@production] Finalize eventlogging_legacy gobblin job migration

https://gerrit.wikimedia.org/r/704161

Change 704412 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure camus eventlogging job is absent

https://gerrit.wikimedia.org/r/704412

Change 704412 merged by Ottomata:

[operations/puppet@production] Ensure camus eventlogging job is absent

https://gerrit.wikimedia.org/r/704412

Change 704541 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Bump Refine spark_executor_memory to 8G

https://gerrit.wikimedia.org/r/704541

Change 704541 merged by Ottomata:

[operations/puppet@production] Bump Refine spark_executor_memory to 8G

https://gerrit.wikimedia.org/r/704541

@JAllemandou These are the OOMs in Refine we are getting, def looks gzip related:

Container: container_e18_1623774792907_150589_01_000010 on an-worker1132.eqiad.wmnet_8041_1626272654052
LogAggregationType: AGGREGATED
=======================================================================================================
LogType:stderr
LogLastModifiedTime:Wed Jul 14 14:24:14 +0000 2021
LogLength:30698
LogContents:

[...]

21/07/14 13:51:44 INFO ZlibFactory: Successfully loaded & initialized native-zlib library
21/07/14 13:52:40 INFO UnsafeExternalSorter: Thread 120 spilling sort data of 1472.0 MB to disk (0  time so far)
org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
	at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:121)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
	at scala.concurrent.impl.Promise$.resolver(Promise.scala:59)
	at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:51)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
	at scala.concurrent.Promise$class.complete(Promise.scala:55)
	at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
	at scala.concurrent.Promise$class.failure(Promise.scala:104)
	at scala.concurrent.impl.Promise$DefaultPromise.failure(Promise.scala:157)
	at org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:110)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:171)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:186)
	... 1 more
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.network.BlockTransferService$$anon$1.onBlockFetchSuccess(BlockTransferService.scala:111)
	at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:171)
	at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
	at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
	at org.apache.spark.network.crypto.TransportCipher$DecryptionHandler.channelRead(TransportCipher.java:186)
21/07/14 13:53:57 INFO ShutdownHookManager: Shutdown hook called

End of LogType:stderr

This is even after bumping executor memory to 8G. The jobs can be re-run. Is it possible that tasks scheduled on the same executor don't free up memory properly between runs? Or maybe too many tasks are running in parallel in the same executor?
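
The "too many tasks in parallel" hypothesis can be sized up with some rough arithmetic. The sketch below is not taken from the Refine job. It assumes Spark's unified memory model with its documented defaults (300 MB reserved, spark.memory.fraction = 0.6), and the cores-per-executor values are hypothetical because this task does not state them.

```python
# Rough estimate of how much unified (execution + storage) memory each
# concurrently running task can expect inside one Spark executor.
# Assumptions (not from this task): Spark 2.x defaults, hypothetical core counts.

RESERVED_MB = 300        # fixed reserved memory in Spark's unified memory model
MEMORY_FRACTION = 0.6    # default value of spark.memory.fraction in Spark 2.x


def unified_memory_per_task_mb(executor_memory_mb, executor_cores,
                               memory_fraction=MEMORY_FRACTION):
    """Return the approximate per-task share of unified memory, in MB."""
    if executor_cores < 1:
        raise ValueError("executor_cores must be >= 1")
    usable_mb = (executor_memory_mb - RESERVED_MB) * memory_fraction
    return usable_mb / executor_cores


if __name__ == "__main__":
    # 8G matches the spark_executor_memory bump; core counts are made up.
    for cores in (1, 2, 4, 8):
        share = unified_memory_per_task_mb(8 * 1024, cores)
        print(f"{cores} cores -> ~{share:.0f} MB per task")
```

Under these assumptions, an 8G executor running 4 tasks at once leaves each task roughly 1.2 GB of unified memory, which is less than the 1472 MB sort spill reported in the log. The failing allocation in the trace is an ordinary heap buffer (ByteBuffer.allocate), so this figure is only a rough indicator of pressure, not a diagnosis.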

Change 704563 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Tune Refine jobs in production hadoop

https://gerrit.wikimedia.org/r/704563

Change 704563 merged by Ottomata:

[operations/puppet@production] Tune Refine jobs in production hadoop

https://gerrit.wikimedia.org/r/704563

Change 704576 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Refine - explicitly uncache DataFrame when done

https://gerrit.wikimedia.org/r/704576

Change 704576 merged by Ottomata:

[analytics/refinery/source@master] Refine - explicitly uncache DataFrame when done

https://gerrit.wikimedia.org/r/704576

Change 704842 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Bump refine refinery-job version to 0.1.15

https://gerrit.wikimedia.org/r/704842

Mentioned in SAL (#wikimedia-analytics) [2021-07-15T16:44:50Z] <ottomata> deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232

Change 704842 merged by Ottomata:

[operations/puppet@production] Bump refine refinery-job version to 0.1.15

https://gerrit.wikimedia.org/r/704842

Change 705621 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Add 5 minutes offset to gobblin webrequest timer

https://gerrit.wikimedia.org/r/705621

Change 705621 merged by Elukey:

[operations/puppet@production] Add 5 minutes offset to gobblin webrequest timer

https://gerrit.wikimedia.org/r/705621

Change 706492 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure remaining camus jobs are absent

https://gerrit.wikimedia.org/r/706492

Change 706492 merged by Ottomata:

[operations/puppet@production] Ensure remaining camus jobs are absent

https://gerrit.wikimedia.org/r/706492

Change 706541 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove camus puppetization

https://gerrit.wikimedia.org/r/706541

Change 706541 merged by Ottomata:

[operations/puppet@production] Remove camus puppetization

https://gerrit.wikimedia.org/r/706541

Mentioned in SAL (#wikimedia-operations) [2021-07-22T18:38:11Z] <otto@deploy1002> Started deploy [analytics/refinery@1ef4fe1]: bin/gobbin wrapper now avoids launching if job is already running - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-22T18:41:29Z] <otto@deploy1002> Finished deploy [analytics/refinery@1ef4fe1]: bin/gobbin wrapper now avoids launching if job is already running - T271232 (duration: 03m 18s)
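
The idea behind this wrapper change (skip the launch when the same job is still running) can be sketched as follows. This is not the code from the refinery wrapper: the function names are hypothetical, and how the list of running application names is obtained is deliberately left out.

```python
# Minimal sketch of "do not launch if the job is already running".
# All names here are illustrative assumptions, not the real wrapper's API.

def is_job_running(job_name, running_app_names):
    """Return True if an application with exactly this name is running."""
    return any(name.strip() == job_name for name in running_app_names)


def launch_unless_running(job_name, running_app_names, launch):
    """Call launch(job_name) only if no instance is running.

    Returns True when the job was launched, False when it was skipped.
    """
    if is_job_running(job_name, running_app_names):
        return False
    launch(job_name)
    return True


if __name__ == "__main__":
    launched = []
    # Another job is running, so event_default is launched.
    print(launch_unless_running("event_default", ["webrequest"], launched.append))
    # event_default is already running, so the second launch is skipped.
    print(launch_unless_running("event_default", ["event_default"], launched.append))
    print(launched)
```

A check like this is not atomic: two launchers that check at the same moment can both decide to start the job, which is the gap a lock is meant to close.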

Change 706673 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Set gobblin job.lock.dir after all

https://gerrit.wikimedia.org/r/706673

Change 706673 merged by Ottomata:

[analytics/refinery@master] Set gobblin job.lock.dir after all

https://gerrit.wikimedia.org/r/706673

Mentioned in SAL (#wikimedia-operations) [2021-07-22T18:56:06Z] <otto@deploy1002> Started deploy [analytics/refinery@3115f9e]: Set gobblin job.lock.dir after all - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-22T18:59:28Z] <otto@deploy1002> Finished deploy [analytics/refinery@3115f9e]: Set gobblin job.lock.dir after all - T271232 (duration: 03m 22s)

Mentioned in SAL (#wikimedia-operations) [2021-07-23T13:31:52Z] <otto@deploy1002> Started deploy [analytics/refinery@15521b3]: Add property disabling gobblin lock - T271232

Mentioned in SAL (#wikimedia-operations) [2021-07-23T13:35:24Z] <otto@deploy1002> Finished deploy [analytics/refinery@15521b3]: Add property disabling gobblin lock - T271232 (duration: 03m 32s)

Change 708782 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove no longer used camus module and references to camus

https://gerrit.wikimedia.org/r/708782

Change 708782 merged by Ottomata:

[operations/puppet@production] Remove no longer used camus module and references to camus

https://gerrit.wikimedia.org/r/708782

Change 708785 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refinery/job - remove already absented jobs

https://gerrit.wikimedia.org/r/708785

Change 708785 merged by Ottomata:

[operations/puppet@production] refinery/job - remove already absented jobs

https://gerrit.wikimedia.org/r/708785

Removing raw data we have stopped importing:

sudo -u hdfs hdfs dfs -rm -R /wmf/data/raw/eventlogging_client_side
sudo -u hdfs hdfs dfs -rm -R /wmf/data/raw/mediawiki_job
sudo -u hdfs hdfs dfs -rm -R /wmf/data/raw/atskafka_test_webrequest_text

Removing old Camus job work and state dirs:

sudo -u hdfs hdfs dfs -rm -R /wmf/camus

Change 708786 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Remove refinery-camus module

https://gerrit.wikimedia.org/r/708786

Change 708787 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Refine - replace default formatters with gobblin convention

https://gerrit.wikimedia.org/r/708787

Change 708816 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Remove references to camus

https://gerrit.wikimedia.org/r/708816

Alright, I've gone through wikitech and updated relevant references to Camus.

There are 3 outstanding patches to refinery-source and refinery about removing Camus. Once those are merged and deployed, we can call this task done!

Change 708816 merged by Milimetric:

[analytics/refinery@master] Remove references to camus

https://gerrit.wikimedia.org/r/708816

Change 708786 merged by Milimetric:

[analytics/refinery/source@master] Remove refinery-camus module

https://gerrit.wikimedia.org/r/708786

Change 708787 merged by Ottomata:

[analytics/refinery/source@master] Refine - replace default formatters with gobblin convention

https://gerrit.wikimedia.org/r/708787