I don't yet fully understand what happened here, but on June 7th we switched kafka.message.timestamp.type from LogAppendTime to CreateTime in order to support event timestamps for EventBus messages. This worked fine at first, but somehow on June 14th, 7 days (our default Kafka topic retention period) after we made this change, the Kafka log roller seems to have truncated the previous full week's worth of logs, causing Camus to get upset. Camus had stored offsets that were now out of range of what was available in Kafka. Without kafka.move.to.earliest.offset set in camus.properties, Camus will abort instead of trying to figure out what to do. Errors like these were appearing in the Camus logs:
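For context, a sketch of the broker-side settings likely involved. Note this is an assumption about how our config name maps onto stock Kafka: the kafka.message.timestamp.type name above presumably corresponds to the broker default log.message.timestamp.type (or the per-topic message.timestamp.type override), and a 7-day retention period corresponds to Kafka's default log.retention.hours:

```properties
# Broker default (server.properties). With CreateTime, record timestamps
# come from the producer; with LogAppendTime, the broker stamps records
# at append time. Retention decisions are based on these timestamps.
log.message.timestamp.type=CreateTime

# Time-based retention: segments whose records are older than this are
# eligible for deletion, which is what removed the offsets Camus had stored.
log.retention.hours=168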
Please check whether kafka cluster configuration is correct. You can also specify config parameter: kafka.move.to.earliest.offset to start processing from earliest kafka metadata offset.
18/06/14 16:05:10 ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset, eventlogging_SaveTiming uri:tcp://kafka-jumbo1005.eqiad.wmnet:9092 leader:1005 partition:0 earliest_offset:15032211 offset:15025121 latest_offset:15035066 avg_msg_size:568 estimated_size:5648760
Topic eventlogging_SaveTiming will be skipped.
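The check that triggers this error can be sketched as follows. This is a hypothetical reimplementation for illustration, not Camus source: a persisted offset is usable only if it still falls inside the range Kafka currently has on disk.

```python
def offset_out_of_range(persisted_offset, earliest_offset, latest_offset):
    """True when a previously persisted offset no longer exists in Kafka,
    e.g. after retention-based log truncation deleted old segments."""
    return not (earliest_offset <= persisted_offset <= latest_offset)

# Values from the eventlogging_SaveTiming error above: the stored offset
# 15025121 is below earliest_offset 15032211, so the topic is skipped.
print(offset_out_of_range(15025121, 15032211, 15035066))  # True
```

Since the stored offset is smaller than the earliest available one, there is no way to resume exactly where Camus left off; it must either abort (the default) or jump to the earliest offset.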
This has been happening ever since June 14th for many EventLogging topics; I'm not yet sure if other topics were affected. I just merged a patch to set kafka.move.to.earliest.offset=true, and am now running Camus for eventlogging topics. This will fix their imports and cause the last 7 days (since June 28) to begin importing, but we will need to backfill any affected topics for data between June 14 and June 28.
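The change in the merged patch amounts to one line in camus.properties (this setting is named in the Camus error message itself):

```properties
# When a stored offset falls outside the range Kafka still has available,
# restart from the earliest available offset instead of aborting the job.
kafka.move.to.earliest.offset=true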
Luckily we still have this date range in the files on eventlog1002 in /srv/log/eventlogging/archive.
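A minimal sketch of enumerating the backfill range. The dates come from the incident above; the filename pattern is hypothetical, since the actual archive layout on eventlog1002 under /srv/log/eventlogging/archive may differ:

```python
from datetime import date, timedelta

# Inclusive date range needing backfill (June 14 - June 28, 2018).
start, end = date(2018, 6, 14), date(2018, 6, 28)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Hypothetical archive filename pattern; adjust to the real layout.
paths = ['/srv/log/eventlogging/archive/all-events.log-%s.gz'
         % d.strftime('%Y%m%d') for d in days]

print(len(paths))  # 15 days, inclusive of both endpoints
```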