The problem:
EventLoggingToDruid does not use RefineTarget to determine which datasets are
complete and ready to be loaded to Druid at a given moment, because
RefineTarget does not currently support Druid. Instead, EventLoggingToDruid
simply assumes that the passed date/time interval is correct and loads it
without any check or filter. Interval checking therefore has to be done by
Puppet (cron), which passes a relative number of hours ago as since and until.
Potential issues:
- If the data pipeline is late for any reason (high load, outage, restarts, etc.), EventLoggingToDruid might not find the input data, or find it incomplete, and thus load incomplete or corrupt data to Druid for that hour.
- If the cluster is busy and the EventLoggingToDruid job waits more than 1 hour to launch, then 'since 6 hours ago' will skip 1 hour (or more), leaving a hole in the corresponding Druid datasource.
Either case would confuse and frustrate users, and leave the maintainers
with a lot of manual work backfilling datasources.
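A minimal sketch of the second issue, assuming a purely illustrative helper (`hourLoaded` is not part of the actual codebase): with a relative 'since N hours ago', the hour that gets loaded depends on when the job actually launches, not on when it was scheduled, so a delayed run silently shifts to a later hour and the originally targeted hour is never loaded.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RelativeIntervalSketch {
    // Hypothetical helper: which hour a run loads, given its actual launch
    // time and a relative 'since N hours ago' offset.
    static Instant hourLoaded(Instant launch, long hoursAgo) {
        return launch.minus(hoursAgo, ChronoUnit.HOURS)
                     .truncatedTo(ChronoUnit.HOURS);
    }

    public static void main(String[] args) {
        // A run scheduled at 10:00 loads hour 04:00 ('since 6 hours ago')...
        Instant onTime  = hourLoaded(Instant.parse("2018-06-01T10:00:00Z"), 6);
        // ...but if the cluster is busy and the same run only launches at
        // 11:05, it loads hour 05:00 instead. The 11:00 run also loads
        // 05:00, so hour 04:00 is never loaded: a hole in the datasource.
        Instant delayed = hourLoaded(Instant.parse("2018-06-01T11:05:00Z"), 6);

        System.out.println(onTime);  // 2018-06-01T04:00:00Z
        System.out.println(delayed); // 2018-06-01T05:00:00Z
    }
}
```

Checking data availability by absolute interval (as RefineTarget does for Hive targets) avoids this class of problem entirely.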
The right solution:
Allow RefineTarget to deal with Druid segments