
Exclude canary events from refined event Hive tables
Closed, Declined · Public


We will be enabling canary events for more and more streams in T266798: Enable canary events for all streams. These canary events are needed for ingestion monitoring, but is there any reason to include them in the Refined event tables? All we need is to be able to write the _REFINED flag into the hourly Hive partitions. If we filter out canary events, hours with no other data should still produce empty partitions with a complete _REFINED flag.

Users won't have to manually exclude the canary events in their Hive queries if we do this.
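To make the proposal concrete, here is a minimal sketch in plain Python rather than the actual Refine job. The function names, the data.jsonl output file, and the hour=00 directory are hypothetical; only meta.domain = 'canary' and the _REFINED flag come from this task. It shows an hour containing nothing but a canary event ending up empty while still being flagged as refined.

```python
import json
import tempfile
from pathlib import Path

CANARY_DOMAIN = "canary"  # value of meta.domain on canary events, per this task


def drop_canary_events(events):
    """Return only the events whose meta.domain is not the canary marker."""
    return [e for e in events if e.get("meta", {}).get("domain") != CANARY_DOMAIN]


def refine_hour(events, partition_dir):
    """Hypothetical stand-in for refining one hourly partition."""
    partition_dir = Path(partition_dir)
    partition_dir.mkdir(parents=True, exist_ok=True)
    kept = drop_canary_events(events)
    # Assumed output layout: one JSON-lines file per partition.
    with open(partition_dir / "data.jsonl", "w") as f:
        for event in kept:
            f.write(json.dumps(event) + "\n")
    # Written even when `kept` is empty, so the hour still counts as refined.
    (partition_dir / "_REFINED").touch()
    return kept


# An hour that received nothing but a canary event.
with tempfile.TemporaryDirectory() as tmp:
    hour_dir = Path(tmp) / "hour=00"
    kept = refine_hour([{"meta": {"domain": "canary"}}], hour_dir)
    flag_written = (hour_dir / "_REFINED").exists()
```

In this sketch the flag is written unconditionally after filtering, which is what keeps empty hours distinguishable from hours that were never refined.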

Event Timeline


Oh! We are already doing this for EventLogging data via the filter_allowed_domains transform function. Nothing to do here!

Actually, we are not using filter_allowed_domains for non-EventLogging data, and that was intentional. Instead, we wanted to add a field to the refined data, 'is_wikimedia_domain' or something similar, that would allow users to filter out bad data when they query. When we do this, we should also use the same field to filter canary events. I think we can leave the canary events in, but instruct users to filter on this TBD field, which will be set to false when meta.domain = 'canary'.
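A rough sketch of how such a flag might behave, in plain Python. The field name is still TBD, and the suffix list below is purely an assumption for illustration; the only rule stated in this task is that the field is false when meta.domain = 'canary'.

```python
# Assumed allowlist for illustration only; the real rule is not defined here.
ASSUMED_WIKIMEDIA_SUFFIXES = (".wikipedia.org", ".wikimedia.org")


def is_wikimedia_domain(domain):
    """False for canary events and missing domains, True for assumed suffixes."""
    if domain is None or domain == "canary":
        return False
    return domain.endswith(ASSUMED_WIKIMEDIA_SUFFIXES)


def add_domain_flag(event):
    """Return a copy of the event with the (TBD-named) flag field added."""
    domain = event.get("meta", {}).get("domain")
    return {**event, "is_wikimedia_domain": is_wikimedia_domain(domain)}


events = [
    {"meta": {"domain": "en.wikipedia.org"}},
    {"meta": {"domain": "canary"}},
    {"meta": {"domain": "example.com"}},
]

# Canary events stay in the refined data; users filter on the flag at query time.
flagged = [add_domain_flag(e) for e in events]
query_result = [e for e in flagged if e["is_wikimedia_domain"]]
```

Keeping the canary rows and exposing a single boolean means one filter covers both bad-domain data and canary events.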