We now have several Hive tables with aggregated, sanitized data, like pageview_hourly, and more will no doubt follow. We also need a high-level overview of overall traffic, with every request served by us accounted for. Requests could be tagged (and broken down, e.g. by MIME type, whether the request is a pageview, etc.), but there should be no filtering whatsoever. A 1:100 sampled Hive query would suffice.
This will allow us to monitor whether the filters that we use for other tables are losing touch with evolving reality and rejecting too much.
It can also help us track the amount of suspicious traffic (botnets etc.).
A simple report (for internal use) could tell us whether the percentage of sanitized page views is changing.
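
As a rough illustration of the idea, here is a minimal Python sketch of the tally such a report might be built on. It is not an existing job: the record fields (mime_type, is_pageview, is_sanitized) and the function names are hypothetical, the sampling is a simple every-100th-request scheme rather than an actual Hive query, and "percentage of sanitized page views" is assumed to mean the share of pageviews that survive the filters.

```python
from collections import Counter

SAMPLE_RATE = 100  # 1:100 sampling, as suggested in the description


def sample_requests(requests, rate=SAMPLE_RATE):
    """Yield every `rate`-th request: a systematic sample with no filtering."""
    for index, request in enumerate(requests):
        if index % rate == 0:
            yield request


def traffic_overview(requests, rate=SAMPLE_RATE):
    """Tally sampled requests by tag, scaling each count back up by `rate`."""
    counts = Counter()
    for request in sample_requests(requests, rate):
        counts["total"] += rate
        # Hypothetical field names; real webrequest data may differ.
        counts["mime:" + request["mime_type"]] += rate
        if request["is_pageview"]:
            counts["pageviews"] += rate
            if request["is_sanitized"]:
                counts["sanitized_pageviews"] += rate
    return counts


def sanitized_share(counts):
    """Percentage of estimated pageviews kept by the filters (None if no pageviews)."""
    if counts["pageviews"] == 0:
        return None
    return 100.0 * counts["sanitized_pageviews"] / counts["pageviews"]
```

Running traffic_overview over each day's requests and comparing sanitized_share between days would give the trend the report is after; a sudden drop could hint that the filters are rejecting too much.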