This task tracks the implementation of the "realtime webrequest" feature.
20220810
Summary/notes of a half-hour chat with @BTullis, @CDanis, @Ottomata and myself:
- Conceptually "realtime sampled webrequest" is similar to what we're doing with netflow (from a druid/turnilo POV)
- Having a topic already sampled (and processed/augmented) would make things easy on the druid/turnilo side, and we could drop the current hourly webrequest sampled datasource in turnilo, while keeping the realtime datasource only.
- Such a canonical sampled topic is generally useful for SREs and other users to peek at, and thus something we'd want
Then the question is, how to generate said sampled topic? Note that webrequest upload and text topics will need to be combined. The basic operations needed are:
- read from kafka topics
- sample the streams
- augment the records: for the first iteration we'd need basic augmentation for operational investigation purposes (i.e. geoip AS lookup)
- write the combined stream back into kafka as the sampled topic
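A minimal sketch of the stateless part of those operations (sample, augment, combine), kept free of any Kafka client so that it runs on its own. Everything named here is an assumption for illustration: the 1-in-128 sample rate, the `ip` and `webrequest_source` field names, the `as_number` output field, and the in-memory `ASN_TABLE` standing in for a real GeoIP/AS database. Reading from and writing to kafka would wrap around `process()`.

```python
import json
import random
from typing import Callable, Dict, Iterable, Iterator, Optional

SAMPLE_RATE = 128  # hypothetical: keep roughly 1 in 128 requests

# Hypothetical stand-in for a real GeoIP/AS database lookup.
ASN_TABLE: Dict[str, int] = {"198.51.100.7": 64496}


def lookup_asn(ip: str) -> Optional[int]:
    """Return the AS number for an IP, or None when it is unknown."""
    return ASN_TABLE.get(ip)


def sample(messages: Iterable[str], rate: int, rng: random.Random) -> Iterator[str]:
    """Keep each message with probability 1/rate; no state is carried between messages."""
    for message in messages:
        if rng.randrange(rate) == 0:
            yield message


def augment(record: dict, asn_lookup: Callable[[str], Optional[int]]) -> dict:
    """Return a copy of the record with the AS number attached."""
    enriched = dict(record)
    enriched["as_number"] = asn_lookup(record.get("ip", ""))
    return enriched


def process(
    sources: Dict[str, Iterable[str]],
    rate: int = SAMPLE_RATE,
    rng: Optional[random.Random] = None,
    asn_lookup: Callable[[str], Optional[int]] = lookup_asn,
) -> Iterator[str]:
    """Sample each source (e.g. text and upload), augment, and emit one combined stream."""
    if rng is None:
        rng = random.Random()
    for source_name, messages in sources.items():
        # Sample the raw messages first so only the kept ones pay for JSON parsing.
        for raw in sample(messages, rate, rng):
            out = augment(json.loads(raw), asn_lookup)
            out["webrequest_source"] = source_name
            yield json.dumps(out)
```

Here the sources are drained one after another; a real consumer would interleave messages from both topics as they arrive, which changes nothing in the per-message logic because no step depends on earlier messages.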
A few options have been discussed, including:
- Event/stream processing via Flink and/or the Event Platform. Upsides include standardization with other stream processing at the Foundation, and being able to re-use the augmentation logic we're currently using in druid generate_hourly_druid_webrequests.hql https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/refine_webrequest.hql
- Since this is relatively simple stream processing (and quite importantly, stateless) we could get away with simpler/easier solutions like https://www.benthos.dev (deployed in k8s for example)