Hi folks!
Filippo has been testing Benthos as stream processor, and I gave it a try as well for some of the ML use cases. It looks really simple and powerful, and in my opinion it would be nice to have a tool like that on some VMs or on K8s. Since the streaming use case is a big one and there are a lot of things going on I opened a task so we can discuss what is best to do :)
- Filippo's use case: enrich webrequest_text with geoip data
The idea has been discussed several times, namely pulling messages from the Kafka topic webrequest_text, adding geoip data and then sending them back to a different Kafka topic. Druid then could pull messages from this new topics and add "real-time" segments to webrequest_128.
- ML use case - replace ChangeProp for mediawiki.revision-score events
The idea is to read messages from the mediawiki.revision-create Kafka topic, wrap every message in another JSON that Lift Wing knows how to process and call the Lift Wing api accordingly. I have created a simple config to process only enwiki events, and tested it on stat1004. If you are curious: https://phabricator.wikimedia.org/P35321
Some interesting details:
- TLS used for both Kafka and HTTP, works out of the box nicely.
- There are "rate-limit" configs to add to avoid overwhelming services. In my case, I added one to send only few req/second to Lift Wing.
It seems to me that Benthos could be used in several places, and it should be really easy to package (it is a go binary that can be debianized or built as part of a Docker image and configured via Helm).
What does Data Engineering think? If there are other alternatives let us know, we don't want a proliferation of Stream processors/tools but it would be nice to have something like Bethos for simple use cases.