
Set up a Cloud VPS Kafka Cluster with replicated eventbus production data
Closed, Declined · Public

Description

Often folks want to build and test event-based services in Cloud VPS. However, the stream of events available (only in deployment-prep) is too low-volume for many use cases (machine learning, WDQS index updates, ElasticSearch updates, etc.). Much of the eventbus data can be and is being made public anyway (via EventStreams). EventStreams is good for external use cases where folks just need to listen to events, but internal production services will consume directly from Kafka and use features like timestamp offset seeking and offset commits, and perhaps Kafka's recent exactly-once transactional guarantees.
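To make the difference concrete, here is a minimal sketch of the kind of consumer EventStreams can't support: it seeks to a point in time and commits offsets explicitly. It assumes the kafka-python client; the broker address, topic, and group name are all hypothetical.

```python
# Sketch: internal consumer using Kafka features EventStreams does not
# expose (timestamp-based seeking, manual offset commits). Broker host,
# topic, and group id below are made up for illustration.
from datetime import datetime, timezone

def to_kafka_ts(dt):
    """Convert an aware datetime to the epoch-millisecond timestamp
    that Kafka's offsets_for_times() expects."""
    return int(dt.timestamp() * 1000)

def handle(event):
    """Placeholder for application-specific processing."""

def consume_since(when, topic="eqiad.mediawiki.revision-create",
                  brokers="kafka-cloud.example.wmflabs:9092"):
    # Imported here so the timestamp helper works without the client installed.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(
        bootstrap_servers=brokers,
        group_id="my-test-service",   # hypothetical consumer group
        enable_auto_commit=False,     # we commit offsets explicitly
    )
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    consumer.assign(partitions)

    # Seek every partition to the first offset at or after `when`.
    offsets = consumer.offsets_for_times(
        {tp: to_kafka_ts(when) for tp in partitions})
    for tp, ot in offsets.items():
        if ot is not None:
            consumer.seek(tp, ot.offset)

    for record in consumer:
        handle(record.value)
        consumer.commit()  # at-least-once: commit only after processing
```

Nothing like the seek-to-timestamp step is possible over the EventStreams SSE API, which is the core of the argument for direct Kafka access.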

To do this, we will need a maintained Kafka cluster that can replicate specific topics from kafka-jumbo. This cluster will need to be network-accessible from (all?) Cloud VPS projects.
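One way the replication could look, sketched as a MirrorMaker 2 style configuration; every hostname and the topic allowlist here are illustrative placeholders, not real cluster names:

```properties
# Sketch: mirror a few eventbus topics from kafka-jumbo into a
# hypothetical Cloud VPS cluster. Hostnames/topics are placeholders.
clusters = jumbo, cloud
jumbo.bootstrap.servers = kafka-jumbo1001.eqiad.wmnet:9092
cloud.bootstrap.servers = kafka-cloud1001.example.wmflabs:9092

# Only mirror jumbo -> cloud, and only the topics consumers need.
jumbo->cloud.enabled = true
jumbo->cloud.topics = eqiad\.mediawiki\.revision-create, eqiad\.mediawiki\.page-delete

# With a 3-node target cluster, replicate each mirrored partition 3 ways.
replication.factor = 3
```

Restricting the topic list keeps the Cloud VPS cluster's disk footprint bounded, which matters given the sizing concerns discussed below.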

Use cases

  • T161731 WDQS reliable change stream for specific wiki
  • ORES development/testing machine learning models
  • ...

Event Timeline

I'm not suggesting that we do this soon, but we should keep this in mind for future work. I think more and more folks will need this.

What resources would be needed to do this? Could we build this out as a Cloud VPS project or does it need dedicated hardware?

Hm, possibly. Something like 3 nodes, 2TB+ storage each, 32G+ RAM each, 12ish CPUs. I haven't loved maintaining this type of stuff on VPSs; things always seem to get rusty somehow. But it might be worth a try.

I don't know if we can set aside that much space on 3 different labvirts today, but we can check. Disk is probably the hardest resource to fit without starving other projects.

On the flip side, those would probably be relatively cheap dedicated boxes to spec out.


We can help with this, but we're declining as we try to focus our backlog.