Problem statement
The new WDQS Flink-based updater uses Kafka to communicate between parts of the process. To bootstrap new WDQS instances efficiently, we need to be able to copy data from already bootstrapped instances to the new ones. This requires copying both the Blazegraph journal data and the Kafka offsets. The first part is already in place, but we still need to be able to transfer offsets between different Kafka consumer groups and clusters (for cross-DC transfers).
This proposal is about introducing a Spicerack module to do just that. The process itself is independent of our specific need and can be reused by anybody, hence the idea of making it part of Spicerack. The module will use Puppet-generated configuration to handle any potential changes in the Kafka setup.
The new module (Kafka) will introduce the method:

transfer_kafka_position(self, topics: List[str], from_site: str, from_cluster: str, from_consumer_group: str, to_site: str, to_cluster: str, to_consumer_group: str)

which will allow transferring the position from a given site, cluster, and consumer group to another. For transfers within the same cluster, the offset will be copied as-is. For cross-cluster transfers, the offset will be approximated from the original record timestamp.
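The two transfer modes above can be sketched as follows. This is an illustrative outline, not the actual Spicerack implementation: the `lookup` callable stands in for a real timestamp-to-offset query against the target cluster (in python-kafka this would be `KafkaConsumer.offsets_for_times`), and the type aliases are hypothetical.

```python
from typing import Callable, Dict, Tuple

# (topic, partition) pair identifying a position in Kafka (illustrative alias).
TopicPartition = Tuple[str, int]


def map_offsets(
    committed: Dict[TopicPartition, Tuple[int, int]],
    lookup: Callable[[Dict[TopicPartition, int]], Dict[TopicPartition, int]],
    same_cluster: bool,
) -> Dict[TopicPartition, int]:
    """Map a source group's committed positions to offsets for the target group.

    committed maps each partition to (offset, record_timestamp_ms).
    Within one cluster, offsets are valid as-is and are copied verbatim.
    Across clusters, offsets are not comparable, so we approximate: for each
    partition we ask the target cluster (via lookup) for the first offset
    whose record timestamp is >= the timestamp seen on the source cluster.
    """
    if same_cluster:
        return {tp: offset for tp, (offset, _ts) in committed.items()}
    return lookup({tp: ts for tp, (_offset, ts) in committed.items()})
```

Because the cross-cluster mapping goes through timestamps, the resulting position is an approximation: the target group may re-read a few records around the transfer point, which is acceptable for an idempotent consumer like the WDQS updater.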
Third party dependencies
python-kafka - version 1.4.3 (available in Buster) is sufficient.
Relevant task
A more complete story on our needs - T276469