We are going to move away from the *.cirrussearch.update_pipeline.update.rc0 topics that were used during the development of the Search Update Pipeline.
The schemas and the stream & topic names should now refer to a stable version, since the pipeline is running in production for all wikis.
For this we need the two new v1 topics in each kafka-main cluster to be properly partitioned:
- eqiad.cirrussearch.update_pipeline.update.v1: 5 partitions
- codfw.cirrussearch.update_pipeline.update.v1: 5 partitions
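If the topics were created by hand, a minimal sketch with the stock Kafka CLI could look like the following (the bootstrap-server address and the replication factor of 3 are assumptions, not confirmed values; in practice this is likely handled by the usual topic-provisioning tooling):

```shell
# Hypothetical sketch: create one of the v1 topics with 5 partitions.
# kafka-main-eqiad:9092 and --replication-factor 3 are placeholder assumptions.
kafka-topics.sh --bootstrap-server kafka-main-eqiad:9092 \
  --create \
  --topic eqiad.cirrussearch.update_pipeline.update.v1 \
  --partitions 5 \
  --replication-factor 3
```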
Open question: we are asking for 5 partitions because that is what the rc0 topics had. That number was most likely chosen to spread the data across all 5 kafka nodes, but on the consumption side we found it was not ideal: spreading the workload evenly requires a number of consumers that divides 5, i.e. a Flink parallelism of 5, which is far more than we actually need. We wonder whether it would make sense to reconsider this number, avoid a prime, and possibly use 6 (or more?) partitions to allow for more flexibility. Let's revisit this in a separate task to avoid confusion.
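The parallelism argument above can be checked with a small sketch: under a round-robin assignment, 5 (a prime) partitions only split evenly across 1 or 5 consumers, whereas 6 partitions also split evenly across 2 or 3 consumers.

```python
# Sketch: how evenly P partitions spread over C consumer subtasks
# under a simple round-robin assignment.

def partition_counts(partitions: int, consumers: int) -> list[int]:
    """Number of partitions assigned to each consumer, round-robin."""
    counts = [0] * consumers
    for p in range(partitions):
        counts[p % consumers] += 1
    return counts

for total in (5, 6):
    even = [c for c in range(1, total + 1)
            if len(set(partition_counts(total, c))) == 1]
    print(f"{total} partitions split evenly across: {even}")
# 5 partitions split evenly across: [1, 5]
# 6 partitions split evenly across: [1, 2, 3, 6]
```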
Numbers:
- the volume is expected to be exactly the same as the rc0 topics: the topic in the active DC should be around 120 GB (incl. replication) and receive between 250 and 500 events/sec under normal conditions, with surges up to 1k events/sec when backfilling after an outage
- we won't duplicate the data between the rc0 and v1 topics, so no additional space will be required for the transition
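As a back-of-envelope sanity check of the ~120 GB figure, a quick sketch (the 7-day retention and replication factor of 3 below are hypothetical assumptions, not confirmed values):

```python
# Rough topic-size estimate: events/sec * avg event size * retention * replication.
RETENTION_S = 7 * 86400   # assumed 7-day retention
REPLICATION = 3           # assumed replication factor

def topic_size_gb(events_per_sec: float, avg_event_bytes: float) -> float:
    return events_per_sec * avg_event_bytes * RETENTION_S * REPLICATION / 1e9

# At the midpoint rate (~375 events/sec), an average event of ~175 bytes
# lands close to the observed ~120 GB.
print(round(topic_size_gb(375, 175)))
```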
AC:
- Determine if 5 partitions is still what we want
  - yes, we might want to keep 5 for now and reconsider this in a separate task
- [eqiad|codfw].cirrussearch.update_pipeline.update.v1 topics are properly partitioned in kafka-main[eqiad|codfw]