It would be useful to have a stream that supports the addition and deletion of CirrusSearch weighted_tags.
Users who want to tag or un-tag pages in the search index would then simply emit events to this stream.
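As an illustration of what emitting such an event could look like, here is a minimal sketch. The schema is not defined in this task, so every field name below (`wiki_id`, `page_id`, `weighted_tags`, `set`, `clear`, `tag`, `score`) and the helper `build_event` are hypothetical. The stream name is assumed to be the topic name from the AC without its datacenter prefix.

```python
import json
from datetime import datetime, timezone

# Assumption: stream name derived from the topic names listed in the AC.
STREAM = "mediawiki.cirrussearch.page_weighted_tags_change.rc0"


def build_event(wiki_id, page_id, set_tags=None, clear_prefixes=None):
    """Build a hypothetical tag-change event (field names are illustrative only).

    set_tags: dict mapping a tag prefix to a list of (tag, score) pairs to add.
    clear_prefixes: list of tag prefixes whose tags should be removed.
    """
    if not set_tags and not clear_prefixes:
        raise ValueError("an event must set or clear at least one tag")
    weighted_tags = {}
    if set_tags:
        weighted_tags["set"] = {
            prefix: [{"tag": tag, "score": score} for tag, score in tags]
            for prefix, tags in set_tags.items()
        }
    if clear_prefixes:
        weighted_tags["clear"] = list(clear_prefixes)
    return {
        "meta": {
            "stream": STREAM,
            "dt": datetime.now(timezone.utc).isoformat(),
        },
        "wiki_id": wiki_id,
        "page_id": page_id,
        "weighted_tags": weighted_tags,
    }


# Serialize one event that adds a tag under a made-up prefix.
event = build_event("enwiki", 42, set_tags={"example.prefix": [("exists", 1.0)]})
payload = json.dumps(event)
```

A producer would post `payload` to whatever event intake the stream config ends up defining. That part is deliberately left out here.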
There might be 2 different use-cases to support:
- realtime processes bound to the lifecycle of the page
- batch processes possibly sending a large number of modifications
We might consider exposing 2 different streams giving us the opportunity to route or throttle the events accordingly:
- events bound to the lifecycle of the page might enter the merge window of the SUP producer so that they get a chance to be joined with other events related to the same edit
- events produced in batch might skip that window and possibly be throttled (if deemed necessary) to limit the impact on latencies of the realtime events.
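The routing described in the two points above can be sketched as follows. This is not how the SUP producer is implemented: the token bucket is one possible throttling technique swapped in for illustration, and the names `TokenBucket`, `route`, `"merge_window"`, `"bypass"` and `"deferred"` are all hypothetical.

```python
import time


class TokenBucket:
    """Simple token-bucket rate limiter (illustrative throttle for batch events)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Refill according to elapsed time, then take one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def route(source, bucket):
    """Decide where an event goes based on the stream it came from.

    "realtime" events enter the merge window so they can be joined with
    other events for the same edit; "batch" events skip the window and
    are deferred once the throttle is exhausted.
    """
    if source == "realtime":
        return "merge_window"
    if source == "batch":
        return "bypass" if bucket.try_acquire() else "deferred"
    raise ValueError(f"unknown source: {source}")
```

With a single stream, as planned for now, there is no `source` to distinguish, which is the main argument for splitting the streams later if batch load turns out to hurt realtime latencies.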
For now, we start with a single stream.
AC:
- [x] define a schema for this stream
- [x] define a stream config
- [ ] create kafka topics (1 partition, 7 days retention):
-- eqiad.mediawiki.cirrussearch.page_weighted_tags_change.rc0
-- codfw.mediawiki.cirrussearch.page_weighted_tags_change.rc0
- [x] adapt the SUP producer to read these streams
-- possibly consider using [[https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment|watermark alignment]] and see if this helps the case where the batch stream might produce a lot of events at once
- [ ] adapt the [[https://wikitech.wikimedia.org/wiki/Search/WeightedTags|documentation on wikitech]]
- [ ] adapt existing users of weighted_tags to use this stream:
-- Growth using `\CirrusSearch\Updater::[update|reset]WeightedTags`
-- Image recommendation using hive partition
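For the topic-creation item above, the requested settings translate to the following values. This is only a sketch of the arithmetic: `retention.ms` is the standard Kafka topic-level retention setting, while the dictionary layout is made up for illustration and is not the format of any provisioning tool.

```python
RETENTION_DAYS = 7
# Kafka expresses topic retention in milliseconds.
RETENTION_MS = RETENTION_DAYS * 24 * 60 * 60 * 1000

# One topic per datacenter prefix, as listed in the AC.
topics = {
    f"{dc}.mediawiki.cirrussearch.page_weighted_tags_change.rc0": {
        "partitions": 1,
        "retention.ms": RETENTION_MS,
    }
    for dc in ("eqiad", "codfw")
}
```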