General ideas:
- A custom Flink source, embedded in the consumer, that handles the id generation phase (roughly, SaneitizeJobs.php). This should generate ids on timers so that all ids are produced gradually over a configured time period.
- The list of wikis should be sourced from the consumer's wikis parameter. When no wiki filter is in place, source it from sitematrix on metawiki, or from noc.wikimedia.org.
- The max page id should be queried from the wiki somehow. Alternatively we could ask Elasticsearch, but that is less of a guarantee.
- Expose the CirrusSearch checker through the MediaWiki API. Flink should call this API with a list of ids and receive the results of the check.
- Potentially generate a side output of the errors found, for future debugging.
- Generate update events that flow into the consumer fetch phase to fix up the errors found.
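
The timer-based id generation from the first item could be paced roughly as follows. This is only a sketch of the arithmetic, written in Python for brevity rather than as Flink source code. The names `IdBatch`, `plan_batches`, `loop_duration_s` and `batch_size` are hypothetical and do not come from the existing code; the max page id is assumed to have been obtained already.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class IdBatch:
    """One timer firing: a contiguous range of page ids to check (hypothetical type)."""
    wiki: str
    fire_at_s: float  # offset in seconds from the start of the loop
    from_id: int      # inclusive
    to_id: int        # inclusive


def plan_batches(wiki: str, max_page_id: int, loop_duration_s: float,
                 batch_size: int) -> Iterator[IdBatch]:
    """Spread the ids 1..max_page_id evenly over loop_duration_s."""
    if max_page_id < 1 or batch_size < 1 or loop_duration_s <= 0:
        raise ValueError("max_page_id, batch_size and loop_duration_s must be positive")
    # Ceiling division: the last batch may be smaller than batch_size.
    n_batches = -(-max_page_id // batch_size)
    interval_s = loop_duration_s / n_batches
    for i in range(n_batches):
        from_id = i * batch_size + 1
        to_id = min(from_id + batch_size - 1, max_page_id)
        yield IdBatch(wiki, i * interval_s, from_id, to_id)


if __name__ == "__main__":
    # Assumed example values: 25 pages, 10 ids per batch, a 100 second loop.
    for batch in plan_batches("testwiki", 25, 100.0, 10):
        print(batch)
```

With an assumed two-week loop (1,209,600 seconds), a max page id of 1,000,000 and 100 ids per batch, this gives 10,000 batches, so one batch roughly every 121 seconds per wiki. In the real source each batch would presumably be emitted from a timer and passed on to the checker API call.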