Given the current issue with Wikidata dumps (T386401), the recovery mechanism of Wikidata Query Service (WDQS) is at risk. Recovery relies on reloading data from dumps and then catching up from our update stream, but that stream is currently only retained for 30 days.
Increasing retention of the WDQS RDF stream to 60 days would allow more time to fix the Wikidata dumps. This increase should be permanent, to also mitigate future dump issues. The increase could be done on Kafka, or on Hadoop if resources are not available on Kafka. Note that we currently have no mechanism to consume the update stream from Hadoop, so additional implementation work would be needed if that is the solution we choose.
Once a strategy is decided on, subtasks will be created to track actual work.
The affected topics, all in kafka-main, are:
- [eqiad|codfw].rdf-streaming-updater.mutation
- [eqiad|codfw].rdf-streaming-updater.mutation-main
- [eqiad|codfw].rdf-streaming-updater.mutation-scholarly
- [eqiad|codfw].mediainfo-streaming-updater.mutation
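If the Kafka route is chosen, the change is a per-topic `retention.ms` override via the standard Kafka config tool. A sketch (the broker address is a placeholder and topic names would need to be repeated per DC and per topic; exact values to be confirmed with the broker owners):

```shell
# 60 days expressed in milliseconds, as expected by retention.ms
RETENTION_MS=$((60 * 24 * 60 * 60 * 1000))

# Placeholder broker address; repeat --entity-name for each affected topic
kafka-configs.sh --bootstrap-server kafka-main-broker:9092 \
  --alter --entity-type topics \
  --entity-name eqiad.rdf-streaming-updater.mutation \
  --add-config "retention.ms=${RETENTION_MS}"

# Verify the override is in place
kafka-configs.sh --bootstrap-server kafka-main-broker:9092 \
  --describe --entity-type topics \
  --entity-name eqiad.rdf-streaming-updater.mutation
```

A topic-level override takes precedence over the broker-wide `log.retention.hours` default, so the rest of the cluster is unaffected; the main cost to evaluate is the extra disk needed to hold twice the current window of these topics.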
If increasing the retention in kafka-main is not an option, we could consider increasing it in kafka-jumbo instead (the automated tooling for WDQS data reloads does not support kafka-jumbo, but in a disaster recovery scenario we could likely adapt to it relatively quickly).


