ChangeProp currently has some limited deduplication for transclusion-related re-renders.
Here's how it works right now:
- When a template is changed, we issue a request to the MW API to get 50 pages in which the template is transcluded, post individual jobs to re-render those pages, and post a new continuation event with an increased sequence number.
- On every continuation event, we check an in-memory list of the latest processed continuations and possibly deduplicate the event (code)
Since the event history is kept in memory and the list is not very long, we lose some deduplication capability in two ways: the history is wiped on restart, and past events are forgotten quite quickly.
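To illustrate the limitation, here is a minimal sketch of a bounded in-memory history. It is not the actual ChangeProp code: the class name `RecentContinuations`, the `checkAndAdd` method, the list size, and the use of a template title as the deduplication key are all assumptions made for the example.

```javascript
'use strict';

// Hypothetical bounded in-memory history of processed continuation events.
// Everything here lives in process memory, so it is lost on restart.
class RecentContinuations {
    constructor(maxSize) {
        this.maxSize = maxSize;
        this.keys = []; // oldest entry first
    }

    // Returns true if the key was seen recently (a duplicate);
    // otherwise records the key and returns false.
    checkAndAdd(key) {
        if (this.keys.includes(key)) {
            return true;
        }
        this.keys.push(key);
        if (this.keys.length > this.maxSize) {
            this.keys.shift(); // forget the oldest entry
        }
        return false;
    }
}

// A history of size 2 forgets 'Template:A' after two other events arrive.
const history = new RecentContinuations(2);
const results = [
    history.checkAndAdd('Template:A'), // first time seen
    history.checkAndAdd('Template:A'), // duplicate
    history.checkAndAdd('Template:B'),
    history.checkAndAdd('Template:C'), // evicts 'Template:A'
    history.checkAndAdd('Template:A')  // no longer remembered
];
console.log(results); // [ false, true, false, false, false ]
```

The last call shows the problem described above: once an entry falls off the short list, a repeated event is no longer recognized as a duplicate.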
This task is created to consider options to add some sort of storage to ChangeProp to be used for de-duplication purposes.
Adding storage would be the foundation for later work on generalizing the deduplication to support JobQueue use cases.
Basically, the storage needs to hold an expiring map 'sha1' -> 'timestamp', so I propose using Redis for that. The Redis drivers for Node are also pretty good: https://www.npmjs.com/package/redis
The requirements for the storage are:
- Key-value map 'sha1' -> 'timestamp' with efficient automatic expiry.
- Support for a high rate (>100/s) of reads and writes. Most jobs will not find a pre-existing duplicate on read, and will add a new entry once the job has been fully processed.
- Reliable and low-maintenance multi-datacenter operation.
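
As a rough sketch of how the expiring 'sha1' -> 'timestamp' map could be used, the code below reads before processing and writes after processing. It assumes a client with promise-based `get(key)` and `set(key, value, { EX: seconds })` methods, as offered by version 4 of the `redis` package. The job fields, the TTL, the duplicate rule (an equivalent job was already processed at or after the triggering event's timestamp), and the in-memory stand-in client used for the demo are all hypothetical.

```javascript
'use strict';

const crypto = require('crypto');

// Assumed retention period for deduplication entries.
const TTL_SECONDS = 24 * 60 * 60;

// Hypothetical job identity: SHA-1 over the job type and page title.
function jobKey(job) {
    return crypto.createHash('sha1')
        .update(`${job.type}:${job.title}`)
        .digest('hex');
}

// Read path: a job is treated as a duplicate if an equivalent job was
// processed at or after the timestamp of the event that triggered it.
async function isDuplicate(client, job, rootTimestamp) {
    const stored = await client.get(jobKey(job));
    return stored !== null && Number(stored) >= rootTimestamp;
}

// Write path: record the processing time once the job is fully processed.
// The EX option asks the store to expire the entry automatically.
async function markProcessed(client, job, timestamp) {
    await client.set(jobKey(job), String(timestamp), { EX: TTL_SECONDS });
}

// In-memory stand-in for a Redis client, for demonstration only.
function createFakeClient() {
    const store = new Map();
    return {
        async get(key) {
            const entry = store.get(key);
            if (!entry || entry.expiresAt <= Date.now()) {
                store.delete(key);
                return null;
            }
            return entry.value;
        },
        async set(key, value, options) {
            store.set(key, { value, expiresAt: Date.now() + options.EX * 1000 });
            return 'OK';
        }
    };
}

async function demo() {
    const client = createFakeClient();
    const job = { type: 'rerender', title: 'Main_Page' };
    const before = await isDuplicate(client, job, 1000);     // nothing stored yet
    await markProcessed(client, job, 1500);
    const afterOlder = await isDuplicate(client, job, 1000); // processed after event
    const afterNewer = await isDuplicate(client, job, 2000); // event is newer
    return [before, afterOlder, afterNewer];
}

demo().then((result) => console.log(result)); // [ false, true, false ]
```

With a real Redis client in place of the stand-in, each job costs one read and, if it is not a duplicate, one write, which matches the access pattern in the requirements.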