Add storage to Change-Prop for deduplication
Closed, Resolved · Public

Description

ChangeProp currently has some limited deduplication for transclusion-related re-renders.

Here's how it works right now:

  1. When a template is changed, we issue a request to the MW API to get 50 pages in which the template is transcluded, post individual jobs to re-render those pages, and post a new continuation event with an increased sequence number.
  2. On every continuation event, we check an in-memory list of the latest processed continuations and possibly deduplicate the event (code)

Since the history of events is kept in memory and is not a very long list, we lose some of the deduplication capability on restart, and we also forget about past events quite quickly.
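The limitation can be illustrated with a minimal sketch. This is not the actual ChangeProp code: the class name `RecentContinuations`, the key format, and the capacity of 2 are all hypothetical, chosen only to show how a short in-memory list forgets older events (and why everything is lost on restart).

```javascript
// Hypothetical sketch of a bounded in-memory history of processed
// continuation events. Names, key format and capacity are assumptions.
class RecentContinuations {
  constructor(capacity) {
    this.capacity = capacity;
    this.keys = []; // oldest entry first; lives only in process memory
  }

  // Returns true if the key was seen recently (a duplicate).
  // Otherwise records the key and returns false.
  checkAndRecord(key) {
    if (this.keys.includes(key)) {
      return true;
    }
    this.keys.push(key);
    if (this.keys.length > this.capacity) {
      this.keys.shift(); // forget the oldest entry
    }
    return false;
  }
}

const history = new RecentContinuations(2);
const results = [
  history.checkAndRecord('Template:A|seq=1'), // first time seen
  history.checkAndRecord('Template:A|seq=1'), // duplicate, still remembered
  history.checkAndRecord('Template:B|seq=1'),
  history.checkAndRecord('Template:C|seq=1'), // pushes A out of the list
  history.checkAndRecord('Template:A|seq=1'), // no longer detected
];
console.log(results);
```

The last call is not recognized as a duplicate because the entry has already been evicted, which is the "quickly forget" problem described above.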

This task is to consider options for adding some sort of storage to ChangeProp for deduplication purposes.

Adding storage would be the foundation for later work on generalizing the deduplication to support JobQueue use-cases.

Basically, the storage needs to be able to hold an expiring map 'sha1' -> 'timestamp', so I propose to use Redis for that. Also, the Node.js drivers for Redis are pretty good: https://www.npmjs.com/package/redis

Requirements

  • Key-value map 'sha1' -> 'timestamp' with efficient automatic expiry.
  • Support for a high rate (>100/s) of reads and writes. Most jobs will not find a pre-existing duplicate on read, and will add a new entry once the job has been fully processed.
  • Reliable and low-maintenance multi-datacenter operation.
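To make the first two requirements concrete, here is a minimal in-memory sketch of the expiring 'sha1' -> 'timestamp' map and the read-then-write access pattern. It is a stand-in under stated assumptions, not a proposed implementation: `ExpiringMap`, `jobKey`, the job shape, and the 1000 ms TTL are hypothetical. In Redis, per-key expiry of this kind is available through the `EX` option of the `SET` command.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory stand-in for the expiring map described above.
class ExpiringMap {
  // `now` is injectable so that expiry can be demonstrated deterministically.
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map(); // sha1 -> { timestamp, expiresAt }
  }

  // Returns the stored timestamp, or undefined if absent or expired.
  get(sha1) {
    const entry = this.entries.get(sha1);
    if (!entry) {
      return undefined;
    }
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(sha1); // lazy expiry on read
      return undefined;
    }
    return entry.timestamp;
  }

  set(sha1, timestamp) {
    this.entries.set(sha1, { timestamp, expiresAt: this.now() + this.ttlMs });
  }
}

// Assumption: a job is identified by the SHA-1 of its JSON form.
// JSON.stringify is key-order sensitive, so a real key would need a
// canonical serialization.
function jobKey(job) {
  return crypto.createHash('sha1').update(JSON.stringify(job)).digest('hex');
}

let clock = 0; // fake clock in milliseconds
const store = new ExpiringMap(1000, () => clock);
const key = jobKey({ type: 'rerender', title: 'Example' });

const before = store.get(key); // typical case: no pre-existing duplicate
store.set(key, clock);         // written once the job is fully processed
clock = 500;
const during = store.get(key); // still within the TTL
clock = 1000;
const after = store.get(key);  // TTL elapsed, entry is gone
```

Each job costs one read and, in the common case, one write, which is where the read/write rate requirement comes from.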

Event Timeline

Restricted Application added a subscriber: Aklapper.
Pchelolo updated the task description.
Pchelolo removed a subscriber: Aklapper.

I want to tackle this first. Adding Redis to change-prop would benefit us no matter what, and it's a foundation for further improving the quality of deduplication in ChangeProp.

I'd much appreciate an opinion from SRE on this question.

GWicke subscribed.

I added a requirements section that more explicitly calls out what we are looking for in a storage backend.

Interesting solution from Netflix for multi-datacenter replication of Redis: https://github.com/Netflix/dynomite

We might need to add storage to ChangeProp not only for deduplication, but also for automatic page blacklisting, see T161710 for details.

Pchelolo edited projects, added Services (done); removed Services (later).

Redis was added to ChangeProp nodes and is already successfully used for blacklisting unparseable pages. This is done.