
Add storage to Change-Prop for deduplication
Closed, Resolved · Public


ChangeProp currently has some limited deduplication for transclusion-related re-renders.

Here's how it works right now:

  1. When a template is changed, we issue a request to the MW API to get 50 pages in which the template is transcluded, post individual jobs to re-render those pages, and post a new continuation event with an increased sequence number.
  2. On every continuation event, we check it against an in-memory list of the latest processed continuations and possibly deduplicate it (code)
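
Step 1 above can be sketched roughly as follows. This is only an illustration under stated assumptions: `handle_event`, `fetch_batch`, `post_job`, `post_continuation`, and the event fields are hypothetical names, not the actual ChangeProp code or the MW API, and the transclusion lookup is passed in as a plain callable.

```python
BATCH_SIZE = 50  # pages fetched per request, as described above


def handle_event(event, fetch_batch, post_job, post_continuation):
    """Process one template-change or continuation event (hypothetical sketch).

    fetch_batch(template, continue_token, limit) is assumed to return
    (pages, next_token), where next_token is None once no pages remain.
    """
    pages, next_token = fetch_batch(
        event["template"], event.get("continue"), BATCH_SIZE
    )

    # One re-render job per page that transcludes the template.
    for page in pages:
        post_job({"type": "rerender", "page": page})

    # If more pages remain, emit a continuation with an increased sequence number.
    if next_token is not None:
        post_continuation({
            "template": event["template"],
            "continue": next_token,
            "sequence": event.get("sequence", 0) + 1,
        })
```

Each continuation event is handled by the same function, so a large template fans out into a chain of 50-page batches.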

Since the history of events is kept in memory and is not a very long list, we lose some of the deduplication capability on restart, and we also forget about past events quite quickly.
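
To make the limitation concrete, here is a minimal sketch of a bounded in-memory history. The class name, the size, and the idea of keying on a (template, continuation token) pair are assumptions for illustration; the real check in ChangeProp may compare different fields.

```python
from collections import deque


class RecentContinuations:
    """Bounded in-memory history of processed continuations (hypothetical)."""

    def __init__(self, max_size=100):
        # A deque with maxlen silently drops the oldest entry when full.
        self._history = deque(maxlen=max_size)

    def is_duplicate(self, template, continue_token):
        return (template, continue_token) in self._history

    def record(self, template, continue_token):
        self._history.append((template, continue_token))
```

Because the history lives only in process memory, it starts empty after every restart, and any entry older than the last `max_size` events is forgotten.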

This task is created to consider options to add some sort of storage to ChangeProp to be used for de-duplication purposes.

Adding storage would be the foundation for later work on generalizing the deduplication to support JobQueue use cases.

Basically, the storage needs to be able to hold an expiring map 'sha1' -> 'timestamp', so I propose to use Redis for that. Also, the Redis Node.js drivers are pretty good.

Requirements:

  • Key-value map 'sha1' -> 'timestamp' with efficient automatic expiry.
  • Support for a high rate (>100/s) of reads and writes. Most jobs will not find a pre-existing duplicate in the read, and will add a new entry once the job has been fully processed.
  • Reliable and low-maintenance multi-datacenter operation.
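
The read-then-write pattern in the requirements above could look something like the sketch below. It uses an in-memory stand-in so that it runs without a server; in Redis the same semantics would roughly correspond to a `GET` followed by a `SET` with an expiry (`EX`). All names here (`ExpiringMap`, `job_key`, `process_once`) are hypothetical, and treating a job as a duplicate when the stored timestamp is at least as new as the event's is an assumption, not something this task specifies.

```python
import hashlib
import time


class ExpiringMap:
    """In-memory stand-in for an expiring 'sha1' -> 'timestamp' map."""

    def __init__(self, ttl_seconds, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # sha1 hex digest -> (timestamp, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        timestamp, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily drop expired entries
            return None
        return timestamp

    def put(self, key, timestamp):
        self._entries[key] = (timestamp, self._clock() + self._ttl)


def job_key(job_description):
    """SHA-1 of a string describing the job (assumed key format)."""
    return hashlib.sha1(job_description.encode("utf-8")).hexdigest()


def process_once(store, job_description, event_timestamp, run_job):
    """Run the job unless an equal-or-newer one was already processed."""
    key = job_key(job_description)
    seen = store.get(key)
    if seen is not None and seen >= event_timestamp:
        return False  # duplicate, skip
    run_job()
    # The entry is added only after the job has been fully processed.
    store.put(key, event_timestamp)
    return True
```

Most calls take the non-duplicate path, so each job costs one read and one write, which is where the >100/s read and write rate comes from.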

Event Timeline

Pchelolo created this task. Feb 3 2017, 12:16 AM
Restricted Application added a project: Analytics. Feb 3 2017, 12:16 AM
Restricted Application added a subscriber: Aklapper.
Pchelolo updated the task description.
Pchelolo removed a subscriber: Aklapper.
Restricted Application added a project: Analytics. Feb 3 2017, 12:16 AM

As the work progresses, I want to tackle this first. Adding Redis to ChangeProp would benefit us no matter what, and it is a foundation for further improving the quality of deduplication in ChangeProp.

I'd much appreciate an opinion from Operations on this question.

GWicke updated the task description. Feb 3 2017, 8:29 PM
GWicke added a subscriber: GWicke.

I added a requirements section that more explicitly calls out what we are looking for in a storage backend.

Interesting solution from Netflix for multi-datacenter replication of Redis:

We might need to add storage to ChangeProp not only for deduplication, but also for automatic page blacklisting, see T161710 for details.

elukey added a subscriber: elukey. Jul 5 2017, 7:58 AM
Pchelolo closed this task as Resolved. Jul 11 2017, 8:09 PM
Pchelolo edited projects, added Services (done); removed Services (later).

Redis was added to ChangeProp nodes and is already successfully used for blacklisting unparseable pages. This is done.