
Investigate how prefetching labels would work for watchlist and recent changes
Closed, Resolved (Public)


Version: unspecified
Severity: normal
Whiteboard: u=dev c=backend p=5 s=2014-11-11


Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 3:53 AM
bzimport set Reference to bz72309.
bzimport added a subscriber: Unknown Object (MLST).

Needs input from WMF-people. DanielK and/or aude are going to write an E-Mail.

Change 176378 had a related patch set uploaded (by Daniel Kinzler):
Introducing RecentChangesRowsForDisplay hook.


daniel added a subscriber: daniel. Nov 28 2014, 7:55 PM

Change 176378 introduces the RecentChangesRowsForDisplay hook, which would allow us to pre-fetch labels for entities present in the RC rows.

Change 176378 abandoned by Daniel Kinzler:
Introducing RecentChangesRowsForDisplay hook.

ChangesListInitRows exists and should do

As Katie pointed out, the existing ChangesListInitRows hook should fit the bill already.
The rest of the investigation should be to outline which services/interfaces we need to make the pre-fetching work.

Here's a rough outline of the pre-fetching infrastructure for labels and other Terms:

We need a TermCache service like this:

interface TermCache {

    /** Update terms for the given entity. Any old terms associated with the entity are discarded. */
    public function updateTerms( EntityId $entityId, Fingerprint $terms );

    /** Load a set of terms into memory, for later use by getTerms(). */
    public function prefetchTerms( array $entityIds, $termTypes, $languages );

    /**
     * Get terms of the given types for the given entities, in the given languages (or all languages).
     * @param EntityId[] $entityIds
     */
    public function getTerms( array $entityIds, $termTypes, $languages );

}

This interface is somewhat similar to the TermIndex interface. We should consider cleaning up TermIndex, and using that.

The TermCache makes use of a persistent cache (optionally shared between wikis), such as memcached, plus local in-process caching. It would be used as follows:

  • ChangeHandler calls updateTerms() when it is notified of an Entity being updated. If the cache is shared, the repo does this immediately when the terms associated with an Entity are modified. In both cases, terms from the Fingerprint are placed into memcached using setMulti(), one entry per term. The cache key will contain the entity Id, term type, and language (plus possibly a wiki id, version id, etc). Multi-value terms (aliases) are stored as a list of values. A special key that does not include the type or language parts is used to store a list of all the keys used for a given entity, to allow these keys to be purged when updateTerms() is called again for the same entity.
  • A hook like ChangesListInitRows is used to trigger prefetchTerms(); prefetchTerms() uses getMulti() to fetch all the desired terms from memcached, and stores them locally in a hash. PROBLEM: if some terms are missing, we do not know whether they are uncached, or do not exist. Negative caching and/or checking the key list should be used. TBD: Decide whether TermCache should know how to fetch uncached terms.
  • A hook like LinkBegin may use a LabelLookup that is based upon a TermCache and TermLookup to get terms associated with item pages. If the desired label was previously fetched via prefetchTerms(), this should be very quick.
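
The update and prefetch steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: ArrayCache is a hypothetical in-memory stand-in for memcached, termKey() and keyListKey() are made-up helper names, a nested array stands in for a Fingerprint, and every term (not just aliases) is stored as a list of values for uniformity. It takes the "check the key list" route to tell "does not exist" apart from "uncached".

```php
<?php
// Hypothetical in-memory stand-in for memcached; only multi-key calls are modelled.
class ArrayCache {
    private $data = [];

    public function setMulti( array $items ) {
        foreach ( $items as $key => $value ) {
            $this->data[$key] = $value;
        }
    }

    public function getMulti( array $keys ) {
        // Only keys that are present come back, as with memcached.
        return array_intersect_key( $this->data, array_flip( $keys ) );
    }

    public function deleteMulti( array $keys ) {
        foreach ( $keys as $key ) {
            unset( $this->data[$key] );
        }
    }
}

// Assumed key layout: entity ID, term type and language; the key list omits the last two.
function termKey( $entityId, $termType, $language ) {
    return "terms:$entityId:$termType:$language";
}

function keyListKey( $entityId ) {
    return "terms:$entityId";
}

// $terms has the shape [ termType => [ language => [ values ] ] ] (stand-in for a Fingerprint).
function updateTerms( ArrayCache $cache, $entityId, array $terms ) {
    $listKey = keyListKey( $entityId );
    // Purge the entries written by the previous update of this entity.
    $old = $cache->getMulti( [ $listKey ] );
    if ( isset( $old[$listKey] ) ) {
        $cache->deleteMulti( $old[$listKey] );
    }
    $items = [];
    foreach ( $terms as $termType => $byLanguage ) {
        foreach ( $byLanguage as $language => $values ) {
            $items[ termKey( $entityId, $termType, $language ) ] = $values;
        }
    }
    $termKeys = array_keys( $items );
    $items[$listKey] = $termKeys;
    $cache->setMulti( $items );
}

// Returns [ cacheKey => values ]: [] means "known not to exist", null means "uncached".
function prefetchTerms( ArrayCache $cache, array $entityIds, array $termTypes, array $languages ) {
    $wanted = [];
    foreach ( $entityIds as $entityId ) {
        $wanted[ $entityId ] = [];
        foreach ( $termTypes as $termType ) {
            foreach ( $languages as $language ) {
                $wanted[ $entityId ][] = termKey( $entityId, $termType, $language );
            }
        }
    }
    // One round trip for all term keys and all key lists.
    $allKeys = array_map( 'keyListKey', $entityIds );
    foreach ( $wanted as $keys ) {
        $allKeys = array_merge( $allKeys, $keys );
    }
    $found = $cache->getMulti( $allKeys );

    $local = [];
    foreach ( $wanted as $entityId => $keys ) {
        $listKey = keyListKey( $entityId );
        $keyList = isset( $found[$listKey] ) ? $found[$listKey] : null;
        foreach ( $keys as $key ) {
            if ( isset( $found[$key] ) ) {
                $local[$key] = $found[$key];
            } elseif ( $keyList !== null && !in_array( $key, $keyList, true ) ) {
                $local[$key] = []; // entity is cached, but has no such term
            } else {
                $local[$key] = null; // entity unknown to the cache, or entry evicted
            }
        }
    }
    return $local;
}
```

A term key that is listed in the key list but missing from the getMulti() result is treated as evicted, so it is reported as uncached rather than as non-existent. Where the uncached entries get resolved is left open here, as in the outline.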

The main issue that remains to be decided is when and where cache misses are resolved (by looking at the wb_terms table); one complication is that it's unclear whether we should load all terms of an entity in such a case, or just the one we currently need.

This architecture is intended to minimize I/O volume as well as round trips to memcached. It still means putting a large number of small entries into memcached with no expiration, possibly swamping it and pushing out high-usage entries with low-usage labels. A "randomized put" strategy could be used to mitigate this issue by writing only every Kth entry to the cache, giving frequently used labels a higher chance of being cached.
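
A rough sketch of one way the "randomized put" could look, assuming the write decision is made independently at random with probability 1/K each time a term is loaded (rather than by counting every Kth write). The function names shouldPut() and cachedProbability() are hypothetical and only serve the illustration.

```php
<?php
// Decide whether a freshly loaded term is written to the shared cache.
// The random source is injectable so the behaviour can be tested deterministically.
function shouldPut( $k, $randomInt = 'mt_rand' ) {
    return $randomInt( 1, $k ) === 1;
}

// Chance that a term has been written at least once after $requests loads:
// 1 - (1 - 1/K)^n, which approaches 1 for frequently requested terms.
function cachedProbability( $k, $requests ) {
    return 1 - pow( 1 - 1 / $k, $requests );
}
```

With K = 10, a label requested once has a 10% chance of being written, while one requested 50 times has almost certainly been written by then. This is what biases the cache contents towards frequently used labels.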

daniel added a comment. Edited Nov 30 2014, 6:42 PM

Marius brought up an idea for pre-fetching labels accessed via Lua or {{#property}} with arbitrary access enabled: we pre-fetch based on the usage tracking info, so that when parsing a new revision of a page, we have fast access to all labels/terms used by the previous revision. The idea is that in most cases, the labels used will change little, if at all.
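
As an illustration of that idea, the set of entities to pre-fetch could be derived from the previous revision's usage records roughly like this. The record shape (an array with 'entityId' and 'aspect' keys) and the function name entitiesToPrefetch() are assumptions made for this sketch, not the actual usage tracking format.

```php
<?php
// Collect the IDs of all entities whose labels (or other given aspects)
// were used by the previous revision of the page.
function entitiesToPrefetch( array $previousUsages, array $aspects = [ 'label' ] ) {
    $ids = [];
    foreach ( $previousUsages as $usage ) {
        if ( in_array( $usage['aspect'], $aspects, true ) ) {
            $ids[ $usage['entityId'] ] = true; // array keys de-duplicate
        }
    }
    return array_keys( $ids );
}
```

The resulting list would be handed to prefetchTerms() before parsing starts. Labels that the new revision uses for the first time would still be resolved individually.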

aude added a subscriber: aude. Nov 30 2014, 7:02 PM

If terms cannot be batched and prefetched, and there is no memcached- or redis-based term cache, LinkBegin should use an EntityRetrievingTermLookup (with in-process caching).

memcached retrieval of entities is faster than sql queries of term table.

aude added a comment. Nov 30 2014, 7:04 PM

if we use memcached, then suggest we cache entities that are in the recent changes table (maybe x recent days) plus perhaps most used entities as determined by client usage tracking.

daniel added a comment. Dec 1 2014, 2:04 PM

@aude: "memcached retrieval of entities is faster than sql queries of term table" <-- i'm not sure this is true, especially for large items. I have sent mail to Springle and Ori asking for input. Will do some benchmarking.

Of course, pre-fetching/batching plus in-process caching is the Right Thing; the question at hand is whether we want to switch to table lookups NOW, to save memory and perhaps also time.

Created T74310 "Batched label query for watchlists and recent changes" for tracking the caching aspect.

Lydia_Pintscher removed a subscriber: Unknown Object (MLST).
aude added a comment. Dec 1 2014, 7:14 PM

only one data point comparison on my dev wiki and obviously can't reproduce all production conditions, but:

recent changes with EntityTermLookup (30 days, 60 items) - TermSqlIndex:

65955512 memory
28786 backend response time

recent changes with EntityRetrievingTermLookup (30 days, 60 items), few cache misses:

70434344 memory
20670 backend response time

recent changes with EntityRetrievingTermLookup (30 days, 60 items), have restarted memcached so many cache misses:

70634728 memory
33539 backend response time

neither is really a good choice. using TermSqlIndex queries is somewhat better with regard to memory usage, while EntityRetrievingTermLookup seems better if cache misses are few enough.

these results are also consistent with comparisons i had made on the EntityView (before we had EntityInfoTermLookup).

if we can somehow profile in production (with reasonable effort), would be good and interesting.

and overall, continue to work on batched lookup and caching in as many places as possible.

@aude: thanks for the benchmark! I'm surprised that loading the full entities does not have a bigger impact on memory usage. How many different entities were hit, and how big are these entities?

Lydia_Pintscher set Security to None.
Lydia_Pintscher added subscribers: JanZerebecki, hoo.
daniel closed this task as Resolved. Dec 9 2014, 9:52 PM
daniel claimed this task.