WANObjectCache relay daemon or mcrouter support
Closed, ResolvedPublic

Description

Option I:
Support mcrouter for broadcasting cross-DC purges.

Option II:
There is a Gerrit repo, https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/python-cache-relay, which contains the code that handles relaying purges for the WANObjectCache.

This should be ported to use Kafka and should support:
a) memcached purges
b) CDN (varnish) purges

MediaWiki should emit events as needed.

Event Timeline

aaron assigned this task to ori.
aaron raised the priority of this task from to Medium.
aaron updated the task description.
aaron added a project: Sustainability.
aaron added subscribers: Gilles, MZMcBride, aaron and 4 others.
aaron set Security to None.

The code is (mostly) correct, but in my opinion it needs to be improved dramatically if you want maintenance responsibilities to extend beyond yourself. I don't think it can be considered complete without documentation and tests.

In terms of overall design, in my opinion there should be a Relay interface (consider using the abc module), with a subclass for each backend (Redis, memcached and CDN). Let dynamic dispatch do the job of relay_cache_command.
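The design suggested above could look roughly like the following minimal sketch. The class names, the purge() method and the returned strings are all hypothetical and are not taken from the python-cache-relay repo; real subclasses would talk to their backends instead of returning a description.

```python
from abc import ABC, abstractmethod


class Relay(ABC):
    """Hypothetical common interface for every purge backend."""

    @abstractmethod
    def purge(self, key: str) -> str:
        """Relay a purge for `key` and describe what was sent."""


class MemcachedRelay(Relay):
    def purge(self, key: str) -> str:
        # Placeholder: a real relay would send the command to memcached.
        return f"memcached: delete {key}"


class RedisRelay(Relay):
    def purge(self, key: str) -> str:
        # Placeholder: a real relay would send the command to Redis.
        return f"redis: DEL {key}"


class CdnRelay(Relay):
    def purge(self, key: str) -> str:
        # Placeholder: a real relay would send a purge request to the CDN.
        return f"cdn: PURGE {key}"


# One instance per backend; the backend names are assumptions.
RELAYS = {
    "memcached": MemcachedRelay(),
    "redis": RedisRelay(),
    "cdn": CdnRelay(),
}


def relay(backend: str, key: str) -> str:
    """Dynamic dispatch replaces a hand-written if/elif chain."""
    return RELAYS[backend].purge(key)
```

Adding a new backend then means adding one subclass and one entry in RELAYS, and the abc module refuses to instantiate a subclass that forgets to implement purge().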

Change 261595 had a related patch set uploaded (by Aaron Schulz):
Make CDN purges send EventRelayer events

https://gerrit.wikimedia.org/r/261595

Change 261595 merged by jenkins-bot:
Make CDN purges send EventRelayer events

https://gerrit.wikimedia.org/r/261595

Change 286588 had a related patch set uploaded (by Smalyshev):
[WIP] Add configs for kafka-watcher tool

https://gerrit.wikimedia.org/r/286588

Hey, not to be annoying, but eventlogging already does most of what you have coded, and takes pluggable handlers.

You can run your own eventlogging instances.

So, as far as memcached is concerned, I think that the best way to serve purges to both datacenters is using mcrouter, a thing I am working on right now (see T132317). I am unsure if this software is a better solution than the current purging mechanism for varnishes, but as far as memcached is concerned, we should really go the mcrouter way IMO.

OK, that's already three options to do the same thing. Maybe we should have a meeting and decide which way we're going.

@Joe Looking at mcrouter, it's for memcached only, so we either need a separate message propagation system just for memcached, or we still need some kind of router to get messages from Kafka to memcached.

It would be nice to handle CDN and WAN cache purges with one little system (e.g. kafka-watcher).

I'm also not sure how to do WAN cache "check keys" with mcrouter without it being more vulnerable to race conditions (e.g. if broadcast wildcard SETs were used). It would, however, handle DELETE more or less the same way (assuming the tombstone time is configurable and mcrouter nodes don't fail or get rotated, as they don't replicate logs like Kafka). "lockTSE" won't play as well with mcrouter delete()s though, at least not without some trickery.

In any case, if need be, it's workable enough.

@aaron I agree that your goal of propagating purges across the cluster in a simple way is desirable. However, I see that we already have at least two pieces of software written in-house to handle events from Kafka: change-propagation and eventbus. I would like us to explore how hard it would be to use one of those (in particular change-propagation) to handle purge propagation. I am pretty sure change-prop already does that for some systems, btw; @mobrovac can confirm.

We have created the change-propagation service and the resource_change Kafka topic for this purpose (amongst other uses). The idea is that any entity that needs a resource to be purged would emit an event to that topic (MW, RB, etc). That is picked up by the change-propagation service which would send a request to Varnish. No Kafka-specific code is needed at all, since MW can send requests to the HTTP proxy service which validates and enqueues the events.

The system is flexible enough to accommodate any dependent updates you need. Events sent to resource_change can be tagged, and based on that tag we can construct a rule to also execute other requests apart from the purge itself.
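To illustrate the tagging idea, here is a minimal sketch of how a tag could trigger an extra request alongside the purge. It is not the change-propagation rule format: the event shape (a "uri" plus a list of "tags"), the tag name and the request tuples are all assumptions made up for this example.

```python
def requests_for(event, rules):
    """Return the purge request plus any requests whose rule tag matches.

    `event` is a hypothetical resource_change-like dict with a "uri" and
    an optional list of "tags"; `rules` maps a tag to a function that
    builds one extra request for that URI.
    """
    uri = event["uri"]
    requests = [("PURGE", uri)]  # every event purges the resource itself
    for tag, make_request in rules.items():
        if tag in event.get("tags", []):
            requests.append(make_request(uri))
    return requests


# Hypothetical rule: events tagged "wan-cache" also delete a cache key.
rules = {"wan-cache": lambda uri: ("DELETE", "cache:" + uri)}

tagged = {"uri": "/wiki/Foo", "tags": ["wan-cache"]}
untagged = {"uri": "/wiki/Bar"}
```

With these inputs, requests_for(tagged, rules) yields the purge followed by the extra DELETE, while requests_for(untagged, rules) yields only the purge.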

After some discussion we've decided to try integrating with the resource_change infrastructure. I'll create the resulting subtasks.

Change 286588 abandoned by Smalyshev:
[WIP] Add configs for kafka-watcher tool

Reason:
Change of course - we'll try to integrate with recent-changes infrastructure

https://gerrit.wikimedia.org/r/286588

Here is the table of WAN cache operations.
See: https://doc.wikimedia.org/mediawiki-core/master/php/WANObjectCache_8php.html

Note that this assumes https://gerrit.wikimedia.org/r/#/c/304311/ is merged, which tweaks minor set() and lock() calls.

| WAN cache op | Underlying memcached ops | Op scope |
| --- | --- | --- |
| get()/getMulti() | GETS, ADD (check keys) | local DC |
| set() | GET, ADD/CAS (protects tombstones) | local DC |
| getWithSetCallback() | ops in get()/set(), ADD/TOUCH | local DC |
| getCheckKeyTime() | GET/ADD | local DC |
| delete() | SET (tombstone) | all DCs |
| touchCheckKey() | SET (tombstone) | all DCs |
| resetCheckKey() | DELETE | all DCs |

Thus we can tabulate operations into strictly local or global:

| DC-local ops | DC-global ops |
| --- | --- |
| GET | SET |
| GETS | DELETE |
| CAS | |
| ADD | |
| TOUCH | |
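As a sanity check, the two tables above can be encoded and cross-checked with a short sketch. The dictionary simply transcribes the first table (reading "GET/ADD" as both commands), and the scope() helper is a hypothetical name introduced here, not part of WANObjectCache.

```python
# Memcached commands by scope, transcribed from the second table.
GLOBAL_OPS = {"SET", "DELETE"}
LOCAL_OPS = {"GET", "GETS", "CAS", "ADD", "TOUCH"}

# Memcached commands used by each WAN cache method, from the first table.
WAN_CACHE_OPS = {
    "get": {"GETS", "ADD"},
    "getMulti": {"GETS", "ADD"},
    "set": {"GET", "ADD", "CAS"},
    "getWithSetCallback": {"GETS", "GET", "ADD", "CAS", "TOUCH"},
    "getCheckKeyTime": {"GET", "ADD"},
    "delete": {"SET"},
    "touchCheckKey": {"SET"},
    "resetCheckKey": {"DELETE"},
}


def scope(method: str) -> str:
    """A method is global if it issues any broadcast command."""
    ops = WAN_CACHE_OPS[method]
    return "all DCs" if ops & GLOBAL_OPS else "local DC"


# No method mixes local and global commands, so routing purely by
# memcached command type is enough to get the scopes in the table.
for ops in WAN_CACHE_OPS.values():
    assert ops <= LOCAL_OPS or ops <= GLOBAL_OPS
```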

Per https://github.com/facebook/mcrouter/wiki/List-of-Route-Handles, mcrouter nodes can be configured to broadcast SET/DELETE but use local hash routing for everything else.

The only thing I don't like is that a failed DELETE will use a reliable disk stream, but a failed SET (e.g. for tombstones) will not. OTOH, by default, nothing applies the stream anyway, so it's normally just an error log of sorts.
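A rough sketch of such a configuration, generated from Python so it can be checked, might look like this. The pool names and server addresses are invented, and the route handle types (OperationSelectorRoute, AllSyncRoute, PoolRoute) and operation names are used as I understand them from that wiki page, so treat this as an assumption to verify against the mcrouter documentation rather than a tested deployment config.

```python
import json


def build_config(local_pool, all_pools):
    """Build a mcrouter-style config dict.

    SET and DELETE are sent to every pool; all other commands go to the
    local pool only. `all_pools` maps a pool name to its server list.
    """
    broadcast = {
        "type": "AllSyncRoute",  # assumption: one possible broadcast handle
        "children": [f"PoolRoute|{name}" for name in all_pools],
    }
    return {
        "pools": {
            name: {"servers": servers} for name, servers in all_pools.items()
        },
        "route": {
            "type": "OperationSelectorRoute",
            "default_policy": f"PoolRoute|{local_pool}",
            "operation_policies": {"set": broadcast, "delete": broadcast},
        },
    }


# Hypothetical pools and addresses, for illustration only.
config = build_config(
    "dc-a",
    {"dc-a": ["10.0.0.1:11211"], "dc-b": ["10.1.0.1:11211"]},
)
config_json = json.dumps(config, indent=2)
```

Writing config_json to a file would give a starting point for the JSON that mcrouter reads. Whether a synchronous or asynchronous broadcast handle is the better fit for tombstones is a separate decision this sketch does not settle.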

aaron renamed this task from Get cache relay daemon reviewed and usable to WAN cache relay daemon (possibly mcrouter). Aug 12 2016, 4:08 AM
aaron renamed this task from WAN cache relay daemon (possibly mcrouter) to WANOjectCache relay daemon (possibly mcrouter). Aug 12 2016, 4:53 AM
aaron renamed this task from WANOjectCache relay daemon (possibly mcrouter) to WANObjectCache relay daemon (possibly mcrouter). Aug 12 2016, 5:13 AM
aaron moved this task from Backlog to Doing on the Wikimedia-Multiple-active-datacenters board.

Change 304311 had a related patch set uploaded (by Aaron Schulz):
objectcache: add mcrouter support to WANObjectCache

https://gerrit.wikimedia.org/r/304311

Change 304311 merged by jenkins-bot:
objectcache: add mcrouter support to WANObjectCache

https://gerrit.wikimedia.org/r/304311

aaron renamed this task from WANObjectCache relay daemon (possibly mcrouter) to WANObjectCache relay daemon or mcrouter support. Oct 11 2016, 8:10 PM
aaron closed this task as Resolved.
aaron updated the task description.

Closing this unless mcrouter turns out not to work out.

I'm reopening this since the status of the FLOSS mcrouter project over the last year has been dire:

  • It has been a year (!!!) since their last release
  • There is no indication of what could be stable or not
  • The build has changed radically between 0.24.0 (the version I packaged) and the current 0.36.0 (the last tagged version, already 1 year old), and it's broken again. I had to spend almost a week making the first build behave, and it seems I'd need to spend a similar amount of time this time around.

This brings me back to looking at alternatives. @aaron I'm taking a look at Netflix's dynomite, which I'm not sure would do exactly what we want, but right now the situation of the FLOSS version of mcrouter is not such that I can endorse its production use.

Dynomite was evaluated a year ago or so. It should be usable (barring any unforeseen problems). I would just need it to be packaged so I can write a puppet patch to deploy it on labs (and so on). The relaying of cache purges is a Q3 goal for us; is that a time-frame that the ops team can work within? Given how much easier it is to build than mcrouter, I'm hoping that packaging would be less awful.

For reference, there is T156938, for evaluating dynomite.

MediaWiki supports mcrouter (already working for group0 wikis, in the process of being deployed more widely).