For memcached purges (and, later, CDN purges) to happen reliably both within and across datacenters, we will need a pub/sub system. Ori, Andrew, and I have been talking about using Kafka for this.
A modest 3-node setup in each DC is enough for HA, and the purge streams will consist of small JSON messages that are quickly consumed, so we should not need a huge amount of disk.
A few notes about the usage:
* The varnish and memcached nodes in codfw will be subscribed to the purge stream cluster in eqiad
* The varnish and memcached nodes in eqiad will be subscribed to the purge stream cluster in codfw for completeness and simpler fail-over (though little traffic should come this way)
* MediaWiki will be the only initial producer of memcached and varnish purge JSON messages
* All subscribers are thus cross-DC
* Since producers always talk to their local Kafka cluster, latency should not be an issue, though MediaWiki's DeferredUpdates class can be put to use if needed
* Messages only convey purges, not new values, so they are very small
* The rate of purges is normally tied to the rate of editing across all sites, which is low (<< 100 Hz)
* Maintenance scripts sometimes trigger bursts of purges, which should still be fine but is worth more thought than normal editing
* The cluster will likely be expanded and used for the larger event bus project down the road...
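To make the "small JSON messages" point concrete, here is a minimal sketch of what a producer-side purge message might look like. The topic name, field names, and schema below are all hypothetical, nothing has been decided; the kafka-python calls in the comment are just one way a producer could do the local-cluster write:

```python
import json
import time

def make_purge_message(keys, dc="eqiad"):
    """Build a small JSON purge message. It conveys only the keys/URLs to
    invalidate, never new values, so it stays tiny. The field names here
    are a made-up illustration, not an agreed format."""
    return json.dumps({
        "type": "purge",
        "dc": dc,            # originating datacenter
        "ts": time.time(),   # event time, lets consumers discard stale purges
        "keys": keys,        # memcached keys and/or URLs to invalidate
    }).encode("utf-8")

# Producing would then be a write to the local Kafka cluster, e.g. with
# kafka-python (topic name "purge.eqiad" is hypothetical):
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("purge.eqiad", make_purge_message(["enwiki:page:12345"]))
```

Even a purge carrying several keys stays well under a kilobyte, which is what keeps the disk and bandwidth requirements modest.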
Since varnish and memcached can themselves handle high purge rates, it would be nice not to bottleneck them with the bus, even if we don't *normally* purge at high rates. There is a lot of room for discussion about disk type and RAM; I'd defer to Andrew on those.
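One way to avoid bottlenecking the caches during a maintenance-script burst is for subscribers to drain the stream and coalesce keys into batched deletes rather than issuing one delete per message. A rough sketch, assuming the same hypothetical message schema as above and a made-up batch-size knob (the consumer loop in the comment uses kafka-python):

```python
import json

def batch_purge_keys(raw_messages, batch_size=500):
    """Decode purge messages and chunk their keys, so a burst of purges
    becomes a few large memcached delete batches instead of one round
    trip per message. batch_size is an illustrative knob, not a tuned value."""
    keys = []
    for raw in raw_messages:
        keys.extend(json.loads(raw)["keys"])
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

# A cross-DC subscriber (e.g. codfw reading eqiad's stream) would wrap this
# in a consumer loop, roughly:
#
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("purge.eqiad",
#                            bootstrap_servers="codfw-kafka:9092")
#   for batch in batch_purge_keys(m.value for m in consumer):
#       memcached_client.delete_multi(batch)  # client API is an assumption
```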
Some notes were also logged at https://etherpad.wikimedia.org/p/KafkaPurge