For memcached purges (and later CDN purges) to happen reliably both within and across datacenters, we will need a pubsub system. Ori, Andrew, and I have been talking about using Kafka for this.
A modest 2-node setup in each DC is enough for HA. The purge streams will consist of small JSON messages that are quickly consumed, so we should not need a huge amount of disk.
A few notes about the usage:
* The varnish and memcached nodes in codfw will be subscribed to the purge stream cluster in eqiad
* The varnish and memcached nodes in eqiad will be subscribed to the purge stream cluster in codfw for completeness and simpler fail-over (though little traffic should come this way)
* MediaWiki will be the only initial producer of memcached and varnish purge JSON messages
* All subscribers are thus cross-DC
* Since producers always talk to local Kafka clusters, latency should not be an issue, though the DeferredUpdates class for MediaWiki can be put to use if needed
* Messages only convey purges, not new values, so they are very small
* The rate of purges is normally tied to the rate of editing across all sites, which is low (<< 100 Hz)
* Maintenance scripts sometimes trigger lots of purges, which should still be fine, but deserves more thought than the normal editing load
* The cluster will likely be expanded and used for the larger event bus project down the road...
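To make the points above about message size and cross-DC subscription concrete, here is a rough sketch. The message schema has not been decided in this task, so the field names (`type`, `key`, `ts`), the example key, and the helper names are all hypothetical.

```python
import json
import time

# Hypothetical mapping: consumers in each DC subscribe to the purge
# stream cluster in the *other* DC, as described in the notes above.
REMOTE_CLUSTER = {"eqiad": "codfw", "codfw": "eqiad"}


def make_purge_message(kind, key, now=None):
    """Build a small purge-only JSON message (no new value is carried).

    The field names here are assumptions, not an agreed schema.
    """
    if kind not in ("memcached", "varnish"):
        raise ValueError("unknown purge kind: %r" % kind)
    payload = {
        "type": kind,
        "key": key,
        "ts": time.time() if now is None else now,
    }
    # Compact separators keep the encoded message as small as possible.
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")


def subscribe_target(local_dc):
    """Return the DC whose purge stream cluster a local consumer reads."""
    return REMOTE_CLUSTER[local_dc]


msg = make_purge_message("memcached", "enwiki:page:12345", now=1446500000)
print(len(msg), msg)
print("codfw consumers read from:", subscribe_target("codfw"))
```

A message of this shape is well under 100 bytes, which is the basis for the "very small" claim; a real schema with more metadata would be somewhat larger.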
Since varnish and memcached themselves can handle high purge rates, it would be nice not to bottleneck them with the bus too much, even if we don't purge at high rates *normally*. There is a lot of room for discussion about disk type and RAM. I'd defer to Andrew on those.
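As input to the disk discussion, a back-of-envelope estimate of retained data per broker can be sketched as below. Only the 100 messages/second ceiling comes from the notes above; the 200-byte message size, the 7-day retention window, and the 10,000 messages/second maintenance-script burst are assumptions picked for illustration, and the estimate ignores Kafka's own index and log overhead.

```python
DAY = 24 * 60 * 60  # seconds


def retained_bytes(msgs_per_sec, bytes_per_msg, seconds):
    """Bytes kept on one broker holding a full copy of the stream."""
    return msgs_per_sec * bytes_per_msg * seconds


# Normal editing: upper bound of 100 msgs/s, assumed 200 bytes each,
# assumed 7-day retention.
normal = retained_bytes(100, 200, 7 * DAY)

# Assumed maintenance-script burst: 10,000 msgs/s sustained for one hour.
burst = retained_bytes(10_000, 200, 60 * 60)

print("normal week: %.1f GB" % (normal / 1e9))
print("one-hour burst: %.1f GB" % (burst / 1e9))
```

Under these assumptions a week of normal traffic is on the order of 12 GB per broker, and an hour-long burst adds several GB more, so either disk size under discussion would leave plenty of headroom.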
Some notes were also logged at https://etherpad.wikimedia.org/p/KafkaPurge
The consumer "pull" logic would be ported from the redis prototype at https://git.wikimedia.org/tree/mediawiki%2Fservices%2Fpython-cache-relay
Update of task from discussion:
eqiad (one of the below): comments from @gwicke support use of the spares for at least a year of projected use.
[x] - allocate a spare single-CPU, out-of-warranty R610 system, swap in the 250GB disks per @ottomata's request
[] - purchase a single cpu system, priced on T117240
codfw (one of the below): neither option selected yet. The preference depends on which costs less in the budget: ordering a new single-CPU machine or using an over-provisioned spare.
[] - allocate an over-provisioned Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32 GB Memory, (2) 500GB Disks (we have 4 remaining: two out of warranty as of this year, and two whose warranty expires in January of 2016)
[] - purchase a single cpu system, priced on T117240