@Jgreen points out that Analytics has experience with a mature, high-availability Kafka deployment, so if it fits our use cases, we might consider using that instead of Redis.
Kafka is specialized for one-way log capture, which makes it an excellent match for our fredge event logging and for getting messages off of the payments boxes. Due to PCI, payments boxes aren't even allowed to read from the public queues, so there's no risk of these use cases eventually expanding beyond what Kafka supports. Payments ContributionTracking access is similarly write-only, and would be served well by the Kafka architecture.
All of our more complex requeueing and analytical key-value access use cases take place outside of the payments cluster, where we can simply have Kafka consumers processing the streams and storing to our RDBMS as before. We can continue using MySQL for this storage.
Redis offers a lot more flexibility, but its clustering isn't as mature: Redis 3 is still not a mainstream package for Debian or Ubuntu stable. Kafka clustering is already a known quantity, and Analytics presumably has Puppet code we can crib.
Per IRC discussion, maybe Kafka for the queues that get info off of payments, but Redis where we need key-value stuff:
- limbo messages
  - local to payments wiki
- pending donor data
  - read these from Kafka into a key-value store to serve IPN listener and CRM
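The pending-donor-data flow above could be sketched roughly as follows. This is a minimal in-memory stand-in, not real Kafka or Redis client code; the message shape and key scheme are invented for illustration:

```python
# Sketch: a stream processor reads pending-donor messages off a
# Kafka-like append-only log and writes them into a key-value store
# for random access by the IPN listener and CRM import jobs.
# (Message fields and key format are hypothetical.)

kafka_log = [  # stand-in for a Kafka topic: an append-only message list
    {"gateway": "paypal", "order_id": "1001", "email": "a@example.org"},
    {"gateway": "adyen", "order_id": "2002", "email": "b@example.org"},
]

kv_store = {}  # stand-in for Redis (or MySQL keyed the same way)

def process_pending(log, store):
    """Copy each pending message into the store under a composite key."""
    for msg in log:
        key = f"pending/{msg['gateway']}/{msg['order_id']}"
        store[key] = msg

process_pending(kafka_log, kv_store)

# The IPN listener can now look up the pending record by gateway/order id.
print(kv_store["pending/paypal/1001"]["email"])  # a@example.org
```

The point is the direction of data flow: payments only ever writes to the log, and everything with read/lookup needs lives downstream.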
More things to consider, still from a Redis vs. Kafka perspective:
- Consumers don't have to be transactional, because messages are retained rather than destroyed after consumption. Messages will be consumed at least once.
- Clustering is mature and failover happens transparently.
- We can consider rewriting some of our queue consumers as real-time stream processors, reducing latency and lost execution time.
- No PHP-Queue implementation yet.
- Additional, large piece of infrastructure to maintain.
- Our limbo queues use the "delete" operation, so we would need to implement the antimessage antipattern, and/or have some other component delete pending messages, e.g. when a completed donation comes in.
- Data retention can be set to a specific amount of time, so messages will not be destroyed if there's a consumer outage during high traffic.
- Stores data in-memory, backed by disk.
- Writes are transactional, so we can be sure the message is written.
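To make the first bullet and the limbo-delete bullet concrete, here is a sketch, with invented names and message shapes, of offset-pointer consumption (at-least-once delivery) combined with an antimessage that logically cancels a limbo record without removing it from the log:

```python
# Sketch (hypothetical names): a retained log with per-consumer offset
# pointers, plus "antimessages" that logically delete limbo records.

log = []          # append-only; messages are retained after consumption
offsets = {}      # consumer name -> next offset to read

def produce(msg):
    log.append(msg)

def consume(consumer):
    """Return the start offset and all unread messages. The caller
    commits the offset only after processing; if it crashes first,
    the same messages are delivered again: at-least-once."""
    start = offsets.get(consumer, 0)
    return start, log[start:]

def commit(consumer, offset):
    offsets[consumer] = offset

# Limbo flow: a "delete" becomes an antimessage for the same id.
produce({"id": "42", "type": "limbo", "donor": "x@example.org"})
produce({"id": "42", "type": "limbo-delete"})   # antimessage

start, batch = consume("limbo-processor")
live = {}
for msg in batch:
    if msg["type"] == "limbo":
        live[msg["id"]] = msg
    elif msg["type"] == "limbo-delete":
        live.pop(msg["id"], None)
commit("limbo-processor", start + len(batch))

print(len(live))   # 0: the antimessage cancelled the pending record
print(len(log))    # 2: both messages are still retained in the log
```

Note that "delete" never mutates the log; the downstream materialized view is what forgets the record, which is exactly the reimplementation cost the bullet above warns about.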
I'm starting to see that Kafka can be the layer we use to decouple from any frontend stuff, but doesn't have to play all the roles that ActiveMQ fulfilled. For example, for a queue that needs to be randomly accessible using multiple indexes (e.g. "pending"), we could have a Kafka pipe from the frontend, but a stream processor copies these messages to a Redis store outside of the payments. What I want to get your opinion on is, whether this means that we can provision the Redis or MySQL server at a lower SLA, since a Redis outage in this case is decoupled from payments and will only affect the consuming jobs.
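The "randomly accessible using multiple indexes" requirement could look something like this in the downstream store. Again a hedged sketch with invented field names, using a dict as a Redis stand-in:

```python
# Sketch: a stream processor builds multiple indexes over the same
# pending messages, so the store can be queried by order id or by
# donor email even though the Kafka pipe itself is one-way.
# (Field names and key scheme are hypothetical.)

pipe = [  # messages arriving from the Kafka pipe off the frontend
    {"order_id": "1001", "email": "a@example.org", "amount": 10},
    {"order_id": "2002", "email": "a@example.org", "amount": 25},
]

store = {}  # Redis stand-in; real Redis would use SET/SADD here

for msg in pipe:
    store[f"pending/by-order/{msg['order_id']}"] = msg
    store.setdefault(f"pending/by-email/{msg['email']}", []).append(msg)

# Random access by either index, entirely outside the payments cluster:
print(store["pending/by-order/2002"]["amount"])      # 25
print(len(store["pending/by-email/a@example.org"]))  # 2
```

Since this store can be rebuilt by replaying the retained Kafka topic, an outage here only stalls the consuming jobs, which is the basis for the lower-SLA question above.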
Well, @Ottomata made us aware of an obstacle, but it seems to be the only blocker so far. No one at the WMF uses PHP to produce/consume Kafka messages yet, so we would have to pick a client library and write the integration code from scratch. Pretty minor, but it does introduce a risk that the client library might not be production-ready.
> No one at the WMF uses PHP to produce Kafka messages yet,
Ah! That's not what I said! No one uses PHP to consume Kafka messages. Produce, yes:
Some discouraging developments are documented in T130283#2216191.
We've decided to reopen the decision about which backend data store to use; nothing is emitting an aura of idealness. It may be that Redis 2 plus replication and a manual failover protocol is the most stable and sane path forward.
If we use Kafka, we'll want to stay in sync with @Ottomata's efforts to get the WMF onto Kafka 0.9. It probably won't win us many features, however, because the PHP client library doesn't seem to support native authentication yet. ZooKeeper is still required by the brokers, so that isn't simplified away, either.
Redis 3 doesn't have a record of stable production deployment that I'm aware of, so its automatic failover might be out of reach for now.
N.b.: Assume that the new queue servers will be provisioned with Debian jessie.
We're going to shuffle tasks around in order to focus on the not-at-risk code changes first: getting all of our client code onto the same php-queue library, and possibly enhancing it with a Kafkaesque mode where messages are persistent and consumers are associated with offset pointers. This is necessary anyway, for open-sourceness and to prevent us from coupling to yet another tenuous technology.
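The "Kafkaesque mode" could be as small as the following sketch. The interface and names are hypothetical (this is not php-queue's actual API, and the sketch is in Python rather than PHP): pop() advances a per-consumer offset pointer instead of destroying the message.

```python
# Sketch of a "Kafkaesque" queue mode: messages are persistent, and
# each consumer tracks its own offset pointer into the backlog.
# (Class and method names are invented for illustration.)

class KafkaesqueQueue:
    def __init__(self):
        self.messages = []   # persistent backlog, never deleted on pop
        self.offsets = {}    # consumer name -> offset pointer

    def push(self, msg):
        self.messages.append(msg)

    def pop(self, consumer):
        offset = self.offsets.get(consumer, 0)
        if offset >= len(self.messages):
            return None
        self.offsets[consumer] = offset + 1
        return self.messages[offset]

q = KafkaesqueQueue()
q.push("donation-1")
q.push("donation-2")

print(q.pop("crm-import"))  # donation-1
print(q.pop("audit-job"))   # donation-1: each consumer has its own pointer
print(len(q.messages))      # 2: messages persist after consumption
```

Client code written against an interface like this would port naturally to a real Kafka backend later, which is the decoupling the comment above is after.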
There isn't a clear next step for this task, so we should check in again soon to make sure we keep momentum.
I'll cast my vote for Redis 2. We can have a reasonable solution up in no time, and with the same manual failover characteristics as MySQL, the next point of failure.
In a year or two we might get an upgrade to automatic failover in Redis 3 for free or nearly so.
Please feel welcome, everyone, to cast an opinion or to abstain...