
Spike: Choose a new backend for queueing
Closed, Resolved · Public · 1 Estimated Story Points · Spike

Description

@Jgreen points out that Analytics has experience with a mature, high-availability Kafka deployment, so if it fits our use cases, we might consider using that instead of Redis.

Draft of our evaluation:
https://docs.google.com/spreadsheets/d/1V2UpHdTH4FaTRQ3SZOoTikyA8HHsjTEYwxMNHu9YsCI/edit?ts=5734dd9c#gid=0


Event Timeline

awight lowered the priority of this task from High to Medium. Mar 17 2016, 9:19 PM
DStrine set the point value for this task to 1. Mar 22 2016, 7:10 PM
DStrine moved this task from Triage to Sprint +1 on the Fundraising-Backlog board.

Kafka is specialized for one-way log capture (1), which makes it an excellent match for our fredge event logging, and for getting messages off of the payments boxes. Due to PCI, payments boxes aren't even allowed to read from the public queues, so there's no question of eventually expanding to more use cases than Kafka supports. Payments ContributionTracking access is similarly write-only, and would be served well by the Kafka architecture.

All of our more complex requeueing and analytical, key-value access use cases take place outside of the payments cluster, where we can simply have Kafka consumers processing the streams and storing the results in MySQL as before.

Redis offers a lot more flexibility, but its clustering isn't as mature: Redis 3 is still not a mainstream package for Debian or Ubuntu stable. Kafka clustering is already a known quantity, and Analytics presumably has Puppet code we can crib.

Per IRC discussion, maybe Kafka for the queues that get info off of payments, but Redis where we need key-value stuff:

  • limbo messages
    • local to payments wiki
  • pending donor data
    • read these from Kafka into a key-value store to serve IPN listener and CRM
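The proposed split above can be sketched roughly as follows. This is a hypothetical illustration, not real integration code: a stream consumer mirrors pending-donor messages from a Kafka-like log into a key-value store keyed by order ID, so the IPN listener and CRM can do random-access lookups. All field and function names here are invented for the example.

```python
# Hypothetical sketch: mirror pending-donor messages from an append-only
# log (standing in for a Kafka topic) into a key-value store, so the IPN
# listener and CRM can look donors up by order ID. No real Kafka client
# is used; names are illustrative.

def mirror_pending(log, kv_store):
    """Copy each pending-donor message into the key-value store."""
    for message in log:
        kv_store[message["order_id"]] = message
    return kv_store

log = [
    {"order_id": "1001.1", "email": "donor@example.org", "amount": "5.00"},
    {"order_id": "1002.1", "email": "other@example.org", "amount": "10.00"},
]
pending = mirror_pending(log, {})
# The IPN listener can now fetch a donor record directly:
print(pending["1001.1"]["amount"])  # prints 5.00
```

Because the mirror job runs outside the payments cluster, an outage of the key-value store only stalls the consumers, not payments itself.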

More things to consider, still from a Redis vs. Kafka perspective.

Kafka pros

  • Consumer doesn't have to be transactional because messages are retained and not destroyed after consumption. Messages will be consumed at-least-once.
  • Clustering is mature and failover happens transparently.
  • We can consider rewriting some of our queue consumers as real-time stream processors, reducing latency and wasted execution time.
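The first pro above is worth spelling out. In the Kafka model the broker retains messages and each consumer just advances an offset pointer; if the consumer crashes before committing, it re-reads from the last committed offset, giving at-least-once delivery without a transactional consumer. A minimal sketch of that loop, with all names invented for illustration:

```python
# Sketch of offset-based, at-least-once consumption: the log is never
# destroyed by reading; the consumer processes a message and only then
# advances its committed offset. A crash between process() and commit()
# replays the last message, so handlers must tolerate duplicates.

def consume(log, offset, process, commit):
    while offset < len(log):
        process(log[offset])          # side effect happens first...
        offset = commit(offset + 1)   # ...then the offset is advanced
    return offset

seen = []
log = ["msg-a", "msg-b", "msg-c"]
final_offset = consume(log, 0, seen.append, lambda o: o)
```

This is why the consumer "doesn't have to be transactional": correctness moves from the queue into making the handlers idempotent.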

Kafka cons

  • No PHP-Queue implementation yet.
  • Additional, large piece of infrastructure to maintain.
  • Our limbo queues use the "delete" operation, so we would need to reimplement the antimessage antipattern, and/or have some other component delete pending messages e.g. when a completed donation comes in.
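The antimessage idea in the last con can be sketched concretely. Since an append-only log has no "delete", a deletion becomes one more record: a tombstone appended for the same key, which consumers honor when folding the stream. This is an illustrative sketch only; the key and field names are invented.

```python
# Hypothetical antimessage fold for the limbo queue: instead of deleting
# a record in place, append a tombstone ("antimessage") for its key.
# Consumers fold the stream into a map and drop tombstoned entries.

def fold_limbo(stream):
    live = {}
    for record in stream:
        if record.get("antimessage"):
            live.pop(record["key"], None)   # tombstone: forget the entry
        else:
            live[record["key"]] = record
    return live

stream = [
    {"key": "ct-77", "status": "limbo"},
    {"key": "ct-78", "status": "limbo"},
    {"key": "ct-77", "antimessage": True},  # donation completed elsewhere
]
live = fold_limbo(stream)
# Only ct-78 remains pending.
```

The "some other component" mentioned above would be whatever appends the tombstone when a completed donation arrives.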

Kafka same

  • Data retention can be configured for a specific period of time, so messages will not be destroyed if there's a consumer outage during high traffic.
  • Stores data in memory, backed by disk.
  • Writes are transactional, so we can be sure each message is written.
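The retention point deserves a concrete picture: messages survive a consumer outage as long as the outage is shorter than the retention window. A toy pruner, purely illustrative, with an assumed one-week window:

```python
# Toy time-based retention: messages older than the window are pruned,
# everything newer survives regardless of whether any consumer has read
# it yet. Window length and timestamps are invented for the example.

RETENTION_SECONDS = 7 * 24 * 3600   # e.g. keep one week of messages

def prune(log, now):
    """Drop messages older than the retention window."""
    return [m for m in log if now - m["ts"] < RETENTION_SECONDS]

log = [
    {"ts": 0,      "body": "old"},      # well past the window
    {"ts": 600000, "body": "recent"},   # still inside the window
]
kept = prune(log, now=650000)
```

So a day-long consumer outage during a banner campaign loses nothing, as long as retention is set comfortably longer than any plausible outage.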

@Jgreen
I'm starting to see that Kafka can be the layer we use to decouple from any frontend stuff, but it doesn't have to play all the roles that ActiveMQ fulfilled. For example, for a queue that needs to be randomly accessible using multiple indexes (e.g. "pending"), we could have a Kafka pipe from the frontend, with a stream processor copying these messages to a Redis store outside of the payments cluster. What I want to get your opinion on is whether this means we can provision the Redis or MySQL server at a lower SLA, since a Redis outage in this case is decoupled from payments and will only affect the consuming jobs.

Rather than provisioning more systems, could we add a database on the fundraisingdb cluster?

Change 279744 had a related patch set uploaded (by Awight):
[WIP] Mirror to Kafka

https://gerrit.wikimedia.org/r/279744

Change 279745 had a related patch set uploaded (by Awight):
[WIP] looking at direct kafka integration.

https://gerrit.wikimedia.org/r/279745

Well, @Ottomata made us aware of an obstacle, but it seems to be the only blocker so far. No one at the WMF uses PHP to produce or consume Kafka messages yet, so we would have to pick a client library and write the integration code from scratch. Pretty minor, but it does introduce a risk that the client library might not be production-ready.

awight moved this task from Review to Done on the Fundraising Sprint Freshmaking board.

No one at the WMF uses PHP to produce Kafka messages yet,

Ah! That's not what I said! No one uses PHP to consume Kafka messages. Produce, yes:

Whoops, thanks for catching that slip! Yes, that's what I heard, just not what I wrote ;-)

Some discouraging developments are documented in T130283#2216191.

We've decided to reopen the decision about which backend data store to use; nothing is emitting an aura of idealness. It may be that Redis 2 plus replication and a manual failover protocol is the most stable and sane path forward.

If we use Kafka, we'll want to stay in sync with @Ottomata's efforts to get WMF using Kafka 0.9. It probably won't win us many features, however, because the PHP client library doesn't seem to support native authentication yet. ZooKeeper is required by the brokers, so that isn't simplified away, either.

Redis 3 doesn't have a record of stable production deployment that I'm aware of, so its automatic failover might be out of reach for now.

N.b.: Assume that the new queue servers will be provisioned with Debian jessie.

We're going to shuffle tasks around in order to focus on the not-at-risk code changes first, getting all of our client code on the same php-queue library and possibly enhancing it with a Kafkaesque mode where messages are persistent and consumers are associated with offset pointers. This is necessary anyway, for open sourceness and to prevent us from coupling to yet another tenuous technology.
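The "Kafkaesque mode" floated above can be sketched as a small queue abstraction (shown here in Python rather than PHP, with hypothetical class and method names): messages stay in the backing store, and each named consumer keeps its own offset pointer instead of destructively popping.

```python
# Sketch of a persistent queue with per-consumer offset pointers, the
# enhancement discussed for the shared php-queue layer. Illustrative
# only: names and the API shape are invented for this example.

class PersistentQueue:
    def __init__(self):
        self.messages = []   # append-only message log
        self.offsets = {}    # per-consumer offset pointers

    def push(self, message):
        self.messages.append(message)

    def pop(self, consumer):
        """Return the next unseen message for this consumer, or None."""
        offset = self.offsets.get(consumer, 0)
        if offset >= len(self.messages):
            return None
        self.offsets[consumer] = offset + 1
        return self.messages[offset]

q = PersistentQueue()
q.push("donation-1")
q.push("donation-2")
# Two consumers read independently; neither destroys the other's data.
first_for_crm = q.pop("crm")
first_for_audit = q.pop("audit")
```

The point of the abstraction is that client code written against it would work unchanged whether the backend ends up being Kafka, Redis, or something else.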

There isn't a clear next step for this task, so we should check in again soon to make sure we keep momentum.

@cwdent is throwing Postgres into the ring... we can also consider MySQL; all the queue-like things we do are easy to emulate.

News flash: MySQL eliminated as a candidate, by firing squad.

awight renamed this task from Spike: Investigate suitability of Kafka instead of Redis to Spike: Choose a new backend for queueing. Apr 25 2016, 9:04 PM
awight removed awight as the assignee of this task. May 4 2016, 9:53 PM

Change 279744 abandoned by Awight:
[WIP] Mirror to Kafka

https://gerrit.wikimedia.org/r/279744

I'll cast my vote for Redis 2. We can have a reasonable solution up in no time, and with the same manual failover characteristics as MySQL, the next point of failure.

In a year or two we might get an upgrade to automatic failover in Redis 3 for free or nearly so.

Please feel welcome, everyone, to cast an opinion or to abstain...

I'm provisionally decreeing Redis 2 to be the winner. Please lodge any final complaints here, before the end of the sprint...

Change 279745 abandoned by Awight:
[WIP] looking at direct kafka integration.

https://gerrit.wikimedia.org/r/279745

Restricted Application changed the subtype of this task from "Task" to "Spike". Feb 13 2020, 9:47 PM