@Jgreen points out that Analytics has experience with a mature, high-availability Kafka deployment, so if it fits our use cases, we might consider using that instead of Redis.
Kafka is specialized for one-way log capture, which makes it an excellent match for our fredge event logging and for getting messages off of the payments boxes. Due to PCI, payments boxes aren't even allowed to read from the public queues, so there's no risk of these use cases eventually expanding beyond what Kafka supports. Payments ContributionTracking access is similarly write-only, and would be served well by the Kafka architecture.
All of our more complex requeueing and analytical key-value access use cases take place outside of the payments cluster, where we can simply have Kafka consumers processing the streams and storing to our RDBMS as before. We can continue using MySQL for this storage.
Redis offers a lot more flexibility, but its clustering isn't as mature: Redis 3 is still not a mainstream package for Debian or Ubuntu stable. Kafka clustering is already a known quantity, and Analytics presumably has Puppet code we can crib.
Per IRC discussion, maybe Kafka for the queues that get info off of payments, but Redis where we need key-value stuff:
- limbo messages
  - local to payments wiki
- pending donor data
  - read these from Kafka into a key-value store to serve IPN listener and CRM
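The pending-donor-data flow above could be sketched roughly as follows. This is a minimal in-memory stand-in, not real Kafka or Redis client code; the message shape and key scheme are invented for illustration:

```python
# Sketch: a stream processor reads pending-donor messages off a
# Kafka-like append-only log and writes them into a key-value store
# for random access by the IPN listener and CRM import jobs.
# (Message fields and key format are hypothetical.)

kafka_log = [  # stand-in for a Kafka topic: an append-only message list
    {"gateway": "paypal", "order_id": "1001", "email": "a@example.org"},
    {"gateway": "adyen", "order_id": "2002", "email": "b@example.org"},
]

kv_store = {}  # stand-in for Redis (or MySQL keyed the same way)

def process_pending(log, store):
    """Copy each pending message into the store under a composite key."""
    for msg in log:
        key = f"pending/{msg['gateway']}/{msg['order_id']}"
        store[key] = msg

process_pending(kafka_log, kv_store)

# The IPN listener can now look up the pending record by gateway/order id.
print(kv_store["pending/paypal/1001"]["email"])  # a@example.org
```

The point is the direction of data flow: payments only ever writes to the log, and everything with read/lookup needs lives downstream.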
More things to consider, still from a Redis vs. Kafka perspective:
- Consumers don't have to be transactional, because messages are retained rather than destroyed after consumption. Messages will be consumed at least once.
- Clustering is mature and failover happens transparently.
- We can consider rewriting some of our queue consumers as real-time stream processors, reducing latency and lost execution time.
- No PHP-Queue implementation yet.
- Additional, large piece of infrastructure to maintain.
- Our limbo queues use the "delete" operation, so we would need to implement the antimessage antipattern, and/or have some other component delete pending messages, e.g. when a completed donation comes in.
- Data retention can be set to a specific amount of time, so messages will not be destroyed if there's a consumer outage during high traffic.
- Stores data in-memory, backed by disk.
- Writes are transactional, so we can be sure the message is written.
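To make the first bullet and the limbo-delete bullet concrete, here is a sketch, with invented names and message shapes, of offset-pointer consumption (at-least-once delivery) combined with an antimessage that logically cancels a limbo record without removing it from the log:

```python
# Sketch (hypothetical names): a retained log with per-consumer offset
# pointers, plus "antimessages" that logically delete limbo records.

log = []          # append-only; messages are retained after consumption
offsets = {}      # consumer name -> next offset to read

def produce(msg):
    log.append(msg)

def consume(consumer):
    """Return the start offset and all unread messages. The caller
    commits the offset only after processing; if it crashes first,
    the same messages are delivered again: at-least-once."""
    start = offsets.get(consumer, 0)
    return start, log[start:]

def commit(consumer, offset):
    offsets[consumer] = offset

# Limbo flow: a "delete" becomes an antimessage for the same id.
produce({"id": "42", "type": "limbo", "donor": "x@example.org"})
produce({"id": "42", "type": "limbo-delete"})   # antimessage

start, batch = consume("limbo-processor")
live = {}
for msg in batch:
    if msg["type"] == "limbo":
        live[msg["id"]] = msg
    elif msg["type"] == "limbo-delete":
        live.pop(msg["id"], None)
commit("limbo-processor", start + len(batch))

print(len(live))   # 0: the antimessage cancelled the pending record
print(len(log))    # 2: both messages are still retained in the log
```

Note that "delete" never mutates the log; the downstream materialized view is what forgets the record, which is exactly the reimplementation cost the bullet above warns about.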
I'm starting to see that Kafka can be the layer we use to decouple from any frontend stuff, but doesn't have to play all the roles that ActiveMQ fulfilled. For example, for a queue that needs to be randomly accessible using multiple indexes (e.g. "pending"), we could have a Kafka pipe from the frontend, but a stream processor copies these messages to a Redis store outside of the payments. What I want to get your opinion on is, whether this means that we can provision the Redis or MySQL server at a lower SLA, since a Redis outage in this case is decoupled from payments and will only affect the consuming jobs.
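The "randomly accessible using multiple indexes" requirement could look something like this in the downstream store. Again a hedged sketch with invented field names, using a dict as a Redis stand-in:

```python
# Sketch: a stream processor builds multiple indexes over the same
# pending messages, so the store can be queried by order id or by
# donor email even though the Kafka pipe itself is one-way.
# (Field names and key scheme are hypothetical.)

pipe = [  # messages arriving from the Kafka pipe off the frontend
    {"order_id": "1001", "email": "a@example.org", "amount": 10},
    {"order_id": "2002", "email": "a@example.org", "amount": 25},
]

store = {}  # Redis stand-in; real Redis would use SET/SADD here

for msg in pipe:
    store[f"pending/by-order/{msg['order_id']}"] = msg
    store.setdefault(f"pending/by-email/{msg['email']}", []).append(msg)

# Random access by either index, entirely outside the payments cluster:
print(store["pending/by-order/2002"]["amount"])      # 25
print(len(store["pending/by-email/a@example.org"]))  # 2
```

Since this store can be rebuilt by replaying the retained Kafka topic, an outage here only stalls the consuming jobs, which is the basis for the lower-SLA question above.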
Well, @Ottomata made us aware of an obstacle, but it seems to be the only blocker so far. No one at the WMF uses PHP to produce/consume Kafka messages yet, so we would have to pick a client library and write the integration code from scratch. Pretty minor, but it does introduce a risk that the client library might not be production-ready.
> No one at the WMF uses PHP to produce Kafka messages yet,
Ah! That's not what I said! No one uses PHP to consume Kafka messages. Produce, yes:
Some discouraging developments are documented in T130283#2216191.
We've decided to reopen the decision about which backend data store to use; nothing is emitting an aura of idealness. It may be that Redis 2 plus replication and a manual failover protocol is the most stable and sane path forward.
If we use Kafka, we'll want to stay in sync with @Ottomata's efforts to get the WMF onto Kafka 0.9. It probably won't win us many features, however, because the PHP client library doesn't seem to support native authentication yet. ZooKeeper is still required by the brokers, so that isn't simplified away, either.
Redis 3 doesn't have a record of stable production deployment that I'm aware of, so its automatic failover might be out of reach for now.
N.b.: Assume that the new queue servers will be provisioned with Debian jessie.
We're going to shuffle tasks around in order to focus on the not-at-risk code changes first: getting all of our client code onto the same php-queue library, and possibly enhancing it with a Kafkaesque mode where messages are persistent and consumers are associated with offset pointers. This is necessary anyway, for open-sourceness and to prevent us from coupling to yet another tenuous technology.
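The "Kafkaesque mode" could be as small as the following sketch. The interface and names are hypothetical (this is not php-queue's actual API, and the sketch is in Python rather than PHP): pop() advances a per-consumer offset pointer instead of destroying the message.

```python
# Sketch of a "Kafkaesque" queue mode: messages are persistent, and
# each consumer tracks its own offset pointer into the backlog.
# (Class and method names are invented for illustration.)

class KafkaesqueQueue:
    def __init__(self):
        self.messages = []   # persistent backlog, never deleted on pop
        self.offsets = {}    # consumer name -> offset pointer

    def push(self, msg):
        self.messages.append(msg)

    def pop(self, consumer):
        offset = self.offsets.get(consumer, 0)
        if offset >= len(self.messages):
            return None
        self.offsets[consumer] = offset + 1
        return self.messages[offset]

q = KafkaesqueQueue()
q.push("donation-1")
q.push("donation-2")

print(q.pop("crm-import"))  # donation-1
print(q.pop("audit-job"))   # donation-1: each consumer has its own pointer
print(len(q.messages))      # 2: messages persist after consumption
```

Client code written against an interface like this would port naturally to a real Kafka backend later, which is the decoupling the comment above is after.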
There isn't a clear next step for this task, so we should check in again soon to make sure we keep momentum.
I'll cast my vote for Redis 2. We can have a reasonable solution up in no time, and with the same manual failover characteristics as MySQL, the next point of failure.
In a year or two we might get an upgrade to automatic failover in Redis 3 for free or nearly so.
Please feel welcome, everyone, to cast an opinion or to abstain...