Page MenuHomePhabricator

Check eventbus Kafka cluster settings for reliability
Closed, ResolvedPublic

Description

The presentation at http://www.slideshare.net/ConfluentInc/when-it-absolutely-positively-has-to-be-there-reliability-guarantees-in-kafka-gwen-shapira-jeff-holoman makes a set of recommendations for Kafka use cases that require reliable delivery. A summary is available on the last slide:

We are currently running with two brokers, which is less than the recommended minimum of three. As a result, we can't set min.insync.replicas = 2 without losing redundancy.

Should we consider adding a third broker? If we set Acks = all, does this mean that we'll block for acks from both replicas if they are considered in-sync?

Another article that goes in depth in the topic: http://126kr.com/article/5443qzxtcfn

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 2 2016, 8:38 PM
GWicke updated the task description. (Show Details)
GWicke edited subscribers, added: Ottomata, Pchelolo; removed: Aklapper.

+1 on adding a third broker. Eventbus has increasing importance in our infra, so we need to start working onits reliability and fault tolerance.

Pchelolo added a comment.EditedSep 6 2016, 12:42 AM

Another potential improvement on the reliability front might be improving the extension. Right now if connection to the producer service times out, or write to kafka failed for some reason, the event is never produced. It doesn't happen a lot, but from time to time its possible (according to logs) and we've never actually tested what happens if the EventBus service is down.

Recently I've came across an article that proposes a solution we could use in the extension (somewhat to the end of the article, search for Async Rabbit). The idea is that in their system they write an event to the DB as a part of the same transaction as main data, and then they have an async work that goes over the DB table with events, and tries to send them, removing from the DB table only upon confirmed success (which for us would be 200 from the producer service with require_ack=2 with 3 brokers - it will guarantee that event is in kafka)

I've played with this approach a bit today, but my general lack of PHP knowledge and mediawiki core knowledge don't let me say for sure if it's feasible in our case. What do you think?

I like async rabbit idea in general; I did a similar thing for a message
delivery system at my last job. There, we did an INSERT DELAYED query that
inserted all the data into a catch all DB table and then proceeded to do
other stuff. If the other stuff (message delivery) succeeded, it would
then update a delivered_at timestamp/boolean field in the catcha all DB
table to indicate that the message was delivered. Then, a background job
would crawl the catch all table for records where delivered_at is NULL, and
attempt to redeliver the data. For us, ‘message deliver’ was multiple
inserts into different MySQL tables, all of which had unique keys we could
use for ‘exactly once’ delivery. For Kafka/EventBus, the best we could get
at this stage is at least once.

Also, note that eventlogging-service-eventbus is currently writing any
failed events to a local file, from which we can manually replay the
events. We could adapt something here to automatically replay the events
in this file. I had even considered making the failed events go to the
analytics-eqiad Kafka cluster instead of just a local file. But, this just
gets more and more hacky. If eventlogging-service-eventbus is up, it
should deliver events to Kafka. Petr and I found that this is STILL (I am
very annoyed at this) not the case during broker restarts. I haven’t been
able to reproduce any problems with broker restarts outside of production,
and I’ve run out of time to keep investigating this issue this quarter.

Anyway, blabla, let’s talk more in EventBus sync tomorrow. +1 to 3 brokers
for main-* Kafka clusters.

See T120242: Reliable (atomic) MediaWiki event production for earlier discussion on the same topic. We discussed writing events to an internal table as part of the primary transaction, followed by an attempt to send the message to EventBus & delete the message from the DB after commit. A background task (possibly part of the production logic) would retry DB events that are older than some retry delay.

Milimetric triaged this task as Normal priority.Sep 19 2016, 3:51 PM
Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board.
Pchelolo moved this task from Backlog to watching on the Services board.Oct 12 2016, 5:07 PM
Pchelolo edited projects, added Services (watching); removed Services.

Change 324200 had a related patch set uploaded (by Ottomata):
Set min.insync.replicas to 2 for main kafka cluster in production

https://gerrit.wikimedia.org/r/324200

Change 324200 merged by Ottomata:
Set min.insync.replicas to 2 for main kafka cluster in production

https://gerrit.wikimedia.org/r/324200

Change 324801 had a related patch set uploaded (by Ottomata):
Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer

https://gerrit.wikimedia.org/r/324801

Change 324801 merged by Ottomata:
Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer

https://gerrit.wikimedia.org/r/324801

I reverted back to min.insync.replicas=1.

During one of my kafka broker restarts today, I saw produce requests fail because there were < 2 insync replicas for some topics. From what the documentation says, the shouldn't be the case for our eventbus topics, since at the moment the eventbus producer is using the default acks=1. I believe the errors I saw were for Kafka's __consumer_offsets topics.

We should talk about this idea in general, I'm not so sure it is worth it for our cluster. Let's talk on Monday.

Qs:

  • With acks=all, if leader gets a producer request but the replicas don't ack, does the message still get written to the leader's logs? Will it be replicated if possible later? What happens to the message offsets on the replicas if not?
  • When`min.insync.replicas=2`, why did we see produce requests being rejected if our producers where defaulting to acks=1?
  • During 1 broker restart, why did we see produce requests being rejected at all? in-sync replicas should not have dropped to < 2.
elukey added a subscriber: elukey.Dec 5 2016, 5:06 PM
Pchelolo updated the task description. (Show Details)Dec 22 2016, 12:51 AM
Pchelolo added a comment.EditedDec 22 2016, 2:53 AM

@Ottomata I've found an excellent article that answers your first question. Highly recommended: http://126kr.com/article/5443qzxtcfn

With acks=all, if leader gets a producer request but the replicas don't ack, does the message still get written to the leader's logs? Will it be replicated if possible later? What happens to the message offsets on the replicas if not?

If there was less insync replicas then the min.insync.replicas=2 before the message comes to the leader broker it's not written to the log and NOT_ENOUGH_REPLICAS error is emitted. If by the time message comes and there's enough insync replicas, the message is written to the log and it's made readable to the consumers. Then the leader starts waiting for acks from ALL of the brokers. Then 2 things can happen:

  1. All brokers responded, number of acks is obviously larger then min.insync.replicas, success is reported to the client.
  2. During the waiting period some replicas answered, but some fell out of the insync list (because the default wait timeout is 5 minutes while default timeout for being insync is much less). So in our case if BOTH other replicas die after a message come before they could ack it, the NOT_ENOUGH_REPLICAS_AFTER_APPEND error is sent. Note, that since the message was already made readable to the clients messages can duplicate in this scenario.

When`min.insync.replicas=2`, why did we see produce requests being rejected if our producers where defaulting to acks=1?

That is weird. With acks=1 the minimum number of insync replicas shouldn't have any effect at all. Even if we've restarted a partition leader a new leader should be elected.. Did we enable graceful shutdown on our brokers?

During 1 broker restart, why did we see produce requests being rejected at all? in-sync replicas should not have dropped to < 2.

Hm... Maybe bug in the python driver? I've been playing with these settings and ChangeProp most of the day today and everything works like a charm with the node driver (after I've fixed a bug in it)

Nuria moved this task from Backlog (Later) to Dashiki on the Analytics board.Feb 9 2017, 5:04 PM
Nuria added a subscriber: Nuria.

Related to kafka tier-1 goal for next year

I think I already asked this question to @Ottomata but I forgot the answer :D, so I am going to ask again: any reason to keep the replication factor to 2 (rather than 3) for the following topics?

elukey@kafka1002:~$ kafka topics --describe | grep ReplicationFactor:2
Topic:change-prop.retry.mediawiki.page_delete	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:change-prop.retry.mediawiki.page_move	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:change-prop.retry.mediawiki.page_restore	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:change-prop.retry.mediawiki.revision_create	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:change-prop.retry.mediawiki.revision_visibility_set	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:change-prop.retry.mediawiki.user_block	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:change-prop.retry.resource_change	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.page_delete	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.page_edit	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.page_move	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.page_restore	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.repage_move	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.revision_create	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.revision_visibility_set	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:mediawiki.user_block	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:resource_change	PartitionCount:1	ReplicationFactor:2	Configs:
Topic:test	PartitionCount:2	ReplicationFactor:2	Configs:
Topic:test.event	PartitionCount:1	ReplicationFactor:2	Configs:
fdans moved this task from Dashiki to Wikistats Production on the Analytics board.Jul 27 2017, 3:57 PM
Milimetric closed this task as Resolved.Apr 2 2018, 3:47 PM
Milimetric claimed this task.
Milimetric added a subscriber: Milimetric.

reopen if necessary