Redesign EventStreams for better multi-dc support
Closed, ResolvedPublic

Description

During an outage of the main-kafka cluster, the producer (Event-Platform) and the consumers (ChangeProp and WMF-JobQueue) were switched to the backup data center in a matter of minutes by simply depooling the primary datacenter for the eventbus. For EventStreams, however, this caused an outage, because both the eqiad and codfw instances are currently configured to listen to events in eqiad. The switchover also required a puppet patch and a puppet run across the cluster. On top of that, switching back to eqiad disrupted clients: the offsets got reset, so clients couldn't reconnect.

We need to think of ways to improve the situation and allow automatic switchover. Perhaps we could deprecate by-offset reconnection and only support by-timestamp reconnection and lower the expectations about duplicated/lost events during the switchover.

Event Timeline

fdans moved this task from Incoming to Event Platform on the Analytics board.

Perhaps we could deprecate by-offset reconnection and only support by-timestamp reconnection and lower the expectations about duplicated/lost events during the switchover.

Hm, maybe we can do both? The Last-Event-ID is set by KafkaSSE. We could include the timestamp for each message in the Last-Event-ID, as well as the offset. If on reconnect, the client provides Last-Event-ID with both offset and timestamp, and the offset doesn't exist, then instead of resetting to latest, we could attempt to look up the offset for the provided timestamp. This would allow offsets to work if they exist in Kafka, but fall back to timestamps if not.

This might get weird if the provided offset does happen to exist in both Kafka clusters, but doesn't correspond to the same message, which is possible.
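The fallback described above could be sketched roughly as follows. This is a minimal in-memory model, not KafkaSSE code: `Message`, `offset_for_timestamp` and `resolve_start_offset` are hypothetical names, and a partition is modelled as a plain list instead of a real Kafka client.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    offset: int
    timestamp_ms: int  # producer-set message timestamp


def offset_for_timestamp(log: List[Message], timestamp_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= timestamp_ms, or None if there is none."""
    for msg in log:
        if msg.timestamp_ms >= timestamp_ms:
            return msg.offset
    return None


def resolve_start_offset(
    log: List[Message],
    offset: Optional[int] = None,
    timestamp_ms: Optional[int] = None,
) -> int:
    """Use the offset if it exists in this log, else the timestamp, else latest."""
    earliest = log[0].offset if log else 0
    latest = log[-1].offset + 1 if log else 0
    if offset is not None and earliest <= offset < latest:
        return offset
    if timestamp_ms is not None:
        found = offset_for_timestamp(log, timestamp_ms)
        if found is not None:
            return found
    return latest  # nothing usable: reset to latest, as happens today


# Same five messages in two clusters, stored at different offsets.
T0 = 1_533_000_000_000
eqiad = [Message(1000 + i, T0 + 10 * i) for i in range(5)]
codfw = [Message(40 + i, T0 + 10 * i) for i in range(5)]

print(resolve_start_offset(eqiad, offset=1003, timestamp_ms=T0 + 30))
print(resolve_start_offset(codfw, offset=1003, timestamp_ms=T0 + 30))
```

In this model the offset wins whenever it falls inside the log's range. That is exactly the weird case mentioned above: if codfw also had an offset 1003 holding a different message, the client would silently resume from the wrong place.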

What exactly is the benefit of providing the exact offset? The timestamp has millisecond resolution and in practice, our topics have low enough traffic for a millisecond timestamp to be a very definitive id for a particular message.

Hm, you might be right. Especially for EventStreams, it is unlikely that consumers will care about resuming from an exact offset. I'd like to keep offset resume supported, but we could make the timestamp the default in Last-Event-ID. That is, someone will need to manually supply the offset in Last-Event-ID if they want to use the offset. KafkaSSE will automatically set it to the timestamp instead.

If a message arrives late, there are scenarios where a consumer will get a lot of extra messages, or may miss messages, by using timestamp instead of offset. I doubt those scenarios will be much of a problem for EventStreams though.
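To make the late-arrival scenarios concrete, here is a small sketch. It assumes that a timestamp lookup returns the earliest offset whose timestamp is at or after the requested one (which is how Kafka's offsets-for-times lookup behaves), and that a client resumes from one millisecond after the last timestamp it saw. The second point is an assumption for illustration only; the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

# (offset, timestamp_ms) pairs in log order. Offset 3 arrived late: its
# producer-set timestamp is older than those of messages written before it.
log: List[Tuple[int, int]] = [(0, 100), (1, 200), (2, 300), (3, 150), (4, 400)]


def first_offset_at_or_after(
    log: List[Tuple[int, int]], timestamp_ms: int
) -> Optional[int]:
    """Earliest offset in log order whose timestamp is >= timestamp_ms."""
    for offset, ts in log:
        if ts >= timestamp_ms:
            return offset
    return None


def resume(log: List[Tuple[int, int]], last_seen_ts: int) -> List[int]:
    """Offsets delivered when resuming just after last_seen_ts."""
    start = first_offset_at_or_after(log, last_seen_ts + 1)
    if start is None:
        return []
    return [offset for offset, _ in log if offset >= start]


# Client last saw the late message (offset 3, ts 150): offsets 1-3 are replayed.
print(resume(log, 150))
# Client last saw offset 2 (ts 300): the late message at offset 3 is skipped.
print(resume(log, 300))
```

With in-order timestamps neither effect appears, which is why this should matter little for the current streams.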

If a message arrives late, there are scenarios where a consumer will get a lot of extra messages

Arrives where? The timestamp we use is the timestamp of the message being written into kafka AFAIK, and I don't think it's realistic that MW->EventBus->Kafka latency will ever get super high unless we are having an outage, so I don't think this would be a problem.

The message timestamp should be set by the producer (Question for self: is EventBus doing this? it should), and it should correspond to meta.dt. I agree that MW->EventBus->Kafka latency is unlikely to get high, but what about future stream-processed events? E.g. if we wanted revisions-score's timestamp to correspond with the revision-create timestamp (not saying we do, but one could imagine something like that making sense).

But I agree that this isn't a problem for our current events, and multi-DC support is more important than precise message resumption for EventStreams.

The message timestamp should be set by the producer (is EventBus doing this? it should), and it should correspond to meta.dt

Yes, we have message.timestamp.type=CreateTime, meaning it's set by the producer.

if we wanted revisions-score's timestamp to correspond with the revision-create timestamp (not saying we do, but one could imagine something like that making sense).

We control the timestamp we set, so in this discussion it's merely an ID of an event to restart consuming from, so I think this doesn't really matter. The only important part is that the resolution of a timestamp is fine-grained enough for us to be pretty sure we're not skipping/re-consuming a lot of messages.

Anyway, I think we have consensus here. Let's switch to timestamps and make the offsets a fallback?

I'm going to add a config option to KafkaSSE: useTimestampForId. This will instruct KafkaSSE to set either the timestamp or the offset in the SSE id field, but not both. If the client really wants to use the offset, it can override the Last-Event-ID header to carry the offset.
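As a rough illustration of what such an option might do, the sketch below builds the SSE id field from either the timestamp or the offset. The JSON shape (a list of per-partition objects), the field names, the topic name and the `sse_id` / `sse_frame` helpers are all assumptions made for this example, not the KafkaSSE implementation. Only the `id:` / `data:` frame layout comes from the Server-Sent Events format itself.

```python
import json
from typing import Dict, List


def sse_id(assignments: List[Dict], use_timestamp_for_id: bool) -> str:
    """Serialize per-partition positions, carrying timestamp OR offset, never both."""
    key = "timestamp" if use_timestamp_for_id else "offset"
    return json.dumps(
        [
            {"topic": a["topic"], "partition": a["partition"], key: a[key]}
            for a in assignments
        ]
    )


def sse_frame(event_id: str, data: str) -> str:
    """One Server-Sent Events frame: id and data fields, ended by a blank line."""
    return f"id: {event_id}\ndata: {data}\n\n"


# Hypothetical position of the message that was just delivered.
position = [
    {
        "topic": "example.topic",
        "partition": 0,
        "offset": 1003,
        "timestamp": 1_533_000_000_030,
    }
]

print(sse_frame(sse_id(position, use_timestamp_for_id=True), '{"hello": "world"}'))
```

A client that still wants offset-based resumption would send its own Last-Event-ID header built with the offset key instead of relying on the id it was handed.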

Change 449736 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Update KafkaSSE to 72a9e95 (0.3.0)

https://gerrit.wikimedia.org/r/449736

Change 449758 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Set useTimestampForId to true to support multi DC with Kafka message timestamps

https://gerrit.wikimedia.org/r/449758

Change 449736 merged by Ottomata:
[mediawiki/services/eventstreams@master] Update KafkaSSE to 72a9e95 (0.3.0)

https://gerrit.wikimedia.org/r/449736

Change 449758 merged by Ottomata:
[mediawiki/services/eventstreams@master] Set useTimestampForId to true to support multi DC with Kafka message timestamps

https://gerrit.wikimedia.org/r/449758

Change 449765 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Update README and bump version to 0.1.0

https://gerrit.wikimedia.org/r/449765

Change 449765 merged by Ottomata:
[mediawiki/services/eventstreams@master] Update README and bump version to 0.1.0

https://gerrit.wikimedia.org/r/449765

Change 451081 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] EventStreams now supports multi DC, but still run from main-eqiad

https://gerrit.wikimedia.org/r/451081

Change 451081 merged by Ottomata:
[operations/puppet@production] Fix comment about EventStreams active/active mode

https://gerrit.wikimedia.org/r/451081