
Create reliable change stream for specific wiki
Closed, Resolved · Public

Description

Requirements:

  • Real-time - I can get changes from the wiki within a short time (<30 seconds) of when they happened, with that time defined as the moment the changes were committed to the database and became visible to users on the wiki.
  • Reliable - if I consume every individual change message in the stream, in the sequence the stream provides, I will know about all changes to the wiki content.
  • Seekable - I can connect to the stream at a predictable point in wiki history (either by timestamp or by RC ID) and get all the messages from there. At least 14 days of messages back should be available, but longer availability is not a must.
  • Resumable - I can disconnect from the stream, reconnect later, and resume consumption from the same point where I left off. The service should not require a constant connection for getting updates, and the stream after a disconnect and resume should be the same as if the connection had never been interrupted (apart from the obvious difference in message delivery times, etc.).
  • Scalable - there's no hard limit on the number of clients connecting, within the reasonable limits of infrastructure, networking, etc.
  • Stateless - the server does not keep per-client state, and the client always has all the information needed to continue stream consumption from the point where it stopped (this may not be very important as long as scalability is preserved).

Current use case:

Supplying update stream for Wikidata Query Service.

Delta for existing services:

API:Recentchanges

  • Not reliable - messages can be backfilled into the stream with timestamps many seconds in the past, which means sequential reading of the stream by timestamp would miss them (T161342). Even if reading by RC ID were implemented, parallel commits could still lead to backfilling and thus an unreliable stream.
  • Does not have events for page props updates (T145712), which happen asynchronously from the main article update.
  • May miss some deletion events if deletion is combined with revision hiding.

EventStreams/Kafka, as currently implemented

  • Not seekable by timestamp
  • Does not have data back more than 7 days


Event Timeline

Restricted Application added a subscriber: Aklapper.
Smalyshev triaged this task as Medium priority. Mar 29 2017, 6:26 PM
Anomie subscribed.

The pony being requested here is not going to happen in the action API, so I'm removing the tag. This is probably an effective duplicate of T152731: Implement server side filtering for EventStreams (if we should), because EventStreams is the most likely place for something like this to be implemented.

I think EventStreams is closest to the goal too, but I want to have a complete description of the pony for the record, so that we know what we need and what is missing. If and when it's implemented (T152731 covers part of it, but not all - we still need seeking and a longer backlog), we can close this task; being an epic, it doesn't need any patches dedicated to it.

@Smalyshev let's jump in a hangout sometime to discuss this more.

Just a few quick points:

Does not have data back more than 7 days

We could probably bump this up to 14 days for specific topics like recentchange.

Scalable - there's no hard limit on the number of clients connecting

EventStreams does not have a 'hard limit', but it definitely isn't intended to be used at wiki-reader scale. Service-update scale should be fine, though.

Not consumable per-wiki due to the lack of filtering (T152731)

We haven't done this because we are waiting for a strong use case. Implementing it will involve a lot of bike shedding, which we've put off so far. If you really need this, say so, and we will start the bike shedding!

Qs:

  • We have more than just recentchange in other schemas, and we can add more. If there is a desire, we can expose these in EventStreams. Do you have desire? :)
  • Does WDQS run in production? I think it does, right? If so, you may want to consider consuming from Kafka rather than EventStreams. Kafka consumers support parallelization and scale much better than the EventStreams HTTP Kafka consumer proxy.

let's jump in a hangout sometime to discuss this more.

Would be glad to. I'll try to set up something next week.

If there is a desire, we can expose these in EventStreams. Do you have desire? :)

Yes, see T145712 - recentchanges ignores pageprops updates, and it would be nice to have those too.

Does WDQS run in production?

Yes.

If so, you may want to consider consuming from Kafka rather than EventStreams.

I am considering this too, but I assume it's more code for me to write (perhaps wrongly; I haven't looked at it closely).

Kafka consumers support parallelization and scale much better than the EventStreams HTTP Kafka consumer proxy.

I may have written the requirements to ask for more than I need; they give the impression it has to serve Wikipedia-scale consumer load. But what in fact I need is something like 10-20 parallel consumers, that order of magnitude. So I don't really foresee too much of a problem here, as long as it doesn't require doing unindexed DB queries. I expect both Kafka and ES to be fine with it.

If so, you may want to consider consuming from Kafka rather than EventStreams.

I am considering this too, but I assume it's more code for me to write (perhaps wrongly; I haven't looked at it closely).

It will be more, a lot more. What language are you working in?

But what in fact I need is something like 10-20 parallel consumers

Load balanced parallel consumers, or all distinct consumers doing different stuff?

The nice thing about using a Kafka client directly is that you can subscribe to multiple topics/partitions in a single consumer group and have the load automatically balanced between them (up to the total number of topic-partitions). If any of those processes goes down, a different consumer process will be auto-assigned its work.
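As an illustration, a minimal sketch of such a load-balanced consumer group, assuming the Java Kafka client; the broker and topic names are the ones mentioned in this thread, and the group.id is a hypothetical example:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BalancedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092");
        // Every process started with this same group.id joins one consumer
        // group; Kafka splits the topic's partitions among the members and
        // reassigns them automatically if a member dies.
        props.put("group.id", "wdqs-updater");  // hypothetical group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("eqiad.mediawiki.revision-create"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d%n",
                            record.partition(), record.offset());
                }
            }
        }
    }
}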

It will be more, a lot more. What language are you working in?

The end consumer will be Java, but I don't want to consume the raw Kafka stream from Java, I'd rather have some intermediary that cleans up, deduplicates, etc. the changes.

Load balanced parallel consumers, or all distinct consumers doing different stuff?

All distinct. They will be pretty close to each other, but nothing forces them to consume exactly the same things; they are completely independent.

If any of those processes goes down, a different consumer process will be auto-assigned its work.

That's kind of the opposite of what I need :) Each client is independent and should get all the updates.
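For that mode, a minimal sketch of one such independent client, again assuming the Java Kafka client and the broker/topic names from this thread: instead of joining a shared consumer group, each client assigns itself all partitions directly, so every client sees every update regardless of what the others do.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IndependentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092");
        props.put("enable.auto.commit", "false"); // no group, so no Kafka-side offsets
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        String topic = "eqiad.mediawiki.revision-create";

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // assign() takes partitions directly, with no group coordination,
            // so this client is unaffected by other clients.
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(info -> new TopicPartition(topic, info.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions); // or seek to a stored position
            while (true) {
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(r -> System.out.println(r.value()));
            }
        }
    }
}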

I'd rather have some intermediary that cleans up, deduplicates, etc. the changes.

FYI, neither base Kafka Consumer clients nor EventStreams does this.

FYI, neither base Kafka Consumer clients nor EventStreams does this.

Yes, I know :) It's one of the decisions I still haven't figured out - how much I can/should do on the backend so I don't have to do it on the client, versus sending the client the raw firehose output and letting it do all the work. The minimal requirement is reliability (which I don't have now), but I am also interested in better performance, such as making clients download and process less irrelevant data.

Ping @Smalyshev - is this still needed? Maybe we should set up a short 30-minute sync-up.

@Nuria yes, still very much needed and unsolved. Please feel welcome to set up a meet.

From meeting:

  • @Smalyshev can consume from either Kafka or EventStreams once we add the ability to consume from a given point in time; this is what is meant by "seekable" (on the new Kafka cluster, next quarter, Q1).
  • Keeping data for longer than 7 days is not an issue for topics as small as these. The volume of events is less than 100 per second.
  • The ability to filter by wiki is not a blocker; given the volume of events, filtering is possible client-side.

Action Item: @Nuria and @Ottomata to ping once new time-based consumption is enabled in Kafka.

As a result of the discussion, we've arrived at the following conclusions:

  • Once we have a Kafka version installed that allows starting by timestamp, we can create a prototype that takes recent changes from either Kafka or EventStreams.
  • We need to evaluate whether an unfiltered stream will hurt performance; the current assumption is that it won't.
  • We may need to store the Kafka offsets of processed messages in the DB, and will probably need to add code to do this, since right now RdfRepository does not get any such data from the change source or the Change class (see the sketch below).
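A minimal sketch of that externally stored offset pattern, assuming the Java Kafka client; loadOffsetFromDb and saveOffsetToDb are hypothetical stand-ins for whatever persistence the updater would actually use:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DbOffsetConsumer {
    // Hypothetical persistence helpers; a real implementation would store the
    // offset in the same DB transaction as the processed change.
    static long loadOffsetFromDb(TopicPartition tp) { return 0L; }
    static void saveOffsetToDb(TopicPartition tp, long offset) { /* DB write */ }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092");
        props.put("enable.auto.commit", "false"); // offsets live in our DB, not Kafka
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("eqiad.mediawiki.revision-create", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, loadOffsetFromDb(tp)); // resume where we left off
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // process the change, then persist offset + 1 as the next
                    // position to resume from after a restart
                    saveOffsetToDb(tp, record.offset() + 1);
                }
            }
        }
    }
}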

@Ottomata, @Nuria what's the status on seekable Kafka streaming - do we have necessary infrastructure now?

Unfortunately not yet! We are very close...the cluster is up and running, but porting clients has been blocked on getting proper keys and certificates for SSL support for a long time now. SSL is finally moving now, so we should be able to start porting clients over soon. We have a goal of getting some of the misc varnish traffic ported to the new cluster this quarter, and will likely (not totally sure) make a goal of getting all remaining clients ported next quarter.

Sorry this is taking so long! Lots of moving parts...

@Ottomata Could @Smalyshev do a test consuming from the new cluster, with the understanding that it is not yet productionized, to make sure it fits the use cases?

Sure, I suppose! You can connect to it with a Kafka client now. The Kafka brokers are kafka-jumbo100[1-6].eqiad.wmnet:9092

I think you are most interested in the eqiad.mediawiki.revision-create topic. I haven't tried it yet at all, but these topics should have a broker-received timestamp index on them. I am not yet familiar with the Kafka client APIs that know how to consume from timestamps, but they should be out there.

@Ottomata thanks, I can connect to the hosts above, but still not sure how to control the starting point. I'll try to look around for clients that can do this.

@Ottomata thanks, I can connect to the hosts above, but still not sure how to control the starting point. I'll try to look around for clients that can do this.

The Java client has offsetsForTimes implemented and supports seeking to an offset. The same is supported by librdkafka, and most of the language-specific clients are built atop librdkafka, so I would expect a lot of them to already support this functionality.
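For the record, a minimal sketch of seeking by timestamp with the Java client's offsetsForTimes, using the broker and topic names from this thread; the seven-day lookback is an arbitrary example:

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekByTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("eqiad.mediawiki.revision-create", 0);
        long startMs = System.currentTimeMillis() - Duration.ofDays(7).toMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // offsetsForTimes maps each partition to the earliest offset whose
            // timestamp is >= the requested time (null if no such message).
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, startMs));
            OffsetAndTimestamp oat = offsets.get(tp);
            if (oat != null) {
                consumer.seek(tp, oat.offset());
            }
            // consumer.poll(...) now returns records starting from that point.
        }
    }
}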

@Pchelolo thanks for the pointer, this is very helpful!

Indeed, kafkacat for example has supported it for about a year. However, it looks like we have this version of kafkacat:

Copyright (c) 2014-2015, Magnus Edenhill
Version KAFKACAT_VERSION (JSON) (librdkafka 0.9.3)

which doesn't seem to have it yet (checked on stat1005). Maybe we could upgrade to 1.3.1.

Indeed. Created T182163 for updating kafkacat, but it is a bit more complex than simply installing a new version - it depends on a newer librdkafka, so we first should upgrade and test that.

You can easily 'quickbuild' kafkacat with a statically linked librdkafka. I've just done this on a stretch labs host, and copied the kafkacat binary to stat1005 at /home/otto/kafkacat. Try it out!

/home/otto/kafkacat runs fine, but -Q seems to return this for everything:

eqiad.mediawiki.revision-create [0] offset -1

Maybe I'm doing something wrong?

Hm, actually if I just try to consume from that topic (any topic, actually) with -F "%T", which should give me message timestamps, it gives -1 as well.

Seems like the mirroring is done by 0.9 MirrorMaker and timestamp handling was added only in 0.10 MirrorMaker.

I got the same doing:

/home/otto/kafkacat -Q -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create:0:1512687299 -Xdebug=all

Seems like the mirroring is done by 0.9 MirrorMaker and timestamp handling was added only in 0.10 MirrorMaker.

Hm, ya but I had thought that if a timestamp was not set by the producer, it would be set to server receive time. Maybe I was wrong!

Change 396439 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set default topic timestamp.type to LogAppendTime

https://gerrit.wikimedia.org/r/396439

Change 396439 merged by Ottomata:
[operations/puppet@production] Set default topic timestamp.type to LogAppendTime

https://gerrit.wikimedia.org/r/396439

Woot, that did it ^. We need topics to default to LogAppendTime.

[@stat1005:/home/otto] $ ./kafkacat -Q -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.mediawiki.revision-create:0:1512759190000
eqiad.mediawiki.revision-create [0] offset 3658631
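The same thing can be double-checked from the Java client; a minimal sketch, under the assumption of a hypothetical group name, where each consumed record should now report LOG_APPEND_TIME with a real broker timestamp:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimestampCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092");
        props.put("group.id", "timestamp-check");  // hypothetical group name
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("eqiad.mediawiki.revision-create"));
            ConsumerRecords<String, String> records;
            do {
                records = consumer.poll(Duration.ofSeconds(5));
            } while (records.isEmpty()); // first polls may be empty during rebalance
            for (ConsumerRecord<String, String> record : records) {
                // Expect LOG_APPEND_TIME and the broker receive time in ms.
                System.out.println(record.timestampType() + " " + record.timestamp());
            }
        }
    }
}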

Nice. Can @Smalyshev check whether consuming from these topics, as they are now set up, would work for his purposes?

So, FYI, the timestamps as they are now are the times at which the jumbo-eqiad Kafka cluster received the messages. These are replicated from the main-eqiad cluster and might have a short delay (usually seconds, minutes at most).

Eventually (the work is not planned yet) we will upgrade the other Kafka clusters, and also configure producers to set timestamps based on content time, e.g. the revision create timestamp, instead.

@Smalyshev OK, we aim to have the cluster handling all prod traffic by the end of next quarter; until then it will be mirroring data, which I think should be sufficient for you to get started on the WDQS consumer? Correct me if I am wrong.

@Nuria yes mostly, though I do have some questions, maybe we should set up a short meeting to discuss them?

Ping @Smalyshev - now that you have a reliable stream on the new Kafka cluster (one that supports time-based consumption), are there any other blockers on your end?

@Nuria I don't see any immediate blockers so far.

Oh yes, @Smalyshev - and in case you didn't see, we also increased the retention of mediawiki topics to 31 days in the main Kafka clusters.