
Increase retention for mediawiki.revision-create on the kafka jumbo cluster
Closed, ResolvedPublic

Description

Generating the initial state for the WDQS streaming updater requires parsing the TTL dumps (all and lexemes). On its first start, the Kafka consumer of mediawiki.revision-create needs to be positioned at the offsets corresponding to the time the dump was started, so that it captures everything that could have been created after the corresponding data was written to the dump.
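
For reference, a minimal sketch of how the consumer could be positioned at the dump-start time using the plain Kafka Java client (the Flink Kafka connector exposes an equivalent start-from-timestamp option). The broker address, consumer group and timestamp below are placeholders, not the actual updater configuration:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Instant;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class RevisionCreateBootstrap {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-jumbo1001.eqiad.wmnet:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "wdqs-updater-bootstrap");                     // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        String topic = "eqiad.mediawiki.revision-create";
        long dumpStartMs = Instant.parse("2020-05-18T00:00:00Z").toEpochMilli(); // placeholder dump-start time

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One entry per partition of the topic, all pointing at the dump-start timestamp.
            Map<TopicPartition, Long> timestamps = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toMap(tp -> tp, tp -> dumpStartMs));

            consumer.assign(timestamps.keySet());

            // Ask the brokers for the earliest offset whose record timestamp is >= dumpStartMs.
            // This only works while those offsets are still within the topic's retention window.
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(timestamps);
            offsets.forEach((tp, ot) -> {
                if (ot != null) {
                    consumer.seek(tp, ot.offset());
                } else {
                    // No record at or after the timestamp; start at the end of the partition.
                    consumer.seekToEnd(Collections.singleton(tp));
                }
            });
            // From here, poll() returns every revision-create event produced since the dump started.
        }
    }
}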

The way we generate the dump and the time required to make it available in HDFS make it difficult to work within the current 7-day retention period.

As a first test we plan to use jumbo; increasing the retention on this topic there to 30 days would make it easier to start testing the Flink pipeline.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-analytics) [2020-05-27T13:42:02Z] <ottomata> increased Kafka topic retention in jumbo-eqiad to 31 days for (eqiad|codfw).mediawiki.revision-create - T253753

Did 31 days:

$ kafka topics --alter --topic eqiad.mediawiki.revision-create --config retention.ms=2678400000
$ kafka topics --alter --topic codfw.mediawiki.revision-create --config retention.ms=2678400000
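# retention.ms = 2678400000 = 31 days * 86400 s/day * 1000 ms/s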

$ kafka topics --describe --topic '^(eqiad|codfw).mediawiki.revision-create$'
Topic:codfw.mediawiki.revision-create	PartitionCount:1	ReplicationFactor:3	Configs:retention.ms=2678400000
	Topic: codfw.mediawiki.revision-create	Partition: 0	Leader: 1002	Replicas: 1002,1003,1004	Isr: 1004,1002,1003
Topic:eqiad.mediawiki.revision-create	PartitionCount:1	ReplicationFactor:3	Configs:retention.ms=2678400000
	Topic: eqiad.mediawiki.revision-create	Partition: 0	Leader: 1003	Replicas: 1003,1002,1005	Isr: 1002,1005,1003

An idea: how about sending the update stream back to Kafka and raising the retention on THAT topic instead?
Moving retention to 30 days for revision-create will keep a lot of data that isn't needed (about half of it), while keeping only the updates should be enough.
Just an idea :)

@JAllemandou I think that is an option as well; the thing is that this is transitional, to help bootstrap a test of the full pipeline. In the end we won't be using jumbo, and thus won't be able to rely on a 30-day retention on main, so hopefully we'll be able to reset the retention back to 7 days once we're done with the test.
To circumvent this particular problem (time to make the dumps available > retention) we could either:

  • send the events that matter back to kafka and have a higher retention on that topic, like you suggest (see the sketch after this list)
  • create a dedicated job running on the analytics network to read the events stored in HDFS and figure out a way to make the resulting data available in kafka main
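
For the first option, a rough sketch of what creating a dedicated update-stream topic with a longer retention could look like using the Kafka AdminClient; the topic name, partition/replication counts and broker address are made up for illustration only:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Properties;

public class CreateUpdateStreamTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-jumbo1001.eqiad.wmnet:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic holding only the filtered update events, with 31-day retention.
            NewTopic topic = new NewTopic("eqiad.wdqs.update-stream", 1, (short) 3)
                    .configs(Collections.singletonMap(TopicConfig.RETENTION_MS_CONFIG, "2678400000")); // 31 days in ms
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
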
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Given that this retention setting is not in Puppet, is it communicated to a new node when it joins the cluster by the partition leader, or something similar?