
Add a custom watermark generator that supports low rate streams
Closed, Resolved · Public

Description

When mixing incoming streams with different rates (e.g. one high-volume stream and another with rare messages), the BoundedOutOfOrdernessTimestampExtractor we use is not designed to cope with such a configuration: it requires a continuous flow of events on the stream to emit "sane" watermarks.

If a stream emits no messages for a long time, the whole pipeline might get stuck waiting for a message from that stream to advance the low watermark. The reason is that the default watermark generators in Flink do not take responsibility for deciding whether a stream is simply late or inactive.
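To illustrate the stall, here is a minimal sketch of how an operator with several inputs might combine their watermarks. It assumes the usual "minimum over inputs" rule and deliberately does not use Flink classes; `WatermarkCombiner`, `Input` and `combine` are hypothetical names introduced only for this example.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical stand-alone model, not Flink code.
final class WatermarkCombiner {

    /** Last watermark reported by one input, plus an "idle" flag. */
    static final class Input {
        final long watermarkMs;
        final boolean idle;

        Input(long watermarkMs, boolean idle) {
            this.watermarkMs = watermarkMs;
            this.idle = idle;
        }
    }

    /**
     * Combined watermark = minimum over the inputs that are not idle.
     * Returns empty when every input is idle (a choice made for this sketch).
     */
    static OptionalLong combine(List<Input> inputs) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (Input in : inputs) {
            if (in.idle) {
                continue; // idle inputs no longer hold the watermark back
            }
            anyActive = true;
            min = Math.min(min, in.watermarkMs);
        }
        return anyActive ? OptionalLong.of(min) : OptionalLong.empty();
    }
}
```

With a busy input at 50000 and a quiet input last seen at 1000, the combined watermark stays at 1000 until the quiet input either advances or is flagged idle; once it is flagged idle, the combined value moves to 50000.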

There seem to be two solutions to address this problem:

  • Have a watermark generator that also inspects the processing time and emits a watermark of (processing time - X sec) if no events have been seen in a while. This will unblock downstream operators, such as window functions, that wait for the low watermark before firing the windows that have started to accumulate.
  • Investigate using org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext#markAsTemporarilyIdle (wrap FlinkKafkaConsumer?)
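A rough sketch of the first option, written as plain Java without Flink dependencies so the logic can be read in isolation. The class name, its parameters (`maxOutOfOrdernessMs`, `idleTimeoutMs`, `idleLagMs` standing in for "X sec") and the method names are all hypothetical, and the sketch assumes event timestamps are epoch milliseconds comparable to the processing-time clock.

```java
// Hypothetical stand-alone model of an idle-aware watermark generator.
final class IdleAwareWatermarkGenerator {
    private final long maxOutOfOrdernessMs; // bounded out-of-orderness while events flow
    private final long idleTimeoutMs;       // how long without events counts as "a while"
    private final long idleLagMs;           // the "X sec" subtracted from processing time

    private long maxEventTimestampMs = Long.MIN_VALUE;
    private long lastEventProcessingTimeMs;
    private long lastEmittedMs = Long.MIN_VALUE;

    IdleAwareWatermarkGenerator(long maxOutOfOrdernessMs, long idleTimeoutMs,
                                long idleLagMs, long nowMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        this.idleTimeoutMs = idleTimeoutMs;
        this.idleLagMs = idleLagMs;
        this.lastEventProcessingTimeMs = nowMs;
    }

    /** Record an event: track the highest event time and when it was seen. */
    void onEvent(long eventTimestampMs, long nowMs) {
        maxEventTimestampMs = Math.max(maxEventTimestampMs, eventTimestampMs);
        lastEventProcessingTimeMs = nowMs;
    }

    /** Called periodically; returns the watermark to emit, never going backwards. */
    long onPeriodicEmit(long nowMs) {
        long candidate;
        if (nowMs - lastEventProcessingTimeMs >= idleTimeoutMs) {
            // Stream looks inactive: fall back to processing time minus the lag.
            candidate = nowMs - idleLagMs;
        } else if (maxEventTimestampMs == Long.MIN_VALUE) {
            // No event seen yet and not idle for long enough: nothing to report.
            candidate = Long.MIN_VALUE;
        } else {
            candidate = maxEventTimestampMs - maxOutOfOrdernessMs;
        }
        lastEmittedMs = Math.max(lastEmittedMs, candidate);
        return lastEmittedMs;
    }
}
```

One trade-off worth noting: once the processing-time fallback has advanced the watermark, an event arriving afterwards with an older timestamp is treated as late, so the lag would need to be chosen with the expected delay of the low rate stream in mind.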

AC:

  • When a low-rate stream is added, the operators relying on low watermarks must not wait for events from this stream to trigger their operations.

size: M

Event Timeline

dcausse created this task. Jul 2 2020, 8:17 AM
Restricted Application added a project: Wikidata. Jul 2 2020, 8:17 AM
Restricted Application added a subscriber: Aklapper.
Zbyszko added a subscriber: Zbyszko. Jul 7 2020, 1:35 PM

Apache Flink 1.11 has been released, and with it (https://issues.apache.org/jira/browse/FLINK-17653) we should be able to assign different watermark assigners.

Change 625931 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Switch to WatermarkStrategy and support idleness

https://gerrit.wikimedia.org/r/625931

dcausse claimed this task. Tue, Sep 8, 3:54 PM
dcausse moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Change 625931 merged by jenkins-bot:
[wikidata/query/rdf@master] Switch to WatermarkStrategy and support idleness

https://gerrit.wikimedia.org/r/625931

Gehel closed this task as Resolved. Mon, Sep 14, 1:07 PM