Page MenuHomePhabricator

Investigate why gobblin pulls webrequest data late
Open, HighPublic

Description

Since we moved to gobblin we experience regular webrequest data-loss warnings/errors. Those are due to webrequest data for an hour arriving after the webrequest-refine job has done it's validation step.
This task is to investigate why this happens.

Event Timeline

Finding: The problem comes from some Gobblin tasks not pulling data!

Gobblin distributes the work of pulling data into tasks. In our webrequest setup each task is responsible to pull from one kafka partition, and is executed in its own yarn container.
Some of the tasks sometimes pull no data despite having configuration setup expecting to pull data (the expected end kafka-offset is greater than the start kafka-offset for the task).
Those tasks pulling no data don't fail and keep their end offset to the one of their previous task (no change in offset for the partitions), meaning that there is no data drop, as the next gobblin run will be asked to pull from the correct point.
This breaks our _IMPORTED flag mechanism, which looks at files written, and in this case no file is written, making the problem invisible to the flagger.

I tracked down the reason for the Gobblin task not pulling data to this line of the code:

ConsumerRecords<K, V> consumerRecords = consumer.poll(super.fetchTimeoutMillis);

https://github.com/apache/gobblin/blob/master/gobblin-modules/gobblin-kafka-1/src/main/java/org/apache/gobblin/kafka/client/Kafka1ConsumerClient.java#L164

where the records to be pulled from Kafka are first fetched using the kafka-driver-consumer poll(timeout) method (https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#poll-long-), which doesn't fail in case of timeout but returns an empty list of records!

There is a parameter in Gobblin config to tweak the kafka-fetch timeout, it defaults to 1s. A patch updating this to 5s will follow this comment. This solution doesn't solve the main problem of our flag-generation not being resilient to special cases (this one, and the one described in T286343

Change 720317 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update Gobblin kafka fetch timeout to 5s

https://gerrit.wikimedia.org/r/720317

odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

Change 720317 merged by Joal:

[analytics/refinery@master] Update Gobblin kafka fetch timeout to 5s

https://gerrit.wikimedia.org/r/720317