Data loss registered by Oozie for 2016-12-7-18:
elukey@stat1002:/mnt/hdfs$ cat wmf/data/raw/webrequests_data_loss/upload/2016/12/7/18/WARNING/000000_0 0 requests (0.0% of total) have incomplete records. 4009853 requests (2.043% of valid ones) were lost. elukey@stat1002:/mnt/hdfs$ cat wmf/data/raw/webrequests_data_loss/text/2016/12/7/18/WARNING/000000_0 249 requests (0.0% of total) have incomplete records. 6490442 requests (2.726% of valid ones) were lost. elukey@stat1002:/mnt/hdfs$ cat wmf/data/raw/webrequests_data_loss/misc/2016/12/7/18/WARNING/000000_0 0 requests (0.0% of total) have incomplete records. 3377 requests (1.277% of valid ones) were lost. elukey@stat1002:/mnt/hdfs$ cat wmf/data/raw/webrequests_data_loss/maps/2016/12/7/18/WARNING/000000_0 0 requests (0.0% of total) have incomplete records. 30210 requests (2.321% of valid ones) were lost.
No kafka errors registered in https://grafana.wikimedia.org/dashboard/db/varnishkafka?var-instance=webrequest, plus we have the following two settings for librdkafka that should help:
# The topic partition leader needs to ack the msg batch kafka.topic.request.required.acks = 1 # Maximum number of retries per messageset. kafka.message.send.max.retries = 3
I checked cp3033 varnishkafka-webrequest logs and I can see errors like the following for the time of the alarms:
Dec 07 18:47:12 cp3033 varnishkafka[1899]: KAFKAERR: Kafka error (-195): kafka1012.eqiad.wmnet:9092/12: Connect to ipv4#10.64.5.12:9092 failed: Connection refused Dec 07 19:00:25 cp3033 varnishkafka[1899]: %3|1481137225.865|FAIL|varnishkafka#producer-1| kafka1012.eqiad.wmnet:9092/12: Receive failed: Disconnected
I can see the following reports for librdkafka (quick glance):
https://github.com/edenhill/librdkafka/issues/736
https://github.com/edenhill/librdkafka/issues/437
So this could be due to:
- New alarms behaving in a unexpected way during Kafka restarts reporting false positives.
- New alarms more precise than the previous ones and reporting an issue that has been ongoing for a while.
About 2), we might have to check github issues for librdkafka, meanwhile for 1) we could check the distribution of the data loss (for example, are all the sequence numbers dropped contiguous or not?).