Sometimes Varnishkafka emits incomplete records, which are then dispatched to Kafka and stored in the Hadoop cluster. When this happens, the Analytics data consistency checks detect the anomaly and alert via email.
Previous related task: T148412
Most of the time the incomplete records are generated by the varnishapi library (which Varnishkafka uses) because a time or space limit has been breached. The error message is carried in the VSL tag and can take two forms:
- timeout
- store overflow
It would be really valuable to have some trace of these errors when they happen. I can think of two options:
- Add the VSL tag to the webrequest JSON structure that gets sent to Kafka. This would be the most accurate option, but it could be a headache for the Analytics team.
- Add VSL error counters to the JSON emitted by Varnishkafka, which we store under /var/cache/varnishkafka on our caching hosts (these values are then sent to Graphite). For example:

Before:
```
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218 } }
```

After:
```
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218, "vsl_error_store_overflows":42, "vsl_error_timeouts":2 } }
```