
Add VSL error counters to Varnishkafka stats
Closed, Declined · Public

Description

Sometimes Varnishkafka emits incomplete records that are dispatched to Kafka and stored in the Hadoop cluster. When this happens, the Analytics data consistency checks will detect the anomaly and alert via email.

Previous related task: T148412

Most of the time the incomplete records are generated by the varnishapi library (which Varnishkafka uses) because a time or space limit has been breached. The error message can be retrieved from the VSL tag and takes one of two forms:

  1. timeout
  2. store overflow

It would be really valuable to have some trace of these errors when they happen. I can think of two options:

  1. Add the VSL tag to the webrequest JSON structure that gets sent to Kafka. This would be the most accurate option, but it could represent a headache for the Analytics team.
  2. Add VSL error counters to the JSON emitted by Varnishkafka, which we store under /var/cache/varnishkafka on our caching hosts (these values are then sent to Graphite; see the sketch after the example below). For example:
Before
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218 } }

After
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218, vsl_error_store_overflows: 42, vsl_error_timeouts: 2 } }

Event Timeline

Restricted Application added a subscriber: Aklapper.

Why not both!? :)

Yes! I was concerned that the new field would be a bit too much, but if we are OK with the new format we can do both. Maybe rather than vsl_error_store_overflows and vsl_error_timeouts we could simply have a generic vsl_error?
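
For reference, the generic variant suggested here could look roughly like the following, reusing the fields from the Before/After example in the description (the 44 is simply the sum of the two illustrative counters shown there):

{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218, "vsl_error":44 } }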

Let's (as a first step) send these errors to Graphite.

Milimetric moved this task from Dashiki to Incoming on the Analytics board.
mforns lowered the priority of this task from Medium to Low. · Apr 16 2018, 4:26 PM

Given that we are moving to ATS (Apache Traffic Server), I'd decline this task :)