
Add VSL error counters to Varnishkafka stats
Closed, Declined · Public

Description

Sometimes Varnishkafka emits incomplete records that are dispatched to Kafka and stored in the Hadoop cluster. When this happens, the Analytics data consistency checks will detect the anomaly and alert via email.

Previous related task: T148412

Most of the time, the incomplete records are generated by the varnishapi (which Varnishkafka uses) because a time or space limit has been exceeded. The error log message can be retrieved from the VSL tag and can take two forms:

  1. timeout
  2. store overflow
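A minimal sketch of how these two messages could be tallied, assuming the VSL payload is exactly one of the two strings listed above. The function name `count_vsl_errors` is hypothetical and the counter names are the ones proposed later in this task; Varnishkafka itself is written in C, so this only illustrates the counting logic:

```python
# Assumption: the VSL error payload is exactly "timeout" or "store overflow",
# as listed in this task. The counter names follow the proposal in this task.
VSL_ERROR_COUNTERS = {
    "timeout": "vsl_error_timeouts",
    "store overflow": "vsl_error_store_overflows",
}


def count_vsl_errors(payloads):
    """Tally VSL error payloads into per-type counters (hypothetical helper)."""
    counters = {name: 0 for name in VSL_ERROR_COUNTERS.values()}
    for payload in payloads:
        name = VSL_ERROR_COUNTERS.get(payload.strip().lower())
        if name is not None:
            # Payloads that match neither known form are ignored.
            counters[name] += 1
    return counters
```

Summing the two values would give the single generic counter, if one field is preferred over two.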

It would be really valuable to have some trace of these errors when they happen. I can think of two options:

  1. Add the VSL tag to the webrequest JSON structure that gets sent to Kafka. This would be the most accurate option, but it could be a headache for the Analytics team.
  2. Add VSL error counters to the JSON emitted by Varnishkafka, which we store under /var/cache/varnishkafka on our caching hosts (these values are then sent to graphite). For example:
Before
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218 } }

After
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218, "vsl_error_store_overflows":42, "vsl_error_timeouts":2 } }
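As a rough illustration of the second option, the sketch below turns one stats line (the proposed "After" payload, written as valid JSON) into lines in Graphite's plaintext format, `path value timestamp`. The function name `stats_to_graphite` and the metric prefix `varnishkafka.example` are assumptions made for this example, not what the production pipeline uses:

```python
import json


def stats_to_graphite(line, prefix="varnishkafka.example"):
    """Convert one Varnishkafka stats JSON line into Graphite plaintext lines.

    Hypothetical helper: the prefix is a placeholder, not a real metric path.
    """
    stats = json.loads(line)["varnishkafka"]
    timestamp = stats["time"]
    # Every field other than "time" becomes one metric line.
    return [
        f"{prefix}.{key} {value} {timestamp}"
        for key, value in stats.items()
        if key != "time"
    ]


sample = (
    '{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, '
    '"kafka_drerr":0, "trunc":0, "seq":1365036218, '
    '"vsl_error_store_overflows":42, "vsl_error_timeouts":2 } }'
)
metric_lines = stats_to_graphite(sample)
```

With this shape the new counters need no special handling: they are picked up like any other field in the stats object.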

Event Timeline

elukey created this task. · May 2 2017, 10:58 AM
Restricted Application added a project: Operations. · May 2 2017, 10:58 AM
Restricted Application added a subscriber: Aklapper.
ema moved this task from Triage to Caching on the Traffic board. · May 2 2017, 11:01 AM
ema added a subscriber: ema.

+1 for that!
Thanks @elukey for raising this.

Why not both!? :)

elukey added a comment. · May 2 2017, 2:42 PM

Why not both!? :)

Yes! I was concerned that the new field would be a bit too much, but if we are OK with the new format we can do both. Maybe rather than vsl_error_store_overflows and vsl_error_timeouts we can simply have a generic vsl_error?

Nuria moved this task from Incoming to Backlog (Later) on the Analytics board. · May 4 2017, 4:26 PM
Nuria added a subscriber: Nuria. · May 16 2017, 1:15 PM

Let's (as a first step) send these errors to graphite.

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. · May 18 2017, 4:42 PM
Nuria moved this task from Backlog (Later) to Dashiki on the Analytics board. · Jan 11 2018, 5:37 PM
Milimetric moved this task from Dashiki to Incoming on the Analytics board. · Apr 2 2018, 3:32 PM
mforns lowered the priority of this task from Medium to Low. · Apr 16 2018, 4:26 PM
elukey closed this task as Declined. · Jan 3 2020, 10:58 AM

Given that we are moving to ATS, I'd decline this task :)