Add VSL error counters to Varnishkafka stats
Closed, Declined · Public

Assigned To

Authored By

	elukey
	May 2 2017, 10:58 AM

Description

Sometimes Varnishkafka emits incomplete records that are dispatched to Kafka and stored in the Hadoop cluster. When this happens, the Analytics data consistency checks will detect the anomaly and alert via email.

Previous related task: T148412

Most of the time the incomplete records are generated by the varnishapi (which Varnishkafka uses) because a time or space limit has been breached. The error log message can be retrieved from the VSL tag and takes one of two forms:

timeout
store overflow
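A minimal sketch of how the two forms could be told apart and counted. It assumes each error message arrives as a plain string containing either "timeout" or "store overflow". The function name and the mapping are hypothetical, the counter names are borrowed from the proposal in this task, and Varnishkafka itself is written in C, so this is only an illustration of the idea:

```python
# Hypothetical mapping from the VSL error text to a counter name.
VSL_ERROR_COUNTERS = {
    "timeout": "vsl_error_timeouts",
    "store overflow": "vsl_error_store_overflows",
}


def count_vsl_errors(messages):
    """Count VSL error messages by kind.

    Messages that match neither known form are ignored.
    """
    counters = {name: 0 for name in VSL_ERROR_COUNTERS.values()}
    for message in messages:
        text = message.strip().lower()
        for needle, name in VSL_ERROR_COUNTERS.items():
            if needle in text:
                counters[name] += 1
                break  # one message increments at most one counter
    return counters
```

Matching is case-insensitive and substring-based on purpose, since the exact wording around the two keywords is not specified here.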

It would be really valuable to have some trace of these errors when they happen. I can think of two options:

Add the VSL tag to the webrequest JSON structure that gets sent to Kafka. This would be the most accurate option, but it could be a headache for the Analytics team.

Add VSL error counters to the JSON emitted by Varnishkafka, which we store under /var/cache/varnishkafka on our caching hosts (these values are then sent to Graphite). For example:

Before
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218 } }

After
{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, "kafka_drerr":0, "trunc":0, "seq":1365036218, "vsl_error_store_overflows":42, "vsl_error_timeouts":2 } }
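As a rough illustration of the second option, the sketch below takes a stats line in the "Before" shape and produces one in the "After" shape. The helper name is hypothetical and this is not how Varnishkafka builds its stats (it is written in C); the example only shows the intended change to the JSON structure:

```python
import json


def add_vsl_counters(stats_line, store_overflows, timeouts):
    """Return the stats line with the two proposed VSL error counters added.

    stats_line is a JSON string with a top-level "varnishkafka" object,
    as in the "Before" example.
    """
    stats = json.loads(stats_line)
    stats["varnishkafka"]["vsl_error_store_overflows"] = store_overflows
    stats["varnishkafka"]["vsl_error_timeouts"] = timeouts
    return json.dumps(stats)


before = (
    '{ "varnishkafka": { "time":1493706310, "tx":0, "txerr":0, '
    '"kafka_drerr":0, "trunc":0, "seq":1365036218 } }'
)
after = add_vsl_counters(before, store_overflows=42, timeouts=2)
```

Existing fields are left untouched, so anything that already reads the current stats should keep working as long as it tolerates unknown keys.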

Related Objects

Mentioned Here: T148412: Varnishlog with Start timestamp but no Resp one causing data consistency check alarms

Event Timeline

elukey created this task. May 2 2017, 10:58 AM

Restricted Application added a project: SRE. May 2 2017, 10:58 AM

Restricted Application added a subscriber: Aklapper.

ema moved this task from Backlog to Caching on the Traffic board. May 2 2017, 11:01 AM

ema subscribed.

+1 for that!
Thanks @elukey for raising this.

Why not both!? :)

In T164259#3227754, @Ottomata wrote:

Why not both!? :)

Yes! I was concerned that the new field would be a bit too much, but if we are OK with the new format we can do both. Maybe rather than vsl_error_store_overflows and vsl_error_timeouts we can simply have a generic vsl_error?

Nuria moved this task from Incoming to Backlog (Later) on the Analytics board. May 4 2017, 4:26 PM

elukey added a project: User-Elukey. May 16 2017, 1:04 PM

Let's (as a first step) send these errors to Graphite.

elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board. May 18 2017, 4:42 PM

Nuria moved this task from Backlog (Later) to Dashiki on the Analytics board. Jan 11 2018, 5:37 PM

elukey moved this task from Analytics Backlog to Backlog on the User-Elukey board. Feb 16 2018, 12:01 PM

Milimetric moved this task from Dashiki to Incoming on the Analytics board. Apr 2 2018, 3:32 PM

Nuria moved this task from Incoming to Operational Excellence on the Analytics board. Apr 5 2018, 5:12 PM

mforns lowered the priority of this task from Medium to Low. Apr 16 2018, 4:26 PM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 9:13 PM

Given that we are moving to ATS, I'd decline this task :)
