As we all know, the time has come to start thinking about what is needed to replace the Varnish frontends with ATS. For Analytics, this means replacing varnishkafka.
Varnishkafka currently reads details about an HTTP request from the Varnish shared memory, assembles a JSON string and sends it to Kafka (via librdkafka). In the ATS world, IIUC, a special logger can be configured to send a string (containing data about an HTTP request) to a named pipe, which in turn is collected and exposed by fifo-log-demux via a local socket. In theory the varnishkafka replacement should do the following:
- read from the socket exposed by fifo-log-demux
- send to Kafka via librdkafka
- publish metrics about traffic sent to / dropped from Kafka, etc. (see https://grafana.wikimedia.org/dashboard/db/varnishkafka)
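The first two steps could be sketched roughly as follows. This is only a sketch under assumptions: the socket path is a placeholder, and the Kafka-producing side is reduced to a comment, since the exact fifo-log-demux record framing still needs to be confirmed (newline-delimited is assumed here).

```python
import socket


def iter_log_lines(chunks):
    """Reassemble newline-delimited log records from raw byte chunks
    (stream sockets do not preserve record boundaries)."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line:
                yield line


def socket_chunks(path, bufsize=65536):
    """Yield raw byte chunks from the local socket exposed by fifo-log-demux."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(path)
    try:
        while True:
            chunk = s.recv(bufsize)
            if not chunk:
                break
            yield chunk
    finally:
        s.close()


# Usage (the path and topic are placeholders, to be adjusted to the
# real configuration):
#   for line in iter_log_lines(socket_chunks("/var/run/fifo-log-demux.sock")):
#       producer.produce("webrequest", line)  # a librdkafka producer
```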
High level things to reason about:
- ATS is currently able to produce a string to a certain logger, formatted in any way, even as JSON. This work is currently done by Varnishkafka, which, after reading from shm, explicitly encodes the collected data into valid JSON (taking care of things like escaping, etc.). Ideally, to keep things simple, ATS could produce the JSON representation of an HTTP request directly to its logger, so the new tool would only need to read it and deliver it to Kafka. We need to investigate whether this is possible or whether some corner cases are left out (say, weird escaping, etc.).
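While that investigation happens, the new tool could defend itself with a cheap validity check on each record before producing it, so that escaping corner cases show up as a counted metric rather than as corrupt data downstream. A minimal sketch (stdlib only):

```python
import json


def parse_json_line(line):
    """Return the parsed record if `line` is valid JSON, else None.

    If ATS is configured to emit JSON directly, the new tool can use a
    guard like this to count (and not forward) malformed records,
    surfacing escaping corner cases early instead of silently shipping
    broken payloads to Kafka.
    """
    try:
        return json.loads(line)
    except ValueError:
        return None
```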
- Varnishkafka currently produces metrics to a JSON log file, containing two kinds of data:
- librdkafka internal metrics (TLS latency, msgs sent, etc..)
- internal metrics, such as how many times the librdkafka delivery callback has reported data not delivered to Kafka
These metrics need to be preserved in the new tool; they are vital for Analytics. The new implementation should produce the same metrics in a way that allows us to distinguish between those of varnishkafka and those of the new system. For example, we currently have rdkafka_producer_msg_cnt{cluster="cache_text", source="webrequest"}. We should add a new label indicating the implementation (e.g. software="varnishkafka").
- The new tool should support Prometheus natively
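To make the labeling idea concrete, here is a sketch of the extra `software` label in the Prometheus exposition format. In practice a proper client library (e.g. prometheus_client) would render this; the manual formatter below just illustrates the resulting series, and "atskafka" as the new tool's label value is purely a placeholder.

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(
        '{}="{}"'.format(k, v) for k, v in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_str, value)


# The existing series, plus a `software` label to tell the two
# implementations apart during the migration:
old_series = format_metric(
    "rdkafka_producer_msg_cnt",
    {"cluster": "cache_text", "source": "webrequest", "software": "varnishkafka"},
    12345,
)
new_series = format_metric(
    "rdkafka_producer_msg_cnt",
    {"cluster": "cache_text", "source": "webrequest", "software": "atskafka"},
    12345,
)
```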
- The new tool should be written in a language with a strong librdkafka wrapper, or link to librdkafka directly (if written in C).
- The roll-out strategy will need to take into account that the new tool must report the same amount of traffic delivered to Kafka as varnishkafka does. It sounds obvious, but we'll need to make sure that the new tool does not contain bugs that cause data to be dropped without us noticing (even 1% of traffic silently dropped for some weird reason would be a big problem for us).
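One way to avoid silent drops is to count every delivery outcome via librdkafka's per-message delivery report, so that "sent" vs "failed" can be compared against varnishkafka's numbers during the roll-out. A sketch assuming the confluent-kafka Python binding (the wiring at the bottom is illustrative, not executed here):

```python
from collections import Counter

# Per-outcome delivery counters; in the real tool these would be
# exported as Prometheus series (with the `software` label discussed above).
delivery_stats = Counter()


def delivery_report(err, msg):
    """librdkafka-style delivery callback: invoked once per produced
    message, with `err` set when the message could NOT be delivered."""
    if err is not None:
        delivery_stats["failed"] += 1
    else:
        delivery_stats["delivered"] += 1


# Sketch of how this plugs into the assumed confluent-kafka Producer:
#
#   from confluent_kafka import Producer
#   p = Producer({"bootstrap.servers": "localhost:9092"})
#   p.produce("webrequest", line, callback=delivery_report)
#   p.poll(0)   # serve queued delivery callbacks
#   p.flush()   # on shutdown: wait for outstanding deliveries
```

Comparing `delivered + failed` against the number of lines read from the socket gives a cheap end-to-end accounting check during the migration.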