
Create replacement for Varnishkafka
Closed, ResolvedPublic

Description

As we all know, the time has come to start thinking about what is needed to replace the Varnish frontends with ATS. For Analytics, this means replacing varnishkafka.

Varnishkafka currently reads details about an HTTP request from the Varnish shared memory, assembles a JSON string and sends it to Kafka (via librdkafka). In the ATS world, IIUC, a special logger can be configured to send a string (containing data about an HTTP request) to a named pipe, which is in turn collected and exposed by fifo-log-demux via a local socket. In theory, the varnishkafka replacement should read log records from that socket and deliver them to Kafka.
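For illustration, a minimal sketch of that read-and-produce loop in Go, using confluent-kafka-go (the librdkafka wrapper eventually picked for atskafka). The socket path, broker and topic names are assumptions, and reconnection/error handling is omitted:

```go
package main

import (
	"bufio"
	"log"
	"net"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	// fifo-log-demux exposes the ATS log stream on a local Unix socket
	// (path is an assumption).
	conn, err := net.Dial("unix", "/var/run/trafficserver/analytics.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "kafka-jumbo1001.eqiad.wmnet:9092",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// Drain delivery reports and other librdkafka events in the background.
	go func() {
		for range p.Events() {
		}
	}()

	topic := "atskafka_test_webrequest_text"
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		// One log record per line; if ATS already emits valid JSON, the
		// record can be forwarded to Kafka as-is.
		value := append([]byte(nil), scanner.Bytes()...)
		err := p.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
			Value:          value,
		}, nil)
		if err != nil {
			log.Printf("produce: %v", err)
		}
	}
}
```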

High-level things to reason about:

  • ATS is currently able to produce a string to a given logger, and that string can be formatted in any way, including as JSON. This work is currently done by varnishkafka, which, after reading from shm, explicitly encodes the collected data into valid JSON (taking care of things like escaping, etc.). Ideally, to keep things simple, ATS could produce the JSON representation of an HTTP request directly to its logger, and the new tool would only need to read it and deliver it to Kafka. It needs to be investigated whether this is possible or whether some corner cases are left out (say, weird escaping).
  • Varnishkafka currently produces metrics to a JSON log file containing two kinds of data:
    • librdkafka internal metrics (TLS latency, msgs sent, etc.)
    • internal metrics, such as how many times the librdkafka delivery callback has been called for data that was not delivered to Kafka

These metrics need to be preserved in the new tool; this is vital for Analytics. The new implementation should produce the same metrics in a way that allows us to distinguish between those of varnishkafka and those of the new system. For example, we currently have rdkafka_producer_msg_cnt{cluster="cache_text", source="webrequest"}. We should add a new label identifying the implementation (e.g. software="varnishkafka"); see the sketch after the list below.

  • The new tool should support Prometheus natively
  • The language used to write the new tool should be one with a strong librdkafka wrapper, or the tool could rely on librdkafka directly (if written in C).
  • The roll-out strategy will need to take into account the fact that the new tool will need to report the same amount of traffic delivered to Kafka as varnishkafka does. It seems very obvious, but we'll need to make sure that the new tool does not contain bugs that cause data to be dropped without us noticing (even 1% of traffic dropped silently for a weird reason would be a big problem for us).
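As mentioned in the metrics bullet above, here is a hedged sketch of how the implementation label could look with the Prometheus Go client; the metric/label names and port are illustrative assumptions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Same metric name as today, plus a "software" label so dashboards can
// compare varnishkafka and its replacement side by side.
var msgCnt = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "rdkafka_producer_msg_cnt",
		Help: "Messages currently held by the librdkafka producer.",
	},
	[]string{"cluster", "source", "software"},
)

func main() {
	prometheus.MustRegister(msgCnt)

	// The new tool would label its series e.g. software="atskafka", while
	// the existing varnishkafka exporter would use software="varnishkafka".
	msgCnt.WithLabelValues("cache_text", "webrequest", "atskafka").Set(0)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9132", nil)) // port is illustrative
}
```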

Details

Repo | Branch | Lines +/-
operations/software/atskafka | master | +3 -3
operations/software/atskafka | master | +4 -2
operations/puppet | production | +4 -5
operations/puppet | production | +5 -4
operations/software/atskafka | master | +27 -8
operations/software/atskafka | master | +9 -8
operations/software/atskafka | master | +11 -2
operations/puppet | production | +2 -2
operations/software/atskafka | master | +25 -0
operations/software/atskafka | master | +56 -11
operations/software/atskafka | master | +30 -1
operations/software/atskafka | master | +6 -57
operations/software/atskafka | master | +24 -9
operations/software/atskafka | master | +45 -67
operations/software/atskafka | master | +62 -0
operations/software/atskafka | master | +50 -0
integration/config | master | +3 -0
operations/puppet | production | +1 -3
integration/config | master | +1 -1
integration/config | master | +1 -0
operations/software/atskafka | master | +33 -8
operations/puppet | production | +3 -3
operations/puppet | production | +6 -0
operations/puppet | production | +8 -8
operations/puppet | production | +37 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2020-01-08T14:35:34Z] <ema> repool cp4028 after successful X-Analytics-TLS patch test T237993

Change 562841 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: escape hyphens in X-Analytics-TLS patterns

https://gerrit.wikimedia.org/r/562841

Change 562841 merged by Ema:
[operations/puppet@production] ATS: escape hyphens in X-Analytics-TLS patterns

https://gerrit.wikimedia.org/r/562841

Change 562535 merged by Ema:
[operations/puppet@production] ATS: add webrequest logging for atskafka

https://gerrit.wikimedia.org/r/562535

Mentioned in SAL (#wikimedia-operations) [2020-02-12T09:26:05Z] <ema> cp4027: ats-tls-restart to enable analytics logging to pipe T237993

Mentioned in SAL (#wikimedia-operations) [2020-02-12T09:38:37Z] <ema> cp: rolling ats-tls-restart to enable analytics logging T237993

Change 572009 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: remove 'tls:' prefix from X-Analytics-TLS

https://gerrit.wikimedia.org/r/572009

Change 572009 merged by Ema:
[operations/puppet@production] ATS: remove 'tls:' prefix from X-Analytics-TLS

https://gerrit.wikimedia.org/r/572009

Change 578328 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Use confluent-kafka-go instead of segmentio/kafka-go

https://gerrit.wikimedia.org/r/578328

Change 578328 merged by Ema:
[operations/software/atskafka@master] Use confluent-kafka-go instead of segmentio/kafka-go

https://gerrit.wikimedia.org/r/578328

Change 578491 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Debianization

https://gerrit.wikimedia.org/r/578491

Change 578498 had a related patch set uploaded (by Ema; owner: Ema):
[integration/config@master] Test atskafka with debian-glue

https://gerrit.wikimedia.org/r/578498

Change 578507 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Add basic testing

https://gerrit.wikimedia.org/r/578507

Change 578532 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Use json.Marshal

https://gerrit.wikimedia.org/r/578532

Change 578498 merged by jenkins-bot:
[integration/config@master] Test atskafka with debian-glue

https://gerrit.wikimedia.org/r/578498

Mentioned in SAL (#wikimedia-releng) [2020-03-10T14:58:41Z] <James_F> Add CI for operations/software/atskafka T237993

Change 578539 had a related patch set uploaded (by Ema; owner: Ema):
[integration/config@master] atskafka: use -backports

https://gerrit.wikimedia.org/r/578539

Change 578539 merged by jenkins-bot:
[integration/config@master] atskafka: use -backports

https://gerrit.wikimedia.org/r/578539

Change 578543 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] package_builder: assume backports exist

https://gerrit.wikimedia.org/r/578543

Change 578543 merged by Ema:
[operations/puppet@production] package_builder: assume backports exist

https://gerrit.wikimedia.org/r/578543

Change 578549 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Handle rdkafka statistics

https://gerrit.wikimedia.org/r/578549

Change 578924 had a related patch set uploaded (by Ema; owner: Ema):
[integration/config@master] atskafka: build with PBUILDER_USENETWORK

https://gerrit.wikimedia.org/r/578924

Change 578924 merged by jenkins-bot:
[integration/config@master] atskafka: build with PBUILDER_USENETWORK

https://gerrit.wikimedia.org/r/578924

Change 578491 merged by Ema:
[operations/software/atskafka@master] Debianization

https://gerrit.wikimedia.org/r/578491

Change 578507 merged by Ema:
[operations/software/atskafka@master] Add basic testing

https://gerrit.wikimedia.org/r/578507

Change 578532 merged by Ema:
[operations/software/atskafka@master] Use json.Marshal

https://gerrit.wikimedia.org/r/578532

Change 578549 merged by Ema:
[operations/software/atskafka@master] Handle rdkafka statistics

https://gerrit.wikimedia.org/r/578549

Change 579283 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Define a test pipeline with Blubber

https://gerrit.wikimedia.org/r/579283

Change 579298 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Install dependencies from Debian

https://gerrit.wikimedia.org/r/579298

Change 579298 merged by Ema:
[operations/software/atskafka@master] Install dependencies from Debian

https://gerrit.wikimedia.org/r/579298

Change 579283 abandoned by Ema:
Define a test pipeline with Blubber

https://gerrit.wikimedia.org/r/579283

Mentioned in SAL (#wikimedia-operations) [2020-03-16T13:49:50Z] <ema> upload atskafka 0.1 to buster-wikimedia T237993

Change 580340 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Load rdkafka configuration from file

https://gerrit.wikimedia.org/r/580340

Change 580340 merged by Ema:
[operations/software/atskafka@master] Load rdkafka configuration from file

https://gerrit.wikimedia.org/r/580340

Change 580376 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] atskafka: convert float64 configuration values to int

https://gerrit.wikimedia.org/r/580376

Change 580376 merged by Ema:
[operations/software/atskafka@master] atskafka: convert float64 configuration values to int

https://gerrit.wikimedia.org/r/580376

Mentioned in SAL (#wikimedia-operations) [2020-03-18T11:17:15Z] <ema> upload atskafka 0.3 to buster-wikimedia T237993

Change 580986 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Add datetime and sequence number

https://gerrit.wikimedia.org/r/580986

Change 580987 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Do not append to stats file

https://gerrit.wikimedia.org/r/580987

Change 581435 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add dt to atskafka log format

https://gerrit.wikimedia.org/r/581435

Change 581435 merged by Ema:
[operations/puppet@production] ATS: add dt to atskafka log format

https://gerrit.wikimedia.org/r/581435

Change 580986 merged by Ema:
[operations/software/atskafka@master] Add sequence number

https://gerrit.wikimedia.org/r/580986

Change 580987 merged by Ema:
[operations/software/atskafka@master] Do not append to stats file

https://gerrit.wikimedia.org/r/580987

Mentioned in SAL (#wikimedia-operations) [2020-03-19T10:47:24Z] <ema> upload atskafka 0.4 to buster-wikimedia T237993

Change 582060 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Reopen Unix socket upon read errors

https://gerrit.wikimedia.org/r/582060

Change 582060 merged by Ema:
[operations/software/atskafka@master] Reopen Unix socket upon read errors

https://gerrit.wikimedia.org/r/582060

Mentioned in SAL (#wikimedia-operations) [2020-03-23T16:31:53Z] <ema> upload atskafka 0.5 to buster-wikimedia T237993

Change 586298 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Disable wmf-analytics log

https://gerrit.wikimedia.org/r/586298

Change 586298 merged by Vgutierrez:
[operations/puppet@production] ATS: Disable wmf-analytics log

https://gerrit.wikimedia.org/r/586298

Mentioned in SAL (#wikimedia-operations) [2020-04-06T07:54:53Z] <vgutierrez> rolling restart of ats-tls to disable wmf-analytics log - T249335 T237993

Change 587422 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: Disable wmf-analytics log"

https://gerrit.wikimedia.org/r/587422

Change 587422 merged by Ema:
[operations/puppet@production] Revert "ATS: Disable wmf-analytics log"

https://gerrit.wikimedia.org/r/587422

There is currently too much data flowing to Kafka: for cp3050 we have 36GB * 12 partitions for a single day, definitely too much.

I took a look at the Kafka messages and, using kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t atskafka_test_webrequest_text | grep ":\"000" --color, it is easy to see a lot of records containing things like "http_status":"000","ip":"-" which I think may be internals of ATS?

@ema let's figure out what they are; we'll probably have to filter them before Kafka :)

There is currently too much data flowing to Kafka: for cp3050 we have 36GB * 12 partitions for a single day, definitely too much.

How much data do we have for cp3052?

Ema and I had several chats on IRC; here is a summary:

  • HTTP status 000 seems to be used for clients that have trouble completing an HTTP request to ats-tls, without even reaching the Varnish frontend layer. Ema found that we have ~100 rps of HTTP 000 logged in esams for ats-tls, which matches exactly what we see as transaction errors in Grafana.
  • Ema is going to filter out the above traffic from the analytics ats-tls log stream (a filtering sketch follows this list).
  • In order to get a better comparison between varnishkafka and atskafka, we need to add Prometheus metrics. We discussed re-using the varnishkafka exporter vs adding metrics directly to atskafka; the latter seems the way to go.
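For illustration, a rough sketch of this kind of regular-expression filtering in Go; the pattern and the stdin/stdout plumbing are assumptions (in atskafka the records would come from the ATS socket and go to Kafka):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Records with status "000" never reached the cache layer and should
	// not be delivered to Kafka (pattern is an assumption).
	exclude := regexp.MustCompile(`"http_status":"000"`)

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if exclude.MatchString(line) {
			continue // drop the record before it reaches Kafka
		}
		fmt.Println(line) // atskafka would produce this to Kafka instead
	}
}
```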

Change 594116 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Filter logs by regular expression

https://gerrit.wikimedia.org/r/594116

  • HTTP status 000 seems to be used for clients that have trouble completing an HTTP request to ats-tls, without even reaching the Varnish frontend layer. Ema found that we have ~100 rps of HTTP 000 logged in esams for ats-tls, which matches exactly what we see as transaction errors in Grafana.
  • Ema is going to filter out the above traffic from the analytics ats-tls log stream.

I've opened https://gerrit.wikimedia.org/r/#/c/operations/software/atskafka/+/594116/ for the above.

  • In order to get a better comparison between varnishkafka and atskafka, we need to add Prometheus metrics. We discussed re-using the varnishkafka exporter vs adding metrics directly to atskafka; the latter seems the way to go.

Significant discussion happened on T196066 in the past regarding rdkafka-prometheus integration. To sum up here, the idea proposed by @Ottomata was to add prometheus integration directly to rdkafka, which the kafka-clients mailing list did not object to, but the discussion seems to have stopped there. We did end up writing an exporter named prometheus-varnishkafka-exporter. Considering that the exporter is not varnish-specific, I propose renaming it to prometheus-rdkafka-exporter and using it everywhere we use rdkafka instead of re-implementing the logic in golang (and node, and php, and whatever language we use rdkafka with).

I propose renaming it to prometheus-rdkafka-exporter and using it every

Sounds great! We could then use it in node instead of https://github.com/wikimedia/node-rdkafka-statsd

I propose renaming it to prometheus-rdkafka-exporter and using it every

Sounds great! We could then use it in node instead of https://github.com/wikimedia/node-rdkafka-statsd

Yup. What isn't ideal in this approach, though, is that we either expose two different sets of metrics (application-specific metrics with the prometheus golang library, rdkafka metrics with prometheus-rdkafka-exporter) or we have to choose between the two. I see three options, none of which is great. :-)
Option (a) is to use the prometheus golang library and convert in golang from struct kafka.Stats (what we currently dump to disk as JSON) to prometheus format. The advantage is that adding application-specific metrics is straightforward and done in a standard way, plus we get all the very useful golang runtime metrics for free. The disadvantage is that we need to parse struct kafka.Stats and convert all fields to prometheus, re-inventing the wheel and almost certainly ending up with differences between, say, rdkafka metrics exposed by atskafka and those exposed by other applications. We could write a go library for this of course, but what about rdkafka programs written in other languages?

Option (b) is to use an external exporter such as prometheus-rdkafka-exporter. The advantage is that we get uniformity among programs using rdkafka; the disadvantage is that adding application-specific metrics becomes cumbersome. We need to essentially add custom metrics to the data structure dumped to disk as JSON, ensure counter atomicity, and in general take concurrency into account.

Option (c) is to use an external exporter for rdkafka metrics and make application-level metrics available using the native golang (or node, or whatever) prometheus library. We get all the advantages of (a) and all the advantages of (b), but we end up with two prometheus exporters for each program (a sketch of this split follows).
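For illustration, a minimal sketch of what option (c) could look like in atskafka, assuming the stats file path, port and metric name; confluent-kafka-go delivers librdkafka's stats JSON as *kafka.Stats events when statistics.interval.ms is set:

```go
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/confluentinc/confluent-kafka-go/kafka"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Application-level metric, exposed natively via the Go Prometheus client.
var linesRead = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "atskafka_lines_read_total",
	Help: "Log lines read from the ATS socket (illustrative name).",
})

func main() {
	prometheus.MustRegister(linesRead)

	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers":      "localhost:9092",
		"statistics.interval.ms": 60000, // ask librdkafka to emit stats events
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// rdkafka metrics: write the stats JSON to a file, to be scraped by an
	// external exporter such as the proposed prometheus-rdkafka-exporter.
	go func() {
		for ev := range p.Events() {
			if stats, ok := ev.(*kafka.Stats); ok {
				os.WriteFile("/var/cache/atskafka/stats.json", []byte(stats.String()), 0644)
			}
		}
	}()

	// ... read from the ATS socket, call linesRead.Inc() per record, produce ...

	// Application metrics: exposed natively over HTTP.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9133", nil)) // port is illustrative
}
```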

We need to essentially add custom metrics to the data structure dumped to disk as JSON,

Wouldn't a prometheus-rdkafka-exporter expose the metrics via HTTP?

What isn't ideal in this approach, though, is that we either expose two different sets of metrics (application-specific metrics with the prometheus golang library, rdkafka metrics with prometheus-rdkafka-exporter) or we have to choose between the two.

Isn't this the case for all apps that would use an rdkafka exporter? I do this in NodeJS now with basically just a callback. The rdkafka metrics are given to the callback, and the callback is responsible for parsing and exporting them somehow. Hmmm... perhaps it would be better to make an rdkafka exporter not actually 'export' metrics via HTTP, but just reformat them for Prometheus and then provide them to a callback? Then your app could include them in its own HTTP exporter (a rough sketch of this idea follows).
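A hypothetical sketch of that callback idea in Go; only two top-level librdkafka stats fields are mapped, and the package, function and metric names are made up for illustration:

```go
package rdkafkaprom

import (
	"encoding/json"

	"github.com/prometheus/client_golang/prometheus"
)

// Gauges backed by two top-level fields of the librdkafka statistics JSON.
var (
	queuedMsgs = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "rdkafka_msg_cnt",
		Help: "Messages currently in librdkafka producer queues.",
	})
	txMsgs = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "rdkafka_txmsgs",
		Help: "Messages transmitted to brokers (cumulative value reported by librdkafka).",
	})
)

func init() {
	prometheus.MustRegister(queuedMsgs, txMsgs)
}

// UpdateFromStats is the "callback": the application feeds it the stats JSON
// emitted by librdkafka (e.g. the payload of a *kafka.Stats event) and the
// gauges are refreshed, ready to be served by the app's own /metrics handler.
func UpdateFromStats(statsJSON string) error {
	var s struct {
		MsgCnt float64 `json:"msg_cnt"`
		TxMsgs float64 `json:"txmsgs"`
	}
	if err := json.Unmarshal([]byte(statsJSON), &s); err != nil {
		return err
	}
	queuedMsgs.Set(s.MsgCnt)
	txMsgs.Set(s.TxMsgs)
	return nil
}
```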

Change 594116 merged by Ema:
[operations/software/atskafka@master] Filter logs by regular expression

https://gerrit.wikimedia.org/r/594116

Chiming in with two cents and my Prometheus hat:

I agree with @ema that none of the options are great unfortunately. Rephrasing to make sure I understand: the major problem is making sure that the mapping from struct kafka.Stats to Prometheus metrics doesn't get fragmented across many different exporters and languages. Ideally the mapping is provided upstream but that's not the case yet.

Of all the options I believe (c) seems the cleanest to me from a logical standpoint: keep the application and rdkafka metrics separated. Deployment-wise it will be a bit clunky (an rdkafka exporter per application), but admittedly that's already the case nowadays.

I don't know how big of an undertaking it'll be, but a possible option (d) would be to talk to upstream and see if we could help make rdkafka Prometheus metrics a reality.

HTH!

Rephrasing to make sure I understand: the major problem is making sure that the mapping from struct kafka.Stats to Prometheus metrics doesn't get fragmented across many different exporters and languages. Ideally the mapping is provided upstream but that's not the case yet.

Correct.

Of all the options I believe (c) seems the cleanest to me from a logical standpoint: keep the application and rdkafka metrics separated.

OK, I will take this route then. Thank you!

Change 595472 had a related patch set uploaded (by Ema; owner: Ema):
[operations/software/atskafka@master] Release 0.6

https://gerrit.wikimedia.org/r/595472

Change 595472 merged by Ema:
[operations/software/atskafka@master] Release 0.6

https://gerrit.wikimedia.org/r/595472

Mentioned in SAL (#wikimedia-operations) [2020-05-11T08:30:57Z] <ema> cp3050: upgrade atskafka to 0.6 T237993

ema claimed this task.

Closing this task now given that an initial version of atskafka has been created and deployed. Further improvements such as T253197, as well as the work required to extend the deployment, should be tracked in separate tasks.