
varnishkafka integration with Varnish 4 for analytics
Closed, ResolvedPublic34 Estimated Story Points

Description

Could be a blocker for the Varnish 4 update, but we need precise requirements.

Event Timeline

ema raised the priority of this task from to Needs Triage.
ema updated the task description. (Show Details)
ema added projects: SRE, Traffic.
ema added subscribers: ema, Southparkfan, Aklapper and 4 others.

Just so it doesn't get lost in this process: https://gerrit.wikimedia.org/r/#/c/230173/

I still want to merge that and use it one day... :)

elukey set Security to None.
Milimetric edited projects, added Analytics; removed Analytics-Kanban.
Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.
Milimetric updated the task description. (Show Details)

Summary of my findings so far:

Varnish logs various kinds of data (statistics, handled requests, etc.) in a shared memory file rather than a regular file, to avoid disk I/O as much as possible (tmpfs can also be used). Varnish organizes this shared memory file into "chunks", each with a specific tag, and each external process like varnishkafka can read them and take some action (in this case, publishing the data to a Kafka topic). More info in:

https://www.varnish-cache.org/docs/4.1/reference/vsm.html#the-varnish-way
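To make the mechanism concrete, here is a toy model of it (NOT the real VSM API; class and method names are invented): a writer appends tagged records to a shared region, and each external reader keeps its own cursor, so consuming log data needs no disk I/O.

```python
# Toy model of shared-memory logging (NOT the real VSM API):
# the writer appends tagged records, readers follow with a cursor.
class ShmLog:
    def __init__(self):
        self.records = []            # stands in for the mmap'd region

    def append(self, tag, data):     # writer side (varnishd)
        self.records.append((tag, data))

class Reader:
    def __init__(self, log):
        self.log = log
        self.cursor = 0              # each reader tracks its own position

    def poll(self):                  # reader side (e.g. varnishkafka)
        new = self.log.records[self.cursor:]
        self.cursor = len(self.log.records)
        return new

log = ShmLog()
vk = Reader(log)
log.append("ReqMethod", "GET")
log.append("ReqURL", "/")
print(vk.poll())  # [('ReqMethod', 'GET'), ('ReqURL', '/')]
print(vk.poll())  # [] -- nothing new since the last poll
```

The key property this sketch tries to show is that multiple readers can consume the same log independently without coordinating with the writer.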

I tried to build varnishkafka on Debian Jessie plus the varnish repos from https://www.varnish-cache.org/installation/debian, with these package versions:

vagrant@debian-jessie:~$ dpkg --list | grep varnish
ii  libvarnishapi-dev              4.1.1-1~jessie              amd64        development files for Varnish
ii  libvarnishapi1                 4.1.1-1~jessie              amd64        shared libraries for Varnish
ii  varnish                        4.1.1-1~jessie              amd64        state of the art, high-performance web accelerator

varnishkafka didn't compile due to structural changes in the varnish APIs (headers, paths, etc.). The following guides turned out to be really useful:

https://www.varnish-cache.org/sites/default/files/Varnish%204%20Logging%20API.pdf
https://www.varnish-cache.org/sites/default/files/05-VUG8-Logging%20in%20Varnish%204_0.pdf
https://www.varnish-cache.org/docs/4.1/reference/vsl.html
http://info.varnish-software.com/blog/logging-varnish-40

I then compared the source diffs of varnishncsa (which dumps the varnish shm into Apache-like file logs) between 3.0 and 4.1 and found something really useful (also highlighted in the above presentations): there is a new library called Varnish Utils (lib/libvarnishtools/vut.c) that is shared among all the varnish utility binaries and does a lot of their heavy lifting in dealing with Varnish and the shm. It turned out, though, that the related header is not installed in /usr/include/varnish/* due to a rule in Makefile.am (thanks @ema for the help) that treats this util lib not as "API" but as a varnish-only build requirement. Alternatively, it is possible to use the new API (starting from vapi/vsl.h) to update the existing code.

Still need to figure out the best way to proceed, but the Varnish Utils library seems really nice and it would be great to re-use existing code that is already thoroughly tested and validated by the Varnish team.

OH awesome! Is the Varnish Utils lib's ABI similar enough to just work with the existing varnishkafka code?

@Ottomata not sure yet, but it would be nice to have a single source code branch.

The next step is for me and @ema to figure out if and how we could use the utils library. Major issues: 1) deb packaging since it is not available in the varnish/debian repo 2) does it fit our use case or are there any weird corner cases?

Although it would be great for libvarnishtools to be a proper shared library, that is apparently not going to happen anytime soon in varnish itself, maybe for 5.0.

Let's simply copy the relevant files in varnishkafka's tree.

So the plan would be to create two separate branches:

  1. 3.X containing the current codebase
  2. 4.X containing the vut.c/.h files to leverage the tools library, or simply using the new API directly

@Ottomata I don't believe we can avoid this, because the API has changed radically :( but we can have a chat about it!

Sounds good. Let’s make a new 3.x branch now, and make master be for 4.x
support.

Would it be helpful to build a varnish utils .deb package instead of
directly adding sources? If so, I can help with that.

Adding a link that might be useful in the future if we want to move away from varnishkafka written in C:

https://github.com/xcir/python-varnishapi (available for Varnish 3 and 4)

@Ottomata: I discussed the new deb util package with @ema, but it would be really painful to manage for people outside WMF if we update only our repositories (a lot of people use the Varnish repositories too). The quickest and least painful solution is adding the .c and .h files directly to varnishkafka for the moment, and working with upstream to include the package in a future release. What do you think?

Sounds good to me!

Re: python varnish C types: cool! I didn’t know there was an existing lib
out there for that. Ori and I wrote some python varnish bindings[1], and
we use those for request stats[2] and other things[3]. I doubt we’d want
to not use the C API for varnishkafka though. The librdkafka library is
very good; probably better than any python client.

[1]
https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishlog.py
[2]
https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishreqstats
[3]
https://github.com/wikimedia/operations-puppet/blob/production/modules/varnish/files/varnishstatsd

Change 274135 had a related patch set uploaded (by Ema):
Preiliminary port to new VSL API

https://gerrit.wikimedia.org/r/274135

So the trivial part is done, see https://gerrit.wikimedia.org/r/274135. Now we need to figure out the tricky one. :)

Essentially, the following symbols are gone:

  • VSL_S_CLIENT, VSL_S_BACKEND
  • VSL_Matched
  • VSL_Dispatch
  • SLT_ReqEnd

VSL_S_CLIENT and VSL_S_BACKEND might be particularly problematic as they seem to be used in a smart way by varnishkafka, see https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L908 and https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1182

Finally, we should now have a good picture, from a general overview, of what changed between the 3.0 and 4.0 APIs.

To summarize:

  • A transaction in the 4.0 world is a set of log entries with something in common, e.g. a client request and its backend request.
  • Transactions can be grouped together to form a transaction group, by session/request/etc. In this case, if a client request triggers multiple backend calls, everything is grouped together.
  • A tag is a single log entry within the transaction (backend request, etc.).
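The grouping described above can be sketched as follows (illustrative only: the record shape is modelled on VSL output and `group_by_request` is a made-up helper, not the varnishapi):

```python
# Illustrative sketch of request grouping: collect flat VSL-style
# records per vxid, then attach backend (child) transactions to
# their parent client transaction.
from collections import defaultdict

# (vxid, parent_vxid, tag, data)
records = [
    (32773, 0,     "ReqMethod",   "GET"),
    (32773, 0,     "ReqURL",      "/"),
    (32774, 32773, "BereqMethod", "GET"),
    (32773, 0,     "RespStatus",  "200"),
]

def group_by_request(recs):
    tx, parent = defaultdict(list), {}
    for vxid, par, tag, data in recs:
        tx[vxid].append((tag, data))
        parent[vxid] = par
    groups = defaultdict(dict)
    for vxid, entries in tx.items():
        root = vxid
        while parent.get(root, 0) != 0:   # walk up to the client tx
            root = parent[root]
        groups[root][vxid] = entries
    return dict(groups)

g = group_by_request(records)
print(sorted(g[32773]))  # [32773, 32774] -- one group per client request
```

This mirrors what "group by request" does: the client transaction and every backend transaction it triggered end up in one group.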

Varnishkafka sets a callback for each "tag" read from the varnish shm (because transactions weren't there in 3.0) and, using a parser plus other matching heuristics, establishes whether the tag needs to be formatted and sent to Kafka.
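A minimal sketch of that per-tag callback model (hypothetical handler registry, not varnishkafka's actual code): each tag read from the shm fires a handler, and a sentinel tag (SLT_ReqEnd in Varnish 3) marks the line as complete.

```python
# Per-tag callback dispatch, Varnish 3 style (illustrative only).
handlers = {}
line = {}

def on_tag(tag):
    """Register a handler for one VSL tag."""
    def register(fn):
        handlers[tag] = fn
        return fn
    return register

@on_tag("ReqMethod")
def _(data):
    line["method"] = data

@on_tag("ReqURL")
def _(data):
    line["url"] = data

def dispatch(tag, data):
    """Run the handler for this tag; report whether the request ended."""
    if tag in handlers:
        handlers[tag](data)
    return tag == "ReqEnd"

for tag, data in [("ReqMethod", "GET"), ("ReqURL", "/"), ("ReqEnd", "")]:
    if dispatch(tag, data):
        print(line)  # {'method': 'GET', 'url': '/'} -- format and send here
```

The absence of SLT_ReqEnd in Varnish 4 (noted earlier) is exactly why this model breaks: with transactions, completion is signalled by the grouping API rather than by a tag.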

VUT helps a lot with this; I updated Ema's patch with a preliminary example of the code to add (discussed with Ema during pair programming).

The main problem now remains the function that parses the log entries, since it diverged a lot from varnishncsa's (the author implemented a quicker way to do it) and it is a bit complicated to debug. We are getting closer though :)

Summary after today's brainstorming with @ema:

  • We made some progress on the remaining unclear points about the parse* functions; some testing is needed to confirm our suspicions.
  • varnishncsa has been rewritten from scratch for Varnish 4 (to use vut.c properly), and maintaining varnishkafka feels a bit like re-inventing the wheel.
  • Would it be feasible to just reuse varnishncsa and pipe its output (Apache-like log format) into a piece of software that parses it, transforms it into JSON, and publishes the result to a Kafka topic? Basically, more or less the opposite of what https://github.com/wikimedia/analytics-kafkatee does. This choice would guarantee 1) completely reusing varnish's code and no longer maintaining varnishkafka; 2) a more configurable piece of software for analytics that just parses some input and publishes to Kafka. There are still some performance doubts to resolve, but it looks like a good way to proceed in our opinion.
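The proposed pipeline could be sketched like this (hedged: `produce` is a placeholder for a librdkafka-based Kafka client, and the field names are illustrative, not a real schema):

```python
# Sketch: varnishncsa output line -> JSON -> Kafka producer.
import json
import shlex

FIELDS = ["client_ip", "ident", "user", "timestamp",
          "request", "status", "bytes"]

def ncsa_to_json(ncsa_line):
    # Treat the bracketed timestamp as a quoted token so shlex keeps
    # it (and the quoted request line) together as single fields.
    parts = shlex.split(ncsa_line.replace("[", '"').replace("]", '"'))
    return json.dumps(dict(zip(FIELDS, parts)))

def produce(topic, message):         # placeholder for a Kafka producer
    print(topic, message)

sample = '::1 - - [03/Mar/2016:14:38:06 +0000] "GET / HTTP/1.1" 200 11104'
produce("webrequest", ncsa_to_json(sample))
```

The performance concern raised later in the thread is visible even here: each line is formatted by varnishncsa, re-parsed, and re-encoded as JSON, i.e. two encodes and one decode per record.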

To maximize parallelism, I am going to keep working on varnishkafka to see if the code can be updated, and @ema is going to evaluate the possibility of creating the prototype described above.

@ori, @Ottomata: any thoughts on what's written above? Your input would be really valuable :)

Would it be feasible to just reuse varnishncsa and pipe its output (Apache-like log format) into a piece of software that parses it, transforms it into JSON, and publishes the result to a Kafka topic?

Iiinnnteresting. At first the idea does not sound promising. From what I know, varnishkafka does its best to be as efficient as possible, constructing messages directly out of VSL structures.

What about modifying varnishncsa to add librdkafka integration? Before we used varnishkafka, we modified the old varnishncsa to be able to emit via UDP instead of just to files.

varnishkafka goes out of its way to be flexible with format strings, allowing us to specify how to transform VSL entries into a JSON object. I'd hope that if VSL entries were addressable within this new 'transaction', it would be easy enough to reuse the formatting logic. But, I haven't looked at the new code, so I'm not sure.
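The format-string idea can be illustrated as follows (the %-token names here are made up, not varnishkafka's actual ones): a configuration maps tokens to record fields and the formatter emits JSON.

```python
# Illustrative format-string -> JSON formatter (hypothetical tokens).
import json

TOKENS = {"%m": "method", "%U": "url", "%s": "status"}

def format_record(fmt, record):
    out = {}
    for token in fmt.split():
        field = TOKENS[token]        # unknown tokens would raise here
        out[field] = record[field]
    return json.dumps(out)

rec = {"method": "GET", "url": "/wiki", "status": 200}
print(format_record("%m %U %s", rec))  # {"method": "GET", "url": "/wiki", "status": 200}
```

The point of the design is that the output schema lives in configuration, not code, so the same binary can feed different Kafka topics with different JSON shapes.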

@Ottomata: the code is, as you were saying, really optimized to map VSL structures to log tags in Apache format, but it essentially does what ncsa does. Our idea is that we should be decoupled from Varnish development as much as possible, because maintaining varnishkafka might become (or already is) a big burden for us. I will keep working on it; this idea is something to think about :)

Moreover, I set up Varnish 4 and Apache on a Debian VM, and used:

https://github.com/xcir/python-varnishapi/blob/varnish40/sample/sample_varnishlog.py

Result on a simple GET request, using VSL grouping by request:

Note: a single request generated all these logs. The main transaction is vxid:32773 with parent 0, and at some point a backend transaction is fired with new vxid:32774 and parent vxid:32773.

vagrant@debian-jessie:~/python-varnishapi/sample$ sudo python2.7 sample_varnishlog.py
level:1 vxid:32773 vxid_parent:0 tag:Begin var: type:c data:req 32772 rxreq (isbin=0,len=16)
level:1 vxid:32773 vxid_parent:0 tag:Timestamp var: type:c data:Start: 1457015886.971352 0.000000 0.000000 (isbin=0,len=43)
level:1 vxid:32773 vxid_parent:0 tag:Timestamp var: type:c data:Req: 1457015886.971352 0.000000 0.000000 (isbin=0,len=41)
level:1 vxid:32773 vxid_parent:0 tag:ReqStart var:client.ip type:c data:::1 48736 (isbin=0,len=10)
level:1 vxid:32773 vxid_parent:0 tag:ReqMethod var:req.method type:c data:GET (isbin=0,len=4)
level:1 vxid:32773 vxid_parent:0 tag:ReqURL var:req.url type:c data:/ (isbin=0,len=2)
level:1 vxid:32773 vxid_parent:0 tag:ReqProtocol var:req.proto type:c data:HTTP/1.1 (isbin=0,len=9)
level:1 vxid:32773 vxid_parent:0 tag:ReqHeader var:req.http.User-Agent type:c data:User-Agent: curl/7.38.0 (isbin=0,len=24)
level:1 vxid:32773 vxid_parent:0 tag:ReqHeader var:req.http.Host type:c data:Host: localhost:6081 (isbin=0,len=21)
level:1 vxid:32773 vxid_parent:0 tag:ReqHeader var:req.http.Accept type:c data:Accept: */* (isbin=0,len=12)
level:1 vxid:32773 vxid_parent:0 tag:ReqHeader var:req.http.X-Forwarded-For type:c data:X-Forwarded-For: ::1 (isbin=0,len=21)
level:1 vxid:32773 vxid_parent:0 tag:VCL_call var: type:c data:RECV (isbin=0,len=5)
level:1 vxid:32773 vxid_parent:0 tag:VCL_return var: type:c data:hash (isbin=0,len=5)
level:1 vxid:32773 vxid_parent:0 tag:VCL_call var: type:c data:HASH (isbin=0,len=5)
level:1 vxid:32773 vxid_parent:0 tag:VCL_return var: type:c data:lookup (isbin=0,len=7)
level:1 vxid:32773 vxid_parent:0 tag:VCL_call var: type:c data:MISS (isbin=0,len=5)
level:1 vxid:32773 vxid_parent:0 tag:VCL_return var: type:c data:fetch (isbin=0,len=6)
level:1 vxid:32773 vxid_parent:0 tag:Link var: type:c data:bereq 32774 fetch (isbin=0,len=18)
level:1 vxid:32773 vxid_parent:0 tag:Timestamp var: type:c data:Fetch: 1457015886.972243 0.000890 0.000890 (isbin=0,len=43)
level:1 vxid:32773 vxid_parent:0 tag:RespProtocol var:resp.proto type:c data:HTTP/1.1 (isbin=0,len=9)
level:1 vxid:32773 vxid_parent:0 tag:RespStatus var:resp.status type:c data:200 (isbin=0,len=4)
level:1 vxid:32773 vxid_parent:0 tag:RespReason var:resp.reason type:c data:OK (isbin=0,len=3)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Date type:c data:Date: Thu, 03 Mar 2016 14:38:06 GMT (isbin=0,len=36)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Server type:c data:Server: Apache/2.4.10 (Debian) (isbin=0,len=31)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Last-Modified type:c data:Last-Modified: Thu, 03 Mar 2016 13:52:55 GMT (isbin=0,len=45)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.ETag type:c data:ETag: "2b60-52d2550c6c8ce-gzip" (isbin=0,len=32)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Vary type:c data:Vary: Accept-Encoding (isbin=0,len=22)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Content-Encoding type:c data:Content-Encoding: gzip (isbin=0,len=23)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Content-Length type:c data:Content-Length: 3078 (isbin=0,len=21)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Content-Type type:c data:Content-Type: text/html (isbin=0,len=24)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.X-Varnish type:c data:X-Varnish: 32773 (isbin=0,len=17)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Age type:c data:Age: 0 (isbin=0,len=7)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Via type:c data:Via: 1.1 varnish-v4 (isbin=0,len=20)
level:1 vxid:32773 vxid_parent:0 tag:RespUnset var:unset resp.http.ETag type:c data:ETag: "2b60-52d2550c6c8ce-gzip" (isbin=0,len=32)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.ETag type:c data:ETag: W/"2b60-52d2550c6c8ce-gzip" (isbin=0,len=34)
level:1 vxid:32773 vxid_parent:0 tag:VCL_call var: type:c data:DELIVER (isbin=0,len=8)
level:1 vxid:32773 vxid_parent:0 tag:VCL_return var: type:c data:deliver (isbin=0,len=8)
level:1 vxid:32773 vxid_parent:0 tag:Timestamp var: type:c data:Process: 1457015886.972262 0.000910 0.000020 (isbin=0,len=45)
level:1 vxid:32773 vxid_parent:0 tag:RespUnset var:unset resp.http.Content-Encoding type:c data:Content-Encoding: gzip (isbin=0,len=23)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Accept-Ranges type:c data:Accept-Ranges: bytes (isbin=0,len=21)
level:1 vxid:32773 vxid_parent:0 tag:RespUnset var:unset resp.http.Content-Length type:c data:Content-Length: 3078 (isbin=0,len=21)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Content-Length type:c data:Content-Length: 11104 (isbin=0,len=22)
level:1 vxid:32773 vxid_parent:0 tag:Debug var: type:c data:RES_MODE 42 (isbin=2,len=12)
level:1 vxid:32773 vxid_parent:0 tag:RespHeader var:resp.http.Connection type:c data:Connection: keep-alive (isbin=0,len=23)
level:1 vxid:32773 vxid_parent:0 tag:Gzip var: type:c data:U D - 3078 11104 80 80 24554 (isbin=0,len=29)
level:1 vxid:32773 vxid_parent:0 tag:Timestamp var: type:c data:Resp: 1457015886.972353 0.001001 0.000090 (isbin=0,len=42)
level:1 vxid:32773 vxid_parent:0 tag:ReqAcct var: type:c data:78 0 78 333 11104 11437 (isbin=0,len=24)
level:1 vxid:32773 vxid_parent:0 tag:End var: type:c data: (isbin=0,len=1)
level:2 vxid:32774 vxid_parent:32773 tag:Begin var: type:b data:bereq 32773 fetch (isbin=0,len=18)
level:2 vxid:32774 vxid_parent:32773 tag:Timestamp var: type:b data:Start: 1457015886.971398 0.000000 0.000000 (isbin=0,len=43)
level:2 vxid:32774 vxid_parent:32773 tag:BereqMethod var:bereq.method type:b data:GET (isbin=0,len=4)
level:2 vxid:32774 vxid_parent:32773 tag:BereqURL var:bereq.url type:b data:/ (isbin=0,len=2)
level:2 vxid:32774 vxid_parent:32773 tag:BereqProtocol var:bereq.proto type:b data:HTTP/1.1 (isbin=0,len=9)
level:2 vxid:32774 vxid_parent:32773 tag:BereqHeader var:bereq.http.User-Agent type:b data:User-Agent: curl/7.38.0 (isbin=0,len=24)
level:2 vxid:32774 vxid_parent:32773 tag:BereqHeader var:bereq.http.Host type:b data:Host: localhost:6081 (isbin=0,len=21)
level:2 vxid:32774 vxid_parent:32773 tag:BereqHeader var:bereq.http.Accept type:b data:Accept: */* (isbin=0,len=12)
level:2 vxid:32774 vxid_parent:32773 tag:BereqHeader var:bereq.http.X-Forwarded-For type:b data:X-Forwarded-For: ::1 (isbin=0,len=21)
level:2 vxid:32774 vxid_parent:32773 tag:BereqHeader var:bereq.http.Accept-Encoding type:b data:Accept-Encoding: gzip (isbin=0,len=22)
level:2 vxid:32774 vxid_parent:32773 tag:BereqHeader var:bereq.http.X-Varnish type:b data:X-Varnish: 32774 (isbin=0,len=17)
level:2 vxid:32774 vxid_parent:32773 tag:VCL_call var: type:b data:BACKEND_FETCH (isbin=0,len=14)
level:2 vxid:32774 vxid_parent:32773 tag:VCL_return var: type:b data:fetch (isbin=0,len=6)
level:2 vxid:32774 vxid_parent:32773 tag:BackendOpen var: type:b data:18 boot.default 127.0.0.1 80 127.0.0.1 55066 (isbin=0,len=45)
level:2 vxid:32774 vxid_parent:32773 tag:Timestamp var: type:b data:Bereq: 1457015886.971473 0.000075 0.000075 (isbin=0,len=43)
level:2 vxid:32774 vxid_parent:32773 tag:Timestamp var: type:b data:Beresp: 1457015886.972089 0.000691 0.000616 (isbin=0,len=44)
level:2 vxid:32774 vxid_parent:32773 tag:BerespProtocol var:beresp.proto type:b data:HTTP/1.1 (isbin=0,len=9)
level:2 vxid:32774 vxid_parent:32773 tag:BerespStatus var:beresp.status type:b data:200 (isbin=0,len=4)
level:2 vxid:32774 vxid_parent:32773 tag:BerespReason var:beresp.reason type:b data:OK (isbin=0,len=3)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Date type:b data:Date: Thu, 03 Mar 2016 14:38:06 GMT (isbin=0,len=36)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Server type:b data:Server: Apache/2.4.10 (Debian) (isbin=0,len=31)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Last-Modified type:b data:Last-Modified: Thu, 03 Mar 2016 13:52:55 GMT (isbin=0,len=45)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.ETag type:b data:ETag: "2b60-52d2550c6c8ce-gzip" (isbin=0,len=32)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Accept-Ranges type:b data:Accept-Ranges: bytes (isbin=0,len=21)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Vary type:b data:Vary: Accept-Encoding (isbin=0,len=22)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Content-Encoding type:b data:Content-Encoding: gzip (isbin=0,len=23)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Content-Length type:b data:Content-Length: 3078 (isbin=0,len=21)
level:2 vxid:32774 vxid_parent:32773 tag:BerespHeader var:beresp.http.Content-Type type:b data:Content-Type: text/html (isbin=0,len=24)
level:2 vxid:32774 vxid_parent:32773 tag:TTL var: type:b data:RFC 120 10 -1 1457015887 1457015887 1457015886 0 0 (isbin=0,len=51)
level:2 vxid:32774 vxid_parent:32773 tag:VCL_call var: type:b data:BACKEND_RESPONSE (isbin=0,len=17)
level:2 vxid:32774 vxid_parent:32773 tag:VCL_return var: type:b data:deliver (isbin=0,len=8)
level:2 vxid:32774 vxid_parent:32773 tag:Storage var: type:b data:malloc s0 (isbin=0,len=10)
level:2 vxid:32774 vxid_parent:32773 tag:ObjProtocol var:obj.proto type:b data:HTTP/1.1 (isbin=0,len=9)
level:2 vxid:32774 vxid_parent:32773 tag:ObjStatus var:obj.status type:b data:200 (isbin=0,len=4)
level:2 vxid:32774 vxid_parent:32773 tag:ObjReason var:obj.reason type:b data:OK (isbin=0,len=3)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Date type:b data:Date: Thu, 03 Mar 2016 14:38:06 GMT (isbin=0,len=36)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Server type:b data:Server: Apache/2.4.10 (Debian) (isbin=0,len=31)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Last-Modified type:b data:Last-Modified: Thu, 03 Mar 2016 13:52:55 GMT (isbin=0,len=45)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.ETag type:b data:ETag: "2b60-52d2550c6c8ce-gzip" (isbin=0,len=32)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Vary type:b data:Vary: Accept-Encoding (isbin=0,len=22)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Content-Encoding type:b data:Content-Encoding: gzip (isbin=0,len=23)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Content-Length type:b data:Content-Length: 3078 (isbin=0,len=21)
level:2 vxid:32774 vxid_parent:32773 tag:ObjHeader var:obj.http.Content-Type type:b data:Content-Type: text/html (isbin=0,len=24)
level:2 vxid:32774 vxid_parent:32773 tag:Fetch_Body var: type:b data:3 length stream (isbin=0,len=16)
level:2 vxid:32774 vxid_parent:32773 tag:Gzip var: type:b data:u F - 3078 11104 80 80 24554 (isbin=0,len=29)
level:2 vxid:32774 vxid_parent:32773 tag:BackendReuse var: type:b data:18 boot.default (isbin=0,len=16)
level:2 vxid:32774 vxid_parent:32773 tag:Timestamp var: type:b data:BerespBody: 1457015886.972224 0.000827 0.000135 (isbin=0,len=48)
level:2 vxid:32774 vxid_parent:32773 tag:Length var: type:b data:3078 (isbin=0,len=5)
level:2 vxid:32774 vxid_parent:32773 tag:BereqAcct var: type:b data:141 0 141 283 3078 3361 (isbin=0,len=24)
level:2 vxid:32774 vxid_parent:32773 tag:End var: type:b data: (isbin=0,len=1)

the code is, as you were saying, really optimized to map VSL structures to log tags in Apache format, but it essentially does what ncsa does.

Indeed! Magnus modeled varnishkafka after varnishncsa, but added kafka + json formatting support.

Our idea is that we should be decoupled from Varnish development as much as possible

Sounds good; perhaps it won't matter, but one of the big reasons for doing it the way Magnus did was maximum performance on the varnishes. The varnish caches are crucial to WMF infrastructure, and Opsen at the time wanted to avoid adding as much extra overhead on them as we could. Going directly from VSL in memory to librdkafka queues was pretty efficient. VSL is already an in-memory queue, and the VSL API allows you to effectively follow entries in memory as if you were tailing them. If you introduce a middle layer, things could get more complicated and less efficient.

What about the idea of modifying varnishncsa for varnish 4 to add JSON formatting and librdkafka support?

What about the idea of modifying varnishncsa for varnish 4 to add JSON formatting and librdkafka support?

Definitely a good option. Now that we have validated our assumptions about how the new logging scheme works with python-varnishapi, it might be easier to just modify varnishkafka, even if the C code is heavy to read. I'll spend tomorrow and Monday trying to port varnishkafka and obtain minimal results; if I fail, we'll discuss other options. Sounds good?

the code is, as you were saying, really optimized to map VSL structures to log tags in Apache format, but it essentially does what ncsa does.

Indeed! Magnus modeled varnishkafka after varnishncsa, but added kafka + json formatting support.

While it is conceptually modelled on varnishncsa, at the time it was about more than adding Kafka/JSON support: Magnus also focused heavily on performance, at a time when varnishncsa was terrible at it (I'm unsure if it's better these days). varnishkafka was much more performant even when outputting NCSA-formatted lines to stdout. I don't recall the numbers, but I think varnishncsa (with UDP) was consuming significant amounts of CPU back then (double digits for sure).

Piping varnishncsa's output to a JSON encoder and a Kafka producer will definitely have a performance hit compared to varnishkafka, both because varnishncsa is probably unoptimized and because the data would have to be encoded twice and decoded once (not to mention that tabular formats are error-prone; it took us a major effort to get rid of them).

If you want to work on more than the short term (= porting varnishkafka to 4.0), that's highly appreciated of course :) Maintainability is a long-term concern and an important one, as important as performance for sure.

For this, my own recommendation would be to either add JSON & librdkafka support upstream, as @Ottomata suggested (librdkafka is widely supported these days, e.g. rsyslog and syslog-ng both have output plugins for it), or, alternatively, if upstream rejects that due to the added complexity, perhaps they could be persuaded of the usefulness of providing API and/or VSL guarantees to third parties so that alternative log processors can be more easily maintained. I'm hoping they won't say "no" to both expanding varnishncsa *and* keeping the VSL structures internal to Varnish.

Ja! Am happy to help too! I’ve poked around in that code a bit too, so I
might be able to offer some help.

@Ottomata and @faidon: thanks a lot for all the info!

For this, my own recommendation would be to either add JSON & librdkafka support upstream, as @Ottomata suggested (librdkafka is widely supported these days, e.g. rsyslog and syslog-ng both have output plugins for it), or, alternatively, if upstream rejects that due to the added complexity, perhaps they could be persuaded of the usefulness of providing API and/or VSL guarantees to third parties so that alternative log processors can be more easily maintained. I'm hoping they won't say "no" to both expanding varnishncsa *and* keeping the VSL structures internal to Varnish.

A bit of context that might have been lost in the phab task: all the Varnish 4 bin utils like varnishncsa are built upon what they call VUT (the Varnish Utils library), which is essentially a wrapper around the main APIs. To give you an idea, this is the code (a very simplified view) to print something like https://dpaste.de/16aO:

int myprint(struct VSL_data *vsl, struct VSL_transaction *const pt[], void *priv) {
    struct VSL_transaction *t;
    enum VSL_tag_e tag;
    const char *data;

    for (t = pt[0]; t != NULL; t = *++pt) {
        while (VSL_Next(t->c) == 1) {
            tag = VSL_TAG(t->c->rec.ptr);
            data = VSL_CDATA(t->c->rec.ptr);
            printf("%s: %s\n", VSL_tags[tag], data);
        }
    }
    return 0;
}

int main(int argc, char *argv[]) {
    VUT_Init("example");
    VUT.g_arg = VSL_g_request;
    VUT.dispatch_f = myprint;
    VUT_Setup();
    VUT_Main();
    VUT_Fini();
    return 0;
}

Source: https://www.varnish-cache.org/sites/default/files/Varnish%204%20Logging%20API.pdf

The code is super nice, but the main problem is that it is not officially exported as a library in Varnish 4, probably because it is considered an "internal lib" for the varnish binaries only. Ema and I have been able to add this library to varnishkafka and compile it successfully (without the vk parsing part, which is still work in progress), but we had to import a lot of .c files into varnishkafka's source. The final version will definitely be better, but at the moment it feels a bit like walking in mud: you can proceed, but your boots get less clean with each step :)

So, next steps:

  1. Short term: finish the work on porting varnishkafka with VUT and send a preliminary code review to get your feedback. We might eventually convince upstream to release VUT as a companion to the varnishapi. This might mean having a "fatter" varnishkafka for a transitional period (plus some occasional monkey patching here and there). We could also consider using the API directly without VUT, but that code is a bit more complicated and I prefer to focus now on finishing the port of the parsing/transformation part.
  2. Short term: @ema is going to check the varnish-python library (https://github.com/xcir/python-varnishapi/tree/varnish40) to evaluate its performance (it uses ctypes and the varnishapi C lib). There are also other tools that need to be migrated to Varnish 4, for example the one that pushes datapoints to statsd (varnishlog, if I remember correctly). This could be a good use case to see if varnishkafka could potentially be migrated to Python with an acceptable performance hit.
  3. Medium-long term: still to be decided, needs more datapoints. I agree 100% that varnishncsa could be extended with all the features that vk has except the Kafka part, which I don't think is entirely in the tool's scope (but we will definitely ask upstream).

Hope I made a good summary without saying anything super stupid; please correct me if anything is missing or if you need more info!

but we had to import a lot of .c files into varnishkafka's source.

An alternative, possibly more palatable than importing random C files to get VUT working in VK, is to simply import VK into the varnish source tree (that is, make it a patch we apply to our varnish package, such that VK is actually built in-tree like varnishncsa is).

Regardless of which way we go: does this create licensing issues (are vk and varnish license-compatible)? Keep in mind whatever we produce here isn't internal-only; we're also implicitly re-publishing the result as open source.

I think we have control over the varnishkafka license, so if we go that route and need to change it, we can.

Update:

  1. Completed the porting of the new tags. After a chat with Brandon and Ema, we decided to use only the "client" tags, discarding the backend ones, since we don't really need them (the new varnish logging makes a clear separation between the two). There will probably be some adjustments to make while testing with real data.
  2. I managed to remove the VUT dependency from vk (so no more random .c files imported from the Varnish source), but I now need to handle some getopt args (like -r to select a specific shm, etc.) that were previously handled automagically by VUT.

I will try to send a code review soon, but we are on the right track for the moment!

Change 276439 had a related patch set uploaded (by Elukey):
First draft for the Varnish 4 porting.

https://gerrit.wikimedia.org/r/276439

Status update:

DONE - Tested the code in mediawiki-vagrant and discovered some bugs, fixed in the latest patchset. varnishkafka is now stable at 0.3 CPU utilization and <= 2% memory utilization, even under load.

DONE - Ottomata tried the code but didn't manage to run it on mediawiki-vagrant due to an initialization error. It should be a minor bug, but I'll follow up with him to double check.

Next steps:

Change 274135 abandoned by Ema:
Preiliminary port to new VSL API

Reason:
Obsoleted by https://gerrit.wikimedia.org/r/#/c/276439/

https://gerrit.wikimedia.org/r/274135

Milimetric set the point value for this task to 34. (Mar 17 2016, 4:28 PM)

Change 280198 had a related patch set uploaded (by Elukey):
Varnish 4 API porting.

https://gerrit.wikimedia.org/r/280198

Change 280202 had a related patch set uploaded (by Ema):
Varnishkafka: use VSL query on Varnish 4

https://gerrit.wikimedia.org/r/280202

Change 280202 merged by Ema:
Varnishkafka: use VSL query on Varnish 4

https://gerrit.wikimedia.org/r/280202

Change 280452 had a related patch set uploaded (by Elukey):
Add a new configuration file specific for Varnish 4

https://gerrit.wikimedia.org/r/280452

Change 280452 merged by Elukey:
Add a new configuration file specific for Varnish 4

https://gerrit.wikimedia.org/r/280452

Change 280455 had a related patch set uploaded (by Elukey):
Update the varnishkafka module to add a Varnish 4 compatible configuration template.

https://gerrit.wikimedia.org/r/280455

Change 280455 merged by Elukey:
Update the varnishkafka module to add a Varnish 4 compatible configuration template.

https://gerrit.wikimedia.org/r/280455

Change 280459 had a related patch set uploaded (by Elukey):
Add correct varnishkafka configuration files for Varnish 4 servers.

https://gerrit.wikimedia.org/r/280459

Summary of last updates:

  1. The varnishkafka code changes in https://gerrit.wikimedia.org/r/#/c/276439/ and https://gerrit.wikimedia.org/r/#/c/276439/ have been packaged by ema and installed (with Varnish 4) on cp1043 (maps cache), and preliminary results are very encouraging. No memory leaks so far (previously noticed by me in Vagrant VM testing), but there is a small bug due to Varnish tag sequences like:
RespUnset      X-Analytics:
RespHeader     X-Analytics: ;https=1
RespUnset      X-Analytics: ;https=1
RespHeader     X-Analytics: ;https=1;nocookies=1
RespUnset      X-Analytics: ;https=1;nocookies=1
RespHeader     X-Analytics: https=1;nocookies=1

The current vk code picks only the first value and doesn't override it later on (or, more correctly, doesn't take RespUnset into account). BBlack patched the issue with https://gerrit.wikimedia.org/r/#/c/280612. The patch resembles what varnishncsa does, namely simply overriding tags by picking up the last value. The downside is that if a RespUnset is the last tag (theoretically cancelling all the previous ones), vk and ncsa will display the corresponding RespX tag anyway. We will try to fix this behavior, but for the moment BBlack's patch is more than enough for our purposes.
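The complete fix can be sketched by replaying the RespHeader/RespUnset sequence in order, so the last operation wins and a trailing RespUnset really removes the header (which last-value-wins alone would miss). `fold_headers` is a made-up helper for illustration, not vk code:

```python
# Fold a RespHeader/RespUnset event sequence into the final headers.
def fold_headers(events):
    headers = {}
    for op, name, value in events:
        if op == "RespHeader":
            headers[name] = value
        elif op == "RespUnset":
            headers.pop(name, None)
    return headers

events = [
    ("RespUnset",  "X-Analytics", ""),
    ("RespHeader", "X-Analytics", ";https=1"),
    ("RespUnset",  "X-Analytics", ";https=1"),
    ("RespHeader", "X-Analytics", ";https=1;nocookies=1"),
    ("RespUnset",  "X-Analytics", ";https=1;nocookies=1"),
    ("RespHeader", "X-Analytics", "https=1;nocookies=1"),
]
print(fold_headers(events))  # {'X-Analytics': 'https=1;nocookies=1'}
```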

  2. BBlack also sent some code reviews (https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/software/varnish/varnishkafka+branch:master+topic:276439,n,z) improving vk's code, which will need to be merged eventually (they would bring a lot of value in my opinion). Downside: no tests in the vk codebase means that every change is risky.

Change 280459 merged by Elukey:
Add correct varnishkafka configuration files for Varnish 4 servers.

https://gerrit.wikimedia.org/r/280459

Change 280651 had a related patch set uploaded (by Elukey):
Remove unused feature in varnishkafka for Varnish 4

https://gerrit.wikimedia.org/r/280651

Change 280651 merged by Elukey:
Remove unused feature in varnishkafka for Varnish 4

https://gerrit.wikimedia.org/r/280651

Change 280653 had a related patch set uploaded (by Elukey):
Update the varnishkafka module

https://gerrit.wikimedia.org/r/280653

Change 280653 merged by Elukey:
Update the varnishkafka module

https://gerrit.wikimedia.org/r/280653

Update: we added all the patches (mine + BBlack's) to the latest version of vk's deb package and installed it on the cp1043/cp1044 hosts (Maps cluster). No memory leaks registered, and the Analytics team double checked Kafka's topics to verify data integrity; everything looks good.

I would prefer to keep this task open for a bit to gather more data and finally merge the new code into the vk repo before closing.

Change 276439 merged by Ema:
Remove loglines cache to mitigate a possible memory leak.

https://gerrit.wikimedia.org/r/276439

Change 281413 had a related patch set uploaded (by Elukey):
Add info to the varnishkafka README after the Varnish 4 porting.

https://gerrit.wikimedia.org/r/281413

Change 281413 merged by Elukey:
Add info to the varnishkafka README after the Varnish 4 porting.

https://gerrit.wikimedia.org/r/281413

Code merged by ema, and the varnish maps cluster has been running with vk for days without any problems. Closing this task! We will deal with any issues that might arise in new tasks.