Page MenuHomePhabricator

Varnishkafka does not play well with varnish 5.2
Closed, ResolvedPublic

Description

Varnishkafka does not play well with varnish 5.2

Event Timeline

ping @elukey is this still true? I though we migrated to varnish 5 awhile back

Yeah, we upgraded to 5.1.3 but 5.2 is completely different:

https://varnish-cache.org/docs/5.2/whats-new/changes-5.2.html#vsm-vsc-api-changes

There are some changes in the way the shared log is handled, therefore varnishkafka needs to change accordingly. They should have released a library to handle most of the work though (VUT), that previously was available only for their internal tools like varnishlog, so it shouldn't be that hard this time (last famous words).

The traffic team is aware of this issue, they will ping us ahead of time when/if the migration to 5.2 will happen :)

Milimetric triaged this task as Medium priority.Apr 2 2018, 3:33 PM
Milimetric triaged this task as Medium priority.
Milimetric moved this task from Wikistats to Operational Excellence on the Analytics board.
Milimetric added a subscriber: Ottomata.

From what I can see, the change seem pretty straightforward (call to VUT_Init/VUT_Setup to replace all non kafka setup and the VUT_Main as main loop) except for the polling of kafka (call to rd_kafka_poll). There doesn't seem to be a hook for this kind of periodic stuff in the VUT API.

The simplest would be to create another thread for this polling since "VUT_Main" is the main loop and join the threads on exit before the call to rd_kafka_destroy. Since we will be switching to varnish 5.2 soon, we might start working on the change before you do. Will keep you informed of things on our side when I have more info.

I am trying to port it right now, and as always with software I discover new stuff as I go along.
Need to review argument parsing (luckily you decided to be 1-1 compatible with varnishncsa so it saves a lot of effort).
Need to review signal handling too, this might get tricky since VUT is handling signals and notifying the code via callback, but from what I see SIG_PIPE is not used by VUT so we can leave it disabled. SIG_INT SIG_TERM SIG_HUP and SIG_USR1 are handled by VUT so a single callback will have to perform stuff for varnishkafka.

Apart from that a lot of code will go away for setup and handling of varnish logs, which is good.

Thanks a lot for all the research work, I am following your progress and I'll try to dedicate some time in reviewing the code during this month!

Thanks, however we are hitting an issue where with varnish 5.2 we are getting Segfaults... We tried two paths:

  1. porting the changes in VSL/VSM APIs
  2. changing to VUT

both gets us segfaults so something undefined changed in the API that does not bode well with varnishkafka. I tracked it to https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1355-L1356 where the condition is true but ptr is null. so the memcpy later does a null-pointer dereference. I tried some asserts in match_assign but nothing matches.

Found it... it was my fault no issue with the API.

I also found out that there is a "vut->idle_f" callback that can be set for rd_kafka_poll so it will be easier than what I thought no need to different thread.

I created a review https://gerrit.wikimedia.org/r/#/c/430069/ I am still lacking some testing particularly with regards to kafka handling.

According to my testing it is ok.
I updated my older review for format allocation... forgot about that one.

First part of the test done today, basically adapting our integration testing suite:

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka#Testing_a_code_change

Those are basically two tests running with a simple varnish -> apache configuration, and grabbing varnishkafka's output checking that what's emitted is the expected result. One test also runs valgrind, the report looks ok:

==418== LEAK SUMMARY:
==418==    definitely lost: 688 bytes in 1 blocks
==418==    indirectly lost: 0 bytes in 0 blocks
==418==      possibly lost: 1,120 bytes in 1 blocks
==418==    still reachable: 8,629 bytes in 66 blocks
==418==         suppressed: 0 bytes in 0 blocks
==418==

Some of the above warnings are not actually a big issue, and the numbers stay constants varying the number of requests (I wanted to make sure that no accidental leak was added).

Last test will be adding Kafka to the mix, will try to do it asap but for the moment everything looks good (I might also just add directly the integration test in the suite).

I tested varnishkafka with Varnish 6.0, got from https://packagecloud.io/varnishcache/varnish60/install

Tested with varnish 6.0 + https://docs.confluent.io/current/quickstart/cos-quickstart.html on a Vagrant VM, everything seems ok after multiple stress tests. I think that we are ok to merge :)

Change 442046 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/software/varnish/varnishkafka@master] Update wikimedia's copiright and current branch policy

https://gerrit.wikimedia.org/r/442046

Change 442046 merged by Elukey:
[operations/software/varnish/varnishkafka@master] Update wikimedia's copiright and current branch policy

https://gerrit.wikimedia.org/r/442046

Created the new varnishv51 branch and added some info in the README about the new branch policy. I think that everything is done, thanks a lot @R4q3NWnUx2CEhVyr for the work and the patience!