Page MenuHomePhabricator

Update node-rdkafka version to v2.x
Closed, ResolvedPublic

Description

There's a v2 of node-rdkafka released. It has some backwards incompatible changes as well as it requires librdkafka v 0.11.1

Here's what needs to be done:

Event Timeline

I think @elukey and @Ottomata have some plans around the librdkafka version that needs to be deployed fleet-wide, since there's an implicit dependency to the Kafka TLS work. Is node-rdkafka's dependency on 0.9.5 specifically, or >= 0.9? Can we use 0.11.0?

I can help with a backport if needed, but let's hear from those folks first regarding which version we'll need and how to best deploy it.

We'd have to check and see, but I believe that newer versions of librdkafka should work fine with any of the versions of Kafka that we use. Ideally all brokers would also be upgraded to latest 0.11 Kafka too, but I think it'll be a while before we get to the main clusters that change-prop uses.

Please don't install/deploy librdkafka 0.11 on nodes that will produce/consume to/from Kafka 0.9.X clusters since it will likely cause timeouts/lags on the brokers as happened in T172681 (unless explicitly configured to be compatible with 0.9.X, but it is not the default).

This is the main reason: https://github.com/edenhill/librdkafka/wiki/Broker-version-compatibility

So librdkafka 0.9.4 has all the right tunables for Kafka brokers <= 0.9.X, meanwhile more recent versions can be used for more recent Kafka clusters (like the Jumbo one).

We don't have a clear rollout/upgrade plan for librdkafka yet :)

Luca, I thought the ultimate problem was that we couldn't properly set the kafka api.version using the version pmacct we had?

I tried to force the Kafka API version via the above two patches but it doesn't work, the version of pmacct that supports librdkafka parameter settings is 1.6.2 and we have 1.6.1

If all clients set api.version to 0.9.x, it should be safe to use librdkafka 0.11 with 0.9 brokers, no?

Note that we are talking about librdkafka v0.9.5 here ;) node-rdkafka does not support the v0.11 version (yet).

Luca, I thought the ultimate problem was that we couldn't properly set the kafka api.version using the version pmacct we had?

I tried to force the Kafka API version via the above two patches but it doesn't work, the version of pmacct that supports librdkafka parameter settings is 1.6.2 and we have 1.6.1

If all clients set api.version to 0.9.x, it should be safe to use librdkafka 0.11 with 0.9 brokers, no?

Correct! I put a note " (unless explicitly configured to be compatible with 0.9.X, but it is not the default)" but I wanted to warn people since the last time it was a bit of nightmare to track it down :)

mobrovac renamed this task from Update node-rdkafka version to 2.0 to Update node-rdkafka version to v2.x.Nov 15 2017, 3:17 PM
mobrovac triaged this task as High priority.
mobrovac edited projects, added Services (blocked); removed Services (next).
mobrovac updated the task description. (Show Details)
mobrovac removed a subscriber: Aklapper.

@faidon back-ported librdkafka v0.11.1. Next step involves adapting ChangeProp to the new v2 driver and testing everything in Labs.

We've also got librdkafka 0.11 backported for Jessie in our apt repo now too. I don't see any blockers to using it. I've tested EventStreams with it locally.

It is also installed on cp1008.wikimedia.org (cache canary) and used by varnishkafka there.

Hm, so librdkafka 0.11 getting on beta might be the reason of T185016 ?

Change 404540 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Ensure specific librdkafka version for changeprop and eventstreams

https://gerrit.wikimedia.org/r/404540

We've also got librdkafka 0.11 backported for Jessie in our apt repo now too. I don't see any blockers to using it. I've tested EventStreams with it locally.

This caused librdkafka to get updated on SCB yesterday, and this new version is incompatible with our software stack running there. I think the wisest thing to do is to remove the new version from the repo until we are ready to switch. Otherwise we left vulnerable to CP and CP-JQ outages, and given all the travelling everyone is about to do, it doesn't seem wise to keep the status quo.

I'd prefer to pin the version in puppet, than restrict it everywhere for SCB. If we fix up that patch to something more acceptable, would you be ok with that @mobrovac?

I would prefer having a stable and known-to-work version everywhere rather than confining the problem to SCB

Yesterday during the deployment of JobQueue CP instance we've had a small incident because of the disparity between the librdkafka installed on the machines on SCB and node-rdkafka. So we've had to rever a commit in deploy repo that updated to node-rdkafka v2 https://gerrit.wikimedia.org/r/#/c/408441/

The code has been changed to revert the update to new node-rdkafka https://github.com/wikimedia/change-propagation/pull/230 so now the state of the code is pretty weird. We know change-prop works with librdkafka 0.11 and node-rdkafka v2, so we need to update SCB ASAP.

The only blocker right now is checking whether event streams service works. @Ottomata could you test event streams please so that we could coordinate the upgrade?

mobrovac updated the task description. (Show Details)

Mental note: we will also have to update librdkafka in Vagrant once all of the services are ported to node-rdkafka v2.

Change 408704 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Update version of kafkasse to 0.1.1 to use node-rdkafka 2.2.1 and librdkafka 0.11

https://gerrit.wikimedia.org/r/408704

Change 408704 merged by Ottomata:
[mediawiki/services/eventstreams@master] Update version of kafkasse to 0.1.1 to use node-rdkafka 2.2.1 and librdkafka 0.11

https://gerrit.wikimedia.org/r/408704

Change 408831 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set api.version.request: false for eventstreams

https://gerrit.wikimedia.org/r/408831

Change 408831 merged by Ottomata:
[operations/puppet@production] Set api.version.request: false for eventstreams

https://gerrit.wikimedia.org/r/408831

The SCB cluster has been upgraded today and the relevant services deployed (a big shout out to @Pchelolo and @Ottomata). The last piece of the puzzle is updating MediaWiki-Vagrant .

Change 404540 abandoned by Ottomata:
Ensure specific librdkafka version for changeprop and eventstreams

https://gerrit.wikimedia.org/r/404540

Vagrant is a bit specific. By default the node-rdkafka just clones librdkafka sources into a subfolder and then build librdkafka package from scratch, statically linking to it, so technically we don't even need to install librdkafka-dev package in Vagrant - everything works out of the box.

For building production deploys we override that behavior with a special environment variable and then dynamically link to librdkafka packages installed from the apt repo.

In order for Vagrant to match production better, we might consider setting the env variable there and depend on librdkafka installed from the apt repo as well, but that has it's own problems - Vagrant is on Debian scratch and we don't have librdkafka 0.11.3 packaged for scratch.

I'm not sure whether this worths an effort, @mobrovac @Ottomata what do you think?

Vagrant is on Debian scratch and we don't have librdkafka 0.11.3 packaged for scratch

You mean Stretch? And we do :)

BUt I also think it might not be worth the effort on mw-vagrant. mw-vagrant is a dev env, and it'd be good for individual dev packages to manage their own dependencies.

Agreed, that's what I've been thinking as well. So, resolving the ticket!