
Decide on librdkafka deployment model for k8s services
Open, Needs Triage · Public

Description

Background

We are preparing to move the changepropagation and cpjobqueue services off the SCB servers to K8s. The SCB servers will be decommissioned in April 2020. The initial approach was to use the Node 6 image from the registry (keeping the environment as close as possible to the current deployment, so that the migration is the only thing we have to worry about), install the librdkafka and librdkafka++ packages through apt, and then deploy the app code.

Node 6 and Node 8 both seem to have a problem with this setup: the service segfaults right away. After some investigation, we concluded it is likely not a configuration issue (such as a missing package or setting). What does work on Node 6 is to skip the .deb packages and build the librdkafka library that is bundled with the node package. The Node 10 image, on the other hand, which also comes with librdkafka 0.11.6, showed no issues when using the .deb packages.

This leaves us with two possible solutions:

  1. Use the Node 10 image instead
  2. Stay on Node 6 and compile the librdkafka library that comes with the node-rdkafka package

Purpose

We would like input from ops on which option is preferred going forward. The move to Node 10 is already planned for as soon as the migration is completed. Doing it now carries a temporarily higher risk, since we would be changing platforms, libraries, and Node versions at the same time.

The second approach allows us to stay on Node 6 until after the migration. It also has benefits beyond that, since newer node-rdkafka releases bundle newer versions of the librdkafka library. Only two node-rdkafka releases support 0.11.6: 2.6.1 and 2.7.0. The latest, 2.7.4, is bundled with 1.2.2. Building the bundled library would let us take advantage of improvements without depending on when the upstream packages are updated. Also, since the library is built once in a build container to create the app image, it fits the K8s/Docker paradigm.
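As a rough sketch of how the version pairing above could be checked when the service starts, the snippet below records only the versions quoted in this task and compares them against what the loaded binding reports. The helper names are hypothetical, and reading `librdkafkaVersion` from the module returned by `require("node-rdkafka")` is an assumption that should be verified against the node-rdkafka release actually pinned.

```javascript
// Bundled librdkafka versions, taken only from the figures quoted in this task.
const BUNDLED_LIBRDKAFKA = {
  "2.6.1": "0.11.6",
  "2.7.0": "0.11.6",
  "2.7.4": "1.2.2",
};

// Returns the librdkafka version paired with a node-rdkafka release,
// or null when this task does not mention that release.
function bundledLibrdkafka(nodeRdkafkaVersion) {
  return Object.prototype.hasOwnProperty.call(BUNDLED_LIBRDKAFKA, nodeRdkafkaVersion)
    ? BUNDLED_LIBRDKAFKA[nodeRdkafkaVersion]
    : null;
}

// Throws if the loaded binding reports a different librdkafka than expected.
// `kafka` is whatever require("node-rdkafka") returned; the property name
// `librdkafkaVersion` is an assumption to confirm for the pinned release.
function assertLinkedVersion(kafka, expected) {
  const actual = kafka.librdkafkaVersion;
  if (actual !== expected) {
    throw new Error(`librdkafka ${actual} is linked, expected ${expected}`);
  }
  return actual;
}
```

A check like this would fail fast at startup if the image ended up linking against the apt-installed library rather than the bundled build, or the other way around.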

We would like to hear what everybody thinks about each approach and which one is preferred.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 14 2020, 11:17 PM

I'm for it!

I'd like to be able to use the librdkafka version that makes sense for the service, especially now that we have containerization. Depending on the .deb package can slow things down.

BTW, I've got a patch in review for EventStreams helm charts: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551843, you might be able to base some work off of it.

Unrelated:
When writing the CP helm chart(s), if you don't update to use the new WIP service-runner based Prometheus metrics, you'll have to write statsd-exporter regex rules to convert your existing statsd metrics into Prometheus ones, which isn't really that simple. You might want to consider basing your change on the WIP service-runner stuff so you don't have to do that. This will probably result in new metrics in Prometheus, which means new dashboards and alerts, but at least you can deploy to k8s and get that stuff working before turning off the SCB ones.
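To give a feel for what those regex rules have to do, here is a minimal sketch of the translation from a dotted statsd name to a Prometheus-style name plus labels. This is not statsd-exporter's actual configuration format, and the metric names, rule table, and `mapMetric` helper are all invented for illustration.

```javascript
// Hypothetical mapping rules: every metric name here is invented for illustration.
const RULES = [
  {
    // e.g. "changeprop.page_edit.exec": the middle segment becomes a label
    match: /^changeprop\.([^.]+)\.exec$/,
    name: "changeprop_rule_exec",
    labels: (m) => ({ rule: m[1] }),
  },
  {
    // e.g. "changeprop.page_edit.retry.3": two segments become labels
    match: /^changeprop\.([^.]+)\.retry\.(\d+)$/,
    name: "changeprop_rule_retry",
    labels: (m) => ({ rule: m[1], attempt: m[2] }),
  },
];

// Returns { name, labels } for the first matching rule, or null when no
// rule matches and the metric would have to be handled some other way.
function mapMetric(statsdName) {
  for (const rule of RULES) {
    const m = rule.match.exec(statsdName);
    if (m) {
      return { name: rule.name, labels: rule.labels(m) };
    }
  }
  return null;
}
```

Every distinct shape of statsd name needs its own rule like the ones above, which is where the effort goes.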