
Check if node-rdkafka's version on changeprop can be upgraded from 2.8.1
Closed, Duplicate (Public)

Description

The version of node-rdkafka used in changeprop should be 2.8.1, released more than 3 years ago. The latest upstream release is 2.16.1, which supports a more up-to-date version of librdkafka and possibly has many performance bugs fixed.

List of tags: https://github.com/Blizzard/node-rdkafka/tags

We should review the changelogs, and figure out:

  1. If the recent versions are compatible with our nodejs version.
  2. If anything major prevents us from upgrading.
  3. What the best upgrade path is.
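For point 1, a minimal Python sketch of the version comparison is shown here. It is only an illustration: `parse_version` and `satisfies_minimum` are hypothetical helpers, and the minimum versions used below are placeholders, not values taken from node-rdkafka. The real minimum would have to be read from each release's package.json (its `engines` field, if the release declares one).

```python
def parse_version(text):
    """Turn 'v10.24.1' or '10.24' into a 3-tuple of ints, e.g. (10, 24, 1)."""
    parts = [int(part) for part in text.lstrip("v").split(".")]
    # Pad short versions such as '12' so that tuples compare field by field.
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def satisfies_minimum(runtime, minimum):
    """Return True if the runtime version is at least the declared minimum."""
    return parse_version(runtime) >= parse_version(minimum)


if __name__ == "__main__":
    # Placeholder inputs: the Node.js version we run and a candidate minimum.
    print(satisfies_minimum("v10.24.1", "6.0.0"))
    print(satisfies_minimum("v10.24.1", "12"))
```

A declared minimum only tells us whether the package claims to support our runtime, so testing in staging is still needed.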

Event Timeline

Aklapper renamed this task from "Check if node-rdkafka's version on changeprop can be upgraded" to "Check if node-rdkafka's version on changeprop can be upgraded from 2.8.1". Jul 5 2023, 6:22 PM

https://github.com/Blizzard/node-rdkafka/tree/master/deps seems to indicate that librdkafka is shipped as a build dependency (this is consistent with what I see in changeprop's blubber config, since we don't explicitly install the library).

Pretty sure there is a configurable env var BUILD_LIBRDKAFKA that can conditionally disable this. eventgate-wikimedia installs the librdkafka1 .deb and sets BUILD_LIBRDKAFKA=0

https://github.com/Blizzard/node-rdkafka/blob/8429f0cd84a506efde5919058027f4a7346c201e/binding.gyp#L4

Change 937894 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] changeprop: bump node-rdkafka, use buster base

https://gerrit.wikimedia.org/r/937894

Change 937894 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: bump node-rdkafka, use buster base

https://gerrit.wikimedia.org/r/937894

@hnowlan I checked, and the Kafka client should be safe to upgrade: it doesn't use ZooKeeper or any other legacy feature, so I think we can proceed with deploying to codfw if you are OK with it!

Change 941780 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: bump node-rdkafka, use buster base (prod version)

https://gerrit.wikimedia.org/r/941780

Change 941780 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: bump node-rdkafka, use buster base (prod version)

https://gerrit.wikimedia.org/r/941780

We noticed a sharp and sustained increase in CPU usage after the deployment to prod codfw, and we rolled it back.

The issue went unnoticed, but it is present in staging too:

Screenshot from 2023-07-27 17-48-08.png (2×1 px, 309 KB)

We deployed on the 13th, so the problem definitely shows up even without load (staging doesn't really process any traffic).

I tried to use perf on kubestage1004 and I got the following stack trace:

#
# Children      Self       Samples  Command  Shared Object      Symbol                                                         
# ........  ........  ............  .......  .................  ...............................................................
#
    99.07%    99.07%             1  node     libnode.so.64      [.] v8::internal::Factory::NewJSArray
            |
            ---0x100000000
               uv__server_io
               node::ConnectionWrap<node::TCPWrap, uv_tcp_s>::OnConnection
               node::AsyncWrap::MakeCallback
               node::InternalMakeCallback
               v8::Function::Call
               v8::internal::Execution::Call
               0x7fdee1d2504d
               0x7fdee1d24b2c
               0x348ef3de3ea1
               0x348ef3d1bd8f
               0x348ef3d0de09
               0x348ef3d051ed
               0x348ef3d171e3
               0x348ef3dd464b
               0x7fdee1aa2687
               v8::internal::Factory::NewJSArray

    99.07%     0.00%             0  node     [unknown]          [.] 0x0000000100000000
            |
            ---0x100000000
               uv__server_io
               node::ConnectionWrap<node::TCPWrap, uv_tcp_s>::OnConnection
               node::AsyncWrap::MakeCallback
               node::InternalMakeCallback
               v8::Function::Call
               v8::internal::Execution::Call
               0x7fdee1d2504d
               0x7fdee1d24b2c
               0x348ef3de3ea1
               0x348ef3d1bd8f
               0x348ef3d0de09
               0x348ef3d051ed
               0x348ef3d171e3
               0x348ef3dd464b
               0x7fdee1aa2687
               v8::internal::Factory::NewJSArray

    99.07%     0.00%             0  node     libuv.so.1.0.0     [.] uv__server_io
            |
            ---uv__server_io
               node::ConnectionWrap<node::TCPWrap, uv_tcp_s>::OnConnection
               node::AsyncWrap::MakeCallback
               node::InternalMakeCallback
               v8::Function::Call
               v8::internal::Execution::Call
               0x7fdee1d2504d
               0x7fdee1d24b2c
               0x348ef3de3ea1
               0x348ef3d1bd8f
               0x348ef3d0de09
               0x348ef3d051ed
               0x348ef3d171e3
               0x348ef3dd464b

This is very generic and doesn't point in a specific direction, but I wonder whether the new node-rdkafka client doesn't play nicely with nodejs-10.
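To put a number on how little the trace resolves, here is a minimal Python sketch (not part of the original investigation). `FRAMES` is copied from the first call chain in the perf output above, and `split_frames` is a hypothetical helper that separates frames with a symbol name from raw addresses. Note also that the Samples column shows 1 for the top entry, so the 99.07% figure rests on a single sample.

```python
import re

# Frames copied from the first call chain in the perf output above.
FRAMES = [
    "0x100000000",
    "uv__server_io",
    "node::ConnectionWrap<node::TCPWrap, uv_tcp_s>::OnConnection",
    "node::AsyncWrap::MakeCallback",
    "node::InternalMakeCallback",
    "v8::Function::Call",
    "v8::internal::Execution::Call",
    "0x7fdee1d2504d",
    "0x7fdee1d24b2c",
    "0x348ef3de3ea1",
    "0x348ef3d1bd8f",
    "0x348ef3d0de09",
    "0x348ef3d051ed",
    "0x348ef3d171e3",
    "0x348ef3dd464b",
    "0x7fdee1aa2687",
    "v8::internal::Factory::NewJSArray",
]

# A frame that is only a hex address was not resolved to a symbol by perf.
HEX_ADDRESS = re.compile(r"^0x[0-9a-fA-F]+$")


def split_frames(frames):
    """Return (symbolised, unresolved) lists, preserving the original order."""
    symbolised = [f for f in frames if not HEX_ADDRESS.match(f)]
    unresolved = [f for f in frames if HEX_ADDRESS.match(f)]
    return symbolised, unresolved


if __name__ == "__main__":
    symbolised, unresolved = split_frames(FRAMES)
    print(f"{len(symbolised)} symbolised, {len(unresolved)} unresolved")
```

Only 7 of the 17 frames carry a symbol name, and the unresolved ones sit between `v8::internal::Execution::Call` and `v8::internal::Factory::NewJSArray`, which is where JIT-compiled JavaScript would run. As an untested suggestion, starting node with the `--perf-basic-prof` flag may let perf name those frames, since V8 then writes a symbol map file that perf can read.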

Change 942634 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update changeprop's staging docker image

https://gerrit.wikimedia.org/r/942634

Change 942634 merged by Elukey:

[operations/deployment-charts@master] services: update changeprop's staging docker image

https://gerrit.wikimedia.org/r/942634

I reverted the OS + node-rdkafka upgrade and went for the OS upgrade only (stretch to buster). In staging the CPU usage dropped and the network usage grew, so it is definitely an issue with node10 and the newer node-rdkafka:

Screenshot from 2023-07-28 14-43-40.png (1×2 px, 62 KB)

Screenshot from 2023-07-28 14-44-01.png (1×2 px, 96 KB)

Change 942658 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/change-propagation@master] blubber: update to Debian Bullseye and Nodejs 12

https://gerrit.wikimedia.org/r/942658

Next steps:

  • roll out changeprop on buster and node10
  • test changeprop on bullseye + node12 in staging, and roll it out to prod if all goes well
  • test node-rdkafka again on node12

Change 943037 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: upgrade changeprop instances to Buster

https://gerrit.wikimedia.org/r/943037

Change 943037 merged by Elukey:

[operations/deployment-charts@master] services: upgrade changeprop instances to Buster

https://gerrit.wikimedia.org/r/943037

Change 942658 abandoned by Elukey:

[mediawiki/services/change-propagation@master] blubber: update to Debian Bullseye and Nodejs 12

Reason:

We should probably go to nodejs-18

https://gerrit.wikimedia.org/r/942658