Page MenuHomePhabricator

Upgrade change propagation to nodejs18
Closed, ResolvedPublic

Description

ChangeProp runs nodejs 10 and it should be moved to something more recent.

  • File change that passes unit tests
  • Deploy and test in Wikikube staging
  • Deploy to changeprop-prod eqiad
  • Deploy to changeprop-jobqueue eqiad
  • Deploy to changeprop-prod codfw
  • Deploy to changeprop-jobqueue codfw
  • Deploy changeprop in beta

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 966029 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/change-propagation@master] Upgrade to nodejs 18 on Debian Bookworm

https://gerrit.wikimedia.org/r/966029

npm warnings in CI:

#15 10.83 npm WARN EBADENGINE Unsupported engine {
#15 10.83 npm WARN EBADENGINE   package: 'eslint-plugin-jsdoc@39.2.2',
#15 10.83 npm WARN EBADENGINE   required: { node: '^14 || ^16 || ^17' },
#15 10.83 npm WARN EBADENGINE   current: { node: 'v18.13.0', npm: '9.2.0' }
#15 10.83 npm WARN EBADENGINE }
#15 10.83 npm WARN EBADENGINE Unsupported engine {
#15 10.83 npm WARN EBADENGINE   package: '@es-joy/jsdoccomment@0.23.6',
#15 10.83 npm WARN EBADENGINE   required: { node: '^12 || ^14 || ^16 || ^17' },
#15 10.83 npm WARN EBADENGINE   current: { node: 'v18.13.0', npm: '9.2.0' }
#15 10.83 npm WARN EBADENGINE }
#15 13.13 npm WARN deprecated kad-memstore@0.0.1: This package is no longer maintained.
#15 13.25 npm WARN deprecated kad-fs@0.0.4: This package is no longer maintained.
#15 13.39 npm WARN deprecated har-validator@5.1.5: this library is no longer supported
#15 13.59 npm WARN deprecated @hapi/bourne@1.3.2: This version has been deprecated and is no longer supported or maintained
#15 15.60 npm WARN deprecated request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
#15 15.68 npm WARN deprecated uuid@3.4.0: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.
#15 15.69 npm WARN deprecated json-schema-ref-parser@7.1.4: Please switch to @apidevtools/json-schema-ref-parser
#15 15.71 npm WARN deprecated uuid@3.4.0: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.
#15 15.78 npm WARN deprecated uuid@3.4.0: Please upgrade  to version 7 or higher.  Older versions may use Math.random() in certain circumstances, which is known to be problematic.  See https://v8.dev/blog/math-random for details.

Not a big issue but some of them could be easily fixed.

Change 966029 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Upgrade to nodejs 18 on Debian Bookworm

https://gerrit.wikimedia.org/r/966029

Change 967957 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: Update Docker images of change-prop services

https://gerrit.wikimedia.org/r/967957

Change 967957 merged by Elukey:

[operations/deployment-charts@master] services: Update Docker images of change-prop services

https://gerrit.wikimedia.org/r/967957

Deployment to staging was fine, no errors in the logs etc.. The only thing that I noticed is:

https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&from=1698074820470&to=1698077844470&viewPanel=54

There is an increase in cpu usage, same issue registered in T341140#9048907.

I tried to profile nodejs' code via -prof and --prof-process to have a better view of the CPU usage. I tried first with perf but I didn't obtain useful info for the JS code, so I opted for the default profiler.

What I did was:

  • Edit the changeprop's Deployment resource in staging, adding to the node's args --prof, and changing the workdir to /tmp.
  • After a while, run /bin/node --prof-process /tmp/isolate-0x55d7f3e42810-143-v8-143.log. In this case, "143" is the pid of service-runner.js, called by /srv/service/server.js (pid 1 in this case).
  • The full output is https://phabricator.wikimedia.org/P53042

Some observations:

  • The majority of the "ticks" are related to libc, and after some reading it is most of the times due to calls to epoll_wait. I verified this via strace and indeed it seems the case. Moreover the increase in CPU was user not system, so I discarded the "shared libraries" values.
  • Among the Javascript ticks, only one caught my eye:
[JavaScript]:
  ticks  total  nonlib   name
  1453    0.2%   18.6%  LazyCompile: *listOnTimeout node:internal/timers:512:25
   823    0.1%   10.6%  LazyCompile: *processTimers node:internal/timers:492:25
   475    0.0%    6.1%  LazyCompile: *<anonymous> /srv/service/lib/base_executor.js:237:18
   338    0.0%    4.3%  LazyCompile: *insert node:internal/timers:355:16
   332    0.0%    4.3%  LazyCompile: *Promise._settlePromise0 /srv/service/node_modules/bluebird/js/release/promise.js:644:45
   316    0.0%    4.1%  LazyCompile: *Promise._then /srv/service/node_modules/bluebird/js/release/promise.js:248:36
   312    0.0%    4.0%  LazyCompile: *ret :4:28
   268    0.0%    3.4%  LazyCompile: *<anonymous> /srv/service/lib/kafka_factory.js:43:46
   221    0.0%    2.8%  LazyCompile: *<anonymous> /srv/service/node_modules/node-rdkafka/lib/kafka-consumer.js:466:56

In /srv/service/lib/kafka_factory.js:43:46 We have the code that we call periodically to check the librdkafka status (like delivery notification etc..):

this.on('ready', () => {
    this._pollInterval = setInterval(() => this.poll(), 10);
});

We use pollInterval in various places, and IIUC it forces to check delivery status etc.. every 10ms. So multiple functions are executed 10ms to check various statuses of noderdkafka/librdkafka, and we have upgraded the library as part of the move to Node 18. I am pretty sure that experimenting with other values may lead to better CPU usage, but I am not 100% sure what the repercussions would be.

From the CPU graphs, a single changeprop nodejs container jumped from ~70/80ms of CPU usage to ~150ms, but way below its limit. I am wondering if spending a ton of time debugging this issue is worth it, or if this performance "degradation" is mild and can be tolerated.

Profiled changeprop on nodejs 10 in staging as well: https://phabricator.wikimedia.org/P53054

One thing that I noticed is this:

node 18:

ticks  total  nonlib   name
1453    0.2%   18.6%  LazyCompile: *listOnTimeout node:internal/timers:512:25
 823    0.1%   10.6%  LazyCompile: *processTimers node:internal/timers:492:25
 475    0.0%    6.1%  LazyCompile: *<anonymous> /srv/service/lib/base_executor.js:237:18
 338    0.0%    4.3%  LazyCompile: *insert node:internal/timers:355:16
 332    0.0%    4.3%  LazyCompile: *Promise._settlePromise0 /srv/service/node_modules/bluebird/js/release/promise.js:644:45
 316    0.0%    4.1%  LazyCompile: *Promise._then /srv/service/node_modules/bluebird/js/release/promise.js:248:36
 312    0.0%    4.0%  LazyCompile: *ret :4:28
 268    0.0%    3.4%  LazyCompile: *<anonymous> /srv/service/lib/kafka_factory.js:43:46

node 10

ticks  total  nonlib   name
 100    0.1%    4.8%  Builtin: CallFunction_ReceiverIsAny
  95    0.1%    4.5%  LazyCompile: *peg$fail /srv/service/node_modules/template-expression-compiler/ExpressionParser.js:258:22
  85    0.1%    4.1%  Builtin: InterpreterEntryTrampoline
  74    0.1%    3.5%  LazyCompile: *ontimeout timers.js:429:19
  69    0.0%    3.3%  Builtin: LoadIC
  56    0.0%    2.7%  LazyCompile: *emitBeforeScript internal/async_hooks.js:340:26
  46    0.0%    2.2%  LazyCompile: *consumer.consumeAsync.then.catch.finally /srv/service/lib/base_executor.js:237:18
  44    0.0%    2.1%  Builtin: KeyedLoadIC_Megamorphic
  43    0.0%    2.1%  Builtin: KeyedLoadIC
  39    0.0%    1.9%  Builtin: CompileLazy
  36    0.0%    1.7%  LazyCompile: *remove internal/linkedlist.js:15:16
  29    0.0%    1.4%  Builtin: Call_ReceiverIsAny
  28    0.0%    1.3%  LazyCompile: *Promise._settlePromise /srv/service/node_modules/bluebird/js/release/promise.js:577:44
  27    0.0%    1.3%  LazyCompile: *processTimers timers.js:220:23
  27    0.0%    1.3%  Builtin: StoreIC
  26    0.0%    1.2%  LazyCompile: *ret :4:28
  24    0.0%    1.1%  Stub: JSEntryStub
  24    0.0%    1.1%  LazyCompile: *_pollInterval.setInterval /srv/service/lib/kafka_factory.js:43:46
  23    0.0%    1.1%  Builtin: JSEntryTrampoline

The *listOnTimeout node:internal/timers:512:25 top value on node 18 is very suspicious, I am wondering if the 10ms poll time that I mentioned above is now heavier on node 18 due to how timers have changed from 10 to 18.

Change 968986 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/change-propagation@master] kafka_factory: increase poll interval

https://gerrit.wikimedia.org/r/968986

Change 968986 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] kafka_factory: increase poll interval

https://gerrit.wikimedia.org/r/968986

Found a way to generate Flame Graphs:

  • nodejs 10 version:
  • nodejs 18 version:

Procedure:

  • Added the following to nodejs --perf-basic-prof --no-turbo-inlining --interpreted-frames-native-stack
  • Ran sudo perf record -F 99 -p $pid -g -- sleep 30s ($pid is the pid of the nodejs process in the root pid namespace)
  • Copied (via docker cp) the /tmp/perf-$pid.map files from the container to the /tmp dir of kubestage1004 (the $pid inside the container is of course different from the one assigned in the root pid namespace, so you need to adapt it).
  • Ran sudo perf script --header> stacks.$pid.out
  • Cloned the FlameGraph github repo and ran ./stackcollapse-perf.pl < /tmp/stacks.2580321.out | ./flamegraph.pl > ~/out.stacks.$pid.svg

Tested event handling of the Lift Wing rules in staging, everything looks good afaics.

Change 969339 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/change-propagation@master] kafka_factory: add handler for debug events

https://gerrit.wikimedia.org/r/969339

After reading the always enlightening page https://www.brendangregg.com/perf.html I tried to run perf with the LBR (--call-graph lbr), as attempt to figure out what those unknowns are in the flame graph, and I got this:

The above is for nodejs-18 + node-rdkafka 2.10. The librdkafka build uses optimizations (afaics it is not configurable) and so the frame pointer may be omitted, ending up in the unknowns. Anyway, it is clear that the majority of the cpu time is spent in librdkafka.

Some general notes:

  • We jumped from librdkafka 1.3 to 2.3 with the node-rdkafka bump. It is a big jump, but I checked changelogs and I didn't see anything useful to test for our use case.
  • Downgrading node-rdkafka to 2.8.1 (the original version before the node 18 upgrade) ends up in compilation failures, afaics due to the incompatibility of the module with nodejs-18.

Change 969339 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] kafka_factory: add handler for producer/consumer debug events

https://gerrit.wikimedia.org/r/969339

Deployed the new image to staging, and added the debug:all settings for consumer and producer. What I see is that librdkafka keeps trying to fetch data from various topics, and I think it is mostly due to:

_consume() {
    if (!this._connected) {
        return;
    }

    this._consuming = true;
    this.consumer.consumeAsync(this.consumerBatchSize)
    .then((messages) => {
        if (!messages.length) {
            // No new messages, delay a bit and try again.
            return P.delay(100);
        }

We use http://bluebirdjs.com/docs/api/promise.promisifyall.html to make the node-rdkafka's consumer async, and we keep try to consume messages every 100ms if nothing is provided by the topic. In staging this means a lot of back and forth, that could explain the increase in CPU time that we are seeing. The last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18, so the current way of implementing async in changeprop could very well be not performant / outdated.

Change 969757 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update the ChangeProp staging's docker image

https://gerrit.wikimedia.org/r/969757

Change 969758 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: update ChangeProp's eqiad Docker image

https://gerrit.wikimedia.org/r/969758

last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18

FWIW, swapping out any direct usages of Bluebird in change-prop code should be relatively easy: just use Promise instead. IIUC, the API should be the same.

Swapping it out in other libs (e.g. preq, service-runner) will take more work.

Change 969757 merged by Elukey:

[operations/deployment-charts@master] services: update the ChangeProp staging's docker image

https://gerrit.wikimedia.org/r/969757

last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18

FWIW, swapping out any direct usages of Bluebird in change-prop code should be relatively easy: just use Promise instead. IIUC, the API should be the same.

Swapping it out in other libs (e.g. preq, service-runner) will take more work.

The current cp consumer is like this:

createConsumer(groupId, topics, metrics, logger) {
    const conf = Object.assign({}, this._consumerConf);
    conf['group.id'] = groupId;
    conf['client.id'] = `${Math.floor(Math.random() * 1000000)}`;

    return new P((resolve, reject) => {
        const consumer = new kafka.KafkaConsumer(conf, this._consumerTopicConf);
        consumer.on('event.log', e => logger.log('debug/consumer', {
            message: e
        }));
        consumer.connect(undefined, (err) => {
            if (err) {
                return reject(err);
            }
            consumer.subscribe(topics);
            resolve(P.promisifyAll(consumer));
        });
    });
}

The promisifyAll(consumer) call IIUC creates some Async functions (like AsyncConsume etc..) that are used in the codebase instead of consume directly :(

Change 969765 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/change-propagation@master] kafka_factory: simplify librdkafka logging

https://gerrit.wikimedia.org/r/969765

The promisifyAll(consumer) call IIUC creates some Async functions

Looks like only:

  • consumer.consumeAsync
  • consumer.commitMessageAsync
  • consumer.disconnectAsync

are used, so you could just call native promisify on these manually and assign them to consumer before returning it. E.g.

consumer.consumeAsync = promisify(consumer.consume);
consumer.commitMessageAsync = promisify(consumer.commitMessage);
consumer.disconnectAsync = promisify(consumer.disconnect);
resolve(consumer)

?

But also yea, this comsumer stuff could be moved into node-rdkafka-factory and shared.

Change 969765 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] kafka_factory: simplify librdkafka logging

https://gerrit.wikimedia.org/r/969765

Change 970276 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: allow to specify consumer/producer kafka settings

https://gerrit.wikimedia.org/r/970276

Change 970276 merged by Elukey:

[operations/deployment-charts@master] changeprop: allow to specify consumer/producer kafka settings

https://gerrit.wikimedia.org/r/970276

Change 969758 merged by Elukey:

[operations/deployment-charts@master] services: update ChangeProp's eqiad Docker image

https://gerrit.wikimedia.org/r/969758

Change 971113 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: allow to define Kafka settings for Job Queues

https://gerrit.wikimedia.org/r/971113

Change 971114 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: upgrade changeprop jobqueue eqiad's docker image

https://gerrit.wikimedia.org/r/971114

Change 971225 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] changeprop: set num_workers to zero

https://gerrit.wikimedia.org/r/971225

Change 971114 merged by Elukey:

[operations/deployment-charts@master] services: upgrade changeprop jobqueue eqiad's docker image

https://gerrit.wikimedia.org/r/971114

Change 971225 merged by Elukey:

[operations/deployment-charts@master] changeprop: set num_workers to zero

https://gerrit.wikimedia.org/r/971225

Change 974476 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: bump cpu limits and Docker images for cp instances

https://gerrit.wikimedia.org/r/974476

Change 974476 merged by Elukey:

[operations/deployment-charts@master] services: bump cpu limits and Docker images for cp instances

https://gerrit.wikimedia.org/r/974476

All changeprop instances (except beta) run nodejs 18 and the last version of node-rdkafka. As explained above, there was an increase in cpu usage that seems to be related to how librdkafka works with node 18 (see above flame graphs for more info). No regression in change propagation's efficiency was registered (high backlogs, etc..) so I think the upgrade was a success (overall).

Last steps:

  • Deploy the cpu limit increases in eqiad instances (Will do it next week)
  • Deploy the new change-prop version in beta
elukey claimed this task.
elukey updated the task description. (Show Details)

Beta deployed by @Ottomata as part of T351247 (thanks!)

I only deployed changeprop, not cpjobqueue.

Change 971113 merged by Elukey:

[operations/deployment-charts@master] changeprop: refactor templating for Kafka producer/consumer settings

https://gerrit.wikimedia.org/r/971113