ChangeProp failing on Node v4.6.0
Closed, ResolvedPublic

Description

After upgrading Node to v4.6.0, Change-Prop's master process starts failing after initialising its workers:

[2016-10-11T12:20:30.739Z]  INFO: changeprop/2 on scb1001: master(2) initializing 24 workers (levelPath=info/service-runner)
[2016-10-11T12:20:35.148Z]  INFO: changeprop/136 on scb1001: Kafka Queue module initialised (levelPath=info/change-prop/init)
[2016-10-11T12:20:35.150Z]  WARN: changeprop/136 on scb1001: listening on *:7272 (levelPath=warn/startup)
...
[2016-10-11T12:21:35.927Z]  INFO: changeprop/4532 on scb1001: Kafka Queue module initialised (levelPath=info/change-prop/init)
[2016-10-11T12:21:35.930Z]  WARN: changeprop/4532 on scb1001: listening on *:7272 (levelPath=warn/startup)
[2016-10-11T12:21:37.521Z] FATAL: changeprop/2 on scb1001: false == true (err.actual=false, err.expected=true, err.operator===, err.generatedMessage=true, err.levelPath=fatal/service-runner/unhandled)
    AssertionError: false == true
        at SharedHandle.options.process.on.process.on.SharedHandle.add (cluster.js:84:3)
        at queryServer (cluster.js:488:12)
        at Worker.onmessage (cluster.js:436:7)
        at ChildProcess.<anonymous> (cluster.js:749:8)
        at emitTwo (events.js:92:20)
        at ChildProcess.emit (events.js:172:7)
        at internal/child_process.js:696:12
        at nextTickCallbackWith0Args (node.js:420:9)
        at process._tickCallback (node.js:349:13)

The release notes reveal that v4.5.0 introduced some changes to the cluster module, which might have caused a regression.

An initial investigation failed to reproduce the error outside of production: it appears only on scb1001 when the service is started via systemd; starting it manually does not trigger the error, nor does starting it on scb2001. The previous iteration of this error was connected to UDP sockets, and that seems to be the case again. Yet it is unclear at this point why the error manifests itself only when the service is run via systemd.
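
The assertion fires in cluster.js's SharedHandle code, which is the path a worker hits when it binds a UDP socket and asks the master to share the handle. Here is a minimal sketch of that code path, under the assumption that a worker-side dgram bind is indeed the trigger (an illustration only, not a confirmed upstream test case; the port is arbitrary):

    'use strict';
    // Hypothetical reproduction sketch: workers bind UDP sockets, which sends a
    // handle request to the master and exercises cluster's SharedHandle bookkeeping.
    const cluster = require('cluster');
    const dgram = require('dgram');

    if (cluster.isMaster) {
        // service-runner forks one worker per CPU; a few are enough to illustrate.
        for (let i = 0; i < 4; i++) {
            cluster.fork();
        }
    } else {
        const socket = dgram.createSocket('udp4');
        // Binding is what asks the master for a handle; UDP handles are always
        // served via SharedHandle, never round-robin.
        socket.bind(4827, () => {
            const msg = Buffer.from('ping');
            socket.send(msg, 0, msg.length, 4827, '127.0.0.1');
        });
    }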

Event Timeline

nor does starting it on scb2001

That's interesting. Any idea why?

One possible difference between eqiad and codfw is that we don't actually consume events in codfw, so no real processing is happening there. We wait for 1 minute before starting consumption, and that matches the timings in the logs: the first worker started at 12:20:35 and the crash happened 1 minute 2 seconds later, at 12:21:37.

Yes, indeed. I ruled out metrics sending. It seems node-rdkafka and/or librdkafka have something to do with it, since the service doesn't crash when there is no message processing.

Pchelolo edited projects, added Services (next); removed Services.

This is looking more and more related to the driver's consumption process. I confirmed that merely subscribing to topics does not crash the service. Also, for obvious reasons, the bug does not manifest itself when no worker is forked, i.e. for ncpu = 0. The bug still manifests itself after eliminating message production back to Kafka and offset committing. So we are left with message consumption only.
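
For reference, the distinction being tested is roughly the following, using the node-rdkafka API that change-prop builds on (a sketch only; broker, group and topic names are placeholders, not the production configuration):

    'use strict';
    const Kafka = require('node-rdkafka');

    // Placeholder connection settings, for illustration only.
    const consumer = new Kafka.KafkaConsumer({
        'group.id': 'changeprop-debug',
        'metadata.broker.list': 'localhost:9092'
    }, {});

    consumer.connect();

    consumer.on('ready', () => {
        // Merely subscribing does not crash the service...
        consumer.subscribe(['some_test_topic']);
        // ...the crash only appears once messages actually start flowing:
        consumer.consume();
    });

    consumer.on('data', (message) => {
        // Each processed message fans out into downstream work (and, as it
        // turns out, into more UDP traffic).
        console.log('consumed', message.topic, message.offset);
    });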

Pchelolo edited projects, added Services (doing); removed Services (next).

Looking at the related code in Node, the consumption from Kafka and the crash are correlated through UDP: the more Kafka messages are consumed, the more UDP traffic is created via metrics/logging/Varnish purging.

To support this theory, I commented out all the code that uses UDP and tried to start the service: it works like a charm.

The immediate suspect here is the Varnish purging code: if I uncomment it, the bug comes back.
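
At the socket level the purge path boils down to something like the sketch below: every consumed Kafka message that touches a page turns into at least one UDP datagram. The multicast address, port and payload here are simplified placeholders; the real module builds binary HTCP CLR packets from the service configuration.

    'use strict';
    const dgram = require('dgram');

    // Placeholder purge target; the real values come from change-prop's config.
    const PURGE_HOST = '239.128.0.112';
    const PURGE_PORT = 4827;

    const socket = dgram.createSocket('udp4');

    function purgeUrl(url) {
        // The real htcp-purge module builds a binary HTCP CLR packet; a plain
        // buffer is enough to show that each purge is one UDP send.
        const packet = Buffer.from(url, 'utf8');
        socket.send(packet, 0, packet.length, PURGE_PORT, PURGE_HOST);
    }

    // One consumed change event -> one (or more) purge datagrams.
    purgeUrl('http://en.wikipedia.org/wiki/Main_Page');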

When I was running ChangeProp under my user on SCB, I had logging disabled and the bug did not manifest. That would suggest logging is to blame too, but that's not the case, because RESTBase already actively uses the same logging facilities in production without issues.

Here's a PR with a workaround that seems to fix the problem: https://github.com/wikimedia/htcp-purge/pull/5

However, we need to report this upstream; it looks like a regression.
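
The general idea of such a workaround is to keep the UDP socket handle local to the worker instead of sharing it through the cluster master, for example by binding with exclusive: true (a sketch of that idea only; the PR's actual approach may differ):

    'use strict';
    const dgram = require('dgram');

    const socket = dgram.createSocket('udp4');

    // `exclusive: true` tells Node not to share the handle through the cluster
    // master, so the worker keeps its own socket and the SharedHandle path in
    // cluster.js is never taken.
    socket.bind({ port: 0, exclusive: true }, () => {
        const msg = Buffer.from('purge');
        socket.send(msg, 0, msg.length, 4827, '127.0.0.1');
    });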

Since we are updating to 4.6.0, maybe it's worth going for 4.6.1 right away? It's a new security release that came out today: https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V4.md#4.6.1

@Pchelolo: This doesn't affect us; we link dynamically against Debian's copy of c-ares, and the fix for that is already deployed cluster-wide.

The htcp-purge fix did the trick! Thank you, @Pchelolo!