
Memory leak in Change-Prop
Closed, Resolved, Public

Description

Since the Dec 12 deploy, the memory usage in ChangeProp has started growing over time: https://grafana.wikimedia.org/dashboard/db/eventbus?from=1481464969163&to=1481730678921&var-site=All&var-topic=All

and there have already been a couple of worker restarts. The deploy was big and contained the following changes:

Update change-propagation to b2bf30d

List of changes:
fdf1267 Use delivery callback to guarantee delivery
3717c06 Use 'connect' callback to figure out we're connected. Better then ready event
639eb6f Don't fail if the resolver is not found for delivery report
ad6d4b9 Actually set up the logger
9e940c4 Got rid of the function keyword
cf9a705 Use shared kafka management tools
b40f5ed Don't clean in travis
373b9e5 Run scripts after npm install
615b2aa Better sourcing the file
875151a Better sourcing the file
2cbaa03 Better check for travis environment
3f525cb Correctly kill kafka afterwards
f95bf71 Improved cleanup script soursing
35f7b9a Improved CPU usage
b2bf30d Release v0.6.3
xxxxxxx Update node module dependencies

I suspect the memory leak is related to the introduction of delivery callbacks, but this needs more investigation.
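
For context, here is a minimal sketch of how a delivery-callback ("guaranteed") producer is typically wired with node-rdkafka. The pending-map bookkeeping is only an assumption about where unresolved entries could pile up, not the actual change-propagation code:

```
'use strict';

const Kafka = require('node-rdkafka');

// Hypothetical bookkeeping: each produce() registers a resolver that the
// delivery report is expected to release. If reports never arrive, or come
// back without a matching opaque, entries accumulate here and leak.
const pending = new Map();
let nextId = 0;

const producer = new Kafka.Producer({
    'metadata.broker.list': 'localhost:9092',
    'dr_cb': true                       // ask librdkafka for delivery reports
});

producer.on('delivery-report', (err, report) => {
    const entry = pending.get(report.opaque);
    if (!entry) {
        return;                         // cf. 639eb6f: don't fail if the resolver is missing
    }
    pending.delete(report.opaque);
    return err ? entry.reject(err) : entry.resolve(report);
});

producer.setPollInterval(100);          // poll regularly so delivery reports are emitted

function guaranteedProduce(topic, message) {
    return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        producer.produce(topic, null, Buffer.from(message), undefined, Date.now(), id);
    });
}

producer.on('ready', () => guaranteedProduce('test_topic', 'hello').catch(console.error));
producer.connect();
```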

Event Timeline

Pchelolo created this task. Dec 14 2016, 5:16 PM
Restricted Application added a subscriber: Aklapper. Dec 14 2016, 5:16 PM
Pchelolo added a comment (edited). Dec 15 2016, 12:01 AM

I have been playing around with this for half a day today and was unable to identify the leak. I propose to try temporarily switching off the GuaranteedProducer and see where that brings us. At least we will have a clearer understanding of which particular change introduced the leak.

@Pchelolo, sounds sensible to me. I guess it would be good to get this out as early as possible tomorrow. Maybe @mobrovac can deploy it during his morning?

PR at https://github.com/wikimedia/change-propagation/pull/151. @mobrovac, could you look into deploying this in your morning?
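
Purely as an illustration of what switching the guarantee off could look like (the actual change is in the PR above and is not reproduced here): produce fire-and-forget, so no per-message resolver is ever registered and the suspected bookkeeping cannot accumulate.

```
'use strict';

const Kafka = require('node-rdkafka');

// Illustration only: with delivery reports disabled there is nothing to
// keep per message, at the cost of not confirming delivery to the caller.
const producer = new Kafka.Producer({
    'metadata.broker.list': 'localhost:9092'
    // no 'dr_cb': true -- fire and forget
});

function produce(topic, message) {
    // Throws synchronously if the local queue is full; otherwise delivery
    // is simply not reported back.
    producer.produce(topic, null, Buffer.from(message));
}

producer.on('ready', () => produce('test_topic', 'hello'));
producer.connect();
```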

mobrovac triaged this task as High priority. Dec 15 2016, 11:06 AM

Yup, on it.

Mentioned in SAL (#wikimedia-operations) [2016-12-15T11:16:32Z] <mobrovac@tin> Starting deploy [changeprop/deploy@9eab965]: Deploying fix for T153215

Mentioned in SAL (#wikimedia-operations) [2016-12-15T11:17:27Z] <mobrovac@tin> Finished deploy [changeprop/deploy@9eab965]: Deploying fix for T153215 (duration: 00m 54s)

mobrovac lowered the priority of this task from High to Medium. Dec 15 2016, 12:10 PM

The memory consumption has decreased five-fold in the hour since the deploy, so you were right, @Pchelolo - this seems to be a problem with the GuaranteedProducer.

Undeploying the GuaranteedProducer stopped the leak, so it's somehow related to the delivery report. Unfortunately, I'm still not able to reproduce this locally even with several brokers, so the only option left for the next step is to deploy the leak back to production, take 2 heap dumps and compare them. I highly doubt the leak is in JS land; most likely it's in C++ land, so we'd need to gcore and gdb it. Not sure how ops would feel about running gcore against a running production service on a production machine..
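
For the JS-land half of that comparison, heap snapshots can also be taken in-process; here is a sketch using the heapdump npm module (an assumed tool choice - the comment above only mentions gcore/gdb for the native side):

```
'use strict';

// Assumed tooling: the `heapdump` npm module (not part of change-propagation).
// Requiring it also installs a SIGUSR2 handler, so `kill -USR2 <pid>` can
// trigger a snapshot on a running worker without redeploying.
const heapdump = require('heapdump');

// Take two snapshots some time apart and diff them in Chrome DevTools
// ("Comparison" view) to see which objects are accumulating.
function snapshot(label) {
    return new Promise((resolve, reject) => {
        const file = `/tmp/changeprop-${label}-${Date.now()}.heapsnapshot`;
        heapdump.writeSnapshot(file, (err, filename) =>
            err ? reject(err) : resolve(filename));
    });
}

snapshot('baseline')
    .then(() => new Promise((resolve) => setTimeout(resolve, 30 * 60 * 1000)))
    .then(() => snapshot('after-30min'))
    .then((f) => console.log(`second snapshot written to ${f}`))
    .catch(console.error);
```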

mobrovac changed the task status from Open to Stalled. Dec 16 2016, 11:52 PM

Let's first try to take some heap dumps and rule that part out. But we will have to wait until after the freeze for this, so setting the task as stalled.

Pchelolo closed this task as Resolved. Mar 28 2017, 8:08 PM
Pchelolo edited projects, added Services (done); removed Services (next).

After upgrading node-rdkafka to a newer version and using a different approach to delivery reports, the memory doesn't leak any more, so the issue was automagically resolved.
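
For completeness, one plausible shape of "a different approach to delivery reports" is to let the resolver ride along in the produce() opaque instead of keeping a long-lived keyed map; this is an illustrative sketch only, not necessarily what change-propagation actually ended up doing:

```
'use strict';

const Kafka = require('node-rdkafka');

const producer = new Kafka.Producer({
    'metadata.broker.list': 'localhost:9092',
    'dr_cb': true
});
producer.setPollInterval(100);

// The opaque value passed to produce() comes back on the matching delivery
// report, so the resolver travels with the message and no separate map of
// pending entries has to be maintained.
producer.on('delivery-report', (err, report) => {
    const resolver = report.opaque;
    if (!resolver) {
        return;
    }
    return err ? resolver.reject(err) : resolver.resolve(report);
});

function guaranteedProduce(topic, message) {
    return new Promise((resolve, reject) => {
        producer.produce(topic, null, Buffer.from(message), undefined,
            Date.now(), { resolve, reject });
    });
}
```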

Restricted Application added a project: Analytics. Mar 28 2017, 8:08 PM