Page MenuHomePhabricator

memory leak on keyholder-proxy on buster/python 3.7
Open, MediumPublic

Description

It looks like keyholder suffers a memory leak that's only triggered when it's running on buster/python 3.7. Currently the only affected instances are the acme-chief servers.

acmechief-test1001
keyhold+   429  0.0 13.7 449792 280176 ?       Ss   Sep13  17:52 python3 /usr/local/bin/ssh-agent-proxy --bind /run/keyholder/proxy.sock --connect /run/keyholder/agent.sock
acmechief-test2001
keyhold+   396  0.0 16.1 405928 329472 ?       Ss   Sep13  13:31 python3 /usr/local/bin/ssh-agent-proxy --bind /run/keyholder/proxy.sock --connect /run/keyholder/agent.sock

Details

Related Gerrit Patches:
operations/software/keyholder : masterdaemon: fix memory leak in Python 3.7+
operations/puppet : productionkeyholder: notify only the proxy on proxy changes
operations/puppet : productionkeyholder: fix memory leak in Python 3.7+

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptThu, Nov 28, 8:06 AM
Volans added a subscriber: Volans.Thu, Nov 28, 9:37 AM

I'm doing a quick debug attempt on acmechief-test2001

Change 553460 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] keyholder: fix memory leak in Python 3.7+

https://gerrit.wikimedia.org/r/553460

I was able to debug the issue using tracemalloc:

# at the top of the file
import tracemalloc
tracemalloc.start(5)

# in the SshAgentProxyHandler class
# in setup()
self.prev_trace = tracemalloc.take_snapshot()

# in the `while 1` in `handle()`
cur_trace = tracemalloc.take_snapshot()
trace_stats = cur_trace.compare_to(self.prev_trace, 'lineno')
for trace_stat in trace_stats[:3]:
    logger.info(trace_stat)
self.prev_trace = cur_trace

That pointed to the threading module:

/usr/lib/python3.7/threading.py:553: size=144 B (+144 B), count=2 (+2), average=72 B

And from there googling around pointed me to this change in default that is described in the commit message of the above CR that should have the fix, at least from the tests I was able to do.
That property exists back to Python 3.4, so the change should be fully backward compatible up to Jessie at least.
To be fair the tracemalloc logging is present also with the fix, but the RES memory used is not going steadily up (although slowly) but it's fluctuating. So I'm fairly confident the fix should work.

I've restored acmechief-test2001 back to its normal status.

Change 553481 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/keyholder@master] daemon: fix memory leak in Python 3.7+

https://gerrit.wikimedia.org/r/553481

jbond triaged this task as Medium priority.Thu, Nov 28, 4:30 PM

Mentioned in SAL (#wikimedia-operations) [2019-11-29T09:01:13Z] <volans> temporary disabling puppet on 'R:keyholder::agent' to merge gerrit:operations/puppet/+/553460 - T239386

Change 553460 merged by Volans:
[operations/puppet@production] keyholder: fix memory leak in Python 3.7+

https://gerrit.wikimedia.org/r/553460

Change 553700 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] keyholder proxy: do not notify the agent

https://gerrit.wikimedia.org/r/553700

Change 553700 merged by Volans:
[operations/puppet@production] keyholder: notify only the proxy on proxy changes

https://gerrit.wikimedia.org/r/553700

ema moved this task from Triage to TLS on the Traffic board.Fri, Nov 29, 2:18 PM

Change 553481 merged by jenkins-bot:
[operations/software/keyholder@master] daemon: fix memory leak in Python 3.7+

https://gerrit.wikimedia.org/r/553481

Volans added a comment.Mon, Dec 9, 2:34 PM

So far so good, leaving it open for another week or two to ensure the issue is totally fixed.