investigate slapd memory leak
Open, High, Public

Description

It looks like slapd / openldap regularly leaks memory on seaborgium / serpens, and this is one of the causes of T130446: Unable to SSH onto tools-login.wmflabs.org

Event Timeline

Moving to a backport of 2.4.41 is probably the better solution, though.

Seaborgium went OOM briefly just now and threw some errors.

Dzahn added a subscriber: Dzahn. Apr 21 2016, 1:54 AM

Seems like we had one again today. serpens and seaborgium were reported by Icinga as having various issues. Then serpens recovered by itself. seaborgium showed puppet errors, so I ran puppet to check and:

seaborgium puppet-agent[6631]: Could not run command from postrun_command: Cannot allocate memory - fork(2)

i restarted slapd and that apparently fixed it

I'll build a backport of 2.4.41.

Mentioned in SAL [2016-05-02T09:58:44Z] <moritzm> uploaded openldap 2.4.41+wmf1 for jessie-wikimedia to carbon (T130593)

I have tested the 2.4.41 packages in vagrant with a syncrepl setup and they seem fine. The update will happen next week; not really something for a Friday...

Mentioned in SAL [2016-07-17T10:31:13Z] <godog> restart slapd on serpens - T130593

Mentioned in SAL [2016-07-19T09:10:29Z] <godog> upgrade slapd to 2.4.41+dfsg-1+wmf1 on serpens - T130593

freshly restarted slapd

openldap 16378  1.8  1.2 466556 52048 ?        Ssl  09:10   0:03 /usr/sbin/slapd -h ldap:/// ldaps:/// ldapi:/// -g openldap -u openldap -f /etc/ldap/slapd.conf

serpens still shows some memory growth, possibly not fixed yet
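
To tell a slow leak from normal cache warm-up, it may help to sample slapd's resident set size at a fixed interval and compare the values over time. The following is only a minimal sketch, assuming a Linux host with procps `ps` and a `date` that supports `+%s`; the function names and the log path are hypothetical and not part of any existing tooling here.

```shell
#!/bin/sh
# Sum RSS values (KiB, one number per line on stdin) and print the total.
# Prints 0 when there is no input, e.g. when slapd is not running.
sum_rss_kb() {
    awk '{ sum += $1 } END { printf "%d\n", sum }'
}

# Print one sample line: "<epoch seconds> <total slapd RSS in KiB>".
sample_slapd_rss() {
    printf '%s %s\n' "$(date +%s)" "$(ps -C slapd -o rss= | sum_rss_kb)"
}

# Example (not run here): one sample per minute into a hypothetical log file.
# while true; do sample_slapd_rss >> /tmp/slapd-rss.log; sleep 60; done
```

A steadily rising second column across hours, with no plateau, would be consistent with the growth reported above.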

I had to reboot seaborgium today as it froze up and took out ldap with it.

!log gnt-instance reboot seaborgium.wikimedia.org

I would say...definitely something is still up, and the really confusing thing is that the presence of serpens does not stop a sea of cascading failure.

indeed, looks like serpens started leaking memory even with an updated slapd, and as soon as it took over from seaborgium.

yuvipanda removed MoritzMuehlenhoff as the assignee of this task. Jul 25 2016, 3:42 PM
yuvipanda raised the priority of this task from Normal to High.

(moving to high since this caused a couple more outages)

Have we considered giving it more RAM?

Won't help much; it would only stretch the interval until it OOMs again.

Change 300902 had a related patch set uploaded (by Dzahn):
labs: restart slapd once a week

https://gerrit.wikimedia.org/r/300902

Change 300902 merged by Andrew Bogott:
labs: restart slapd if it uses > 50% of memory

https://gerrit.wikimedia.org/r/300902

Dzahn added a comment. Aug 3 2016, 6:13 PM

Crons have been created on serpens and seaborgium. They check once an hour (at a per-host random minute, so the two hosts are not restarted at the same time) whether slapd uses more than 50% of memory, and restart the service with systemctl if that is the case.

At the time of merging, seaborgium was at about 46% and serpens at about 26%. I restarted slapd on seaborgium manually once and it went down to like 1% (sic!). I left serpens as it was, so it will reach the 50% first.

[seaborgium:~] $ sudo crontab -u root -l | grep slap
# Puppet Name: restart_slapd
30 * * * * /bin/ps -C slapd -o pmem= | awk '{sum+=$1} END { if (sum <= 50.0) exit 1  }' && /bin/systemctl restart slapd >/dev/null 2>/dev/null

[serpens:~] $ sudo crontab -u root -l | grep slap
# Puppet Name: restart_slapd
49 * * * * /bin/ps -C slapd -o pmem= | awk '{sum+=$1} END { if (sum <= 50.0) exit 1  }' && /bin/systemctl restart slapd >/dev/null 2>/dev/null

Dzahn added a comment. Aug 3 2016, 6:14 PM
[seaborgium:~] $ /bin/ps -C slapd -o pmem=
 3.5

[serpens:~] $ /bin/ps -C slapd -o pmem=
27.1
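
The cron entries above sum the `%MEM` column over all slapd processes and only let `systemctl restart` run when the sum is above 50. As a hedged illustration of that check in isolation, here is a small sketch with the threshold as a parameter; `over_threshold` is a hypothetical helper name, not something deployed on these hosts.

```shell
#!/bin/sh
# Read %MEM values (one per line) on stdin; succeed (exit 0) only when
# their sum is strictly greater than the threshold given as $1 (default 50.0).
over_threshold() {
    awk -v limit="${1:-50.0}" '{ sum += $1 } END { if (sum > limit) exit 0; exit 1 }'
}

# Usage mirroring the cron entry (not run here):
# /bin/ps -C slapd -o pmem= | over_threshold 50.0 && /bin/systemctl restart slapd
```

With the readings shown above (3.5 and 27.1), the check fails on both hosts, so neither would be restarted yet.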
akosiaris closed this task as Resolved. Jun 30 2017, 8:51 AM
akosiaris claimed this task.
akosiaris added a subscriber: akosiaris.

I am guessing this works fine? Since then we've had 0 troubles, if I am not mistaken. I am resolving; feel free to reopen.

Yep, no problems in ages. Thanks for the bug cleanup.

chasemp reopened this task as Open. Jun 30 2017, 6:04 PM

I don't think this should be closed as long as this stuff exists:

modules/role/manifests/openldap/labs.pp

# restart slapd if it uses more than 50% of memory (T130593)
cron { 'restart_slapd':
    ensure  => present,
    minute  => fqdn_rand(60, $title),
    command => "/bin/ps -C slapd -o pmem= | awk '{sum+=\$1} END { if (sum <= 50.0) exit 1 }' \
    && /bin/systemctl restart slapd >/dev/null 2>/dev/null",
}
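
The `fqdn_rand(60, $title)` call in the manifest above gives each host a stable minute between 0 and 59, which is why the two servers ended up with different cron minutes. A rough shell analogue of the idea, hashing the host name into a minute, might look like the sketch below. It is an illustration only: `host_minute` is a hypothetical name, and the `cksum`-based hash is an assumption for demonstration, not the algorithm Puppet uses, so it will not reproduce the actual minutes (30 and 49).

```shell
#!/bin/sh
# Map a host name to a deterministic minute in the range 0..59.
# Uses the POSIX cksum CRC as a stand-in hash; Puppet's fqdn_rand differs.
host_minute() {
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( crc % 60 ))
}

# Example (not run here):
# host_minute "$(hostname -f)"
```

Because the value depends only on the host name, it stays the same across runs, but two hosts can still collide on the same minute by chance.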

Yeah, that's correct: the underlying memory leak isn't fixed, only masked by the restarts. It is likely still unfixed in stretch; there's nothing in the 2.4.41-2.4.44 changelog that points to a fix.

akosiaris removed akosiaris as the assignee of this task. Dec 19 2017, 10:28 AM

@GTirloni upgraded OpenLDAP on serpens to 2.4.47, but that doesn't change the memory leak.

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM