investigate slapd memory leak
Closed, ResolvedPublic

Description

It looks like slapd / openldap regularly leaks memory on seaborgium / serpens, and this is one of the causes of T130446: Unable to SSH onto tools-login.wmflabs.org

2016-03-22-111610_430x290_scrot.png (290×430 px, 22 KB)
2016-03-22-111559_430x292_scrot.png (292×430 px, 23 KB)

Event Timeline

Moving to a backport of 2.4.41 is probably the better solution, though.

Seaborgium went OOM briefly just now and threw some errors.

Seems like we had one again today. serpens and seaborgium were reported by Icinga as having various issues. Then serpens recovered by itself, while seaborgium showed puppet errors. So I ran puppet to check and got:

seaborgium puppet-agent[6631]: Could not run command from postrun_command: Cannot allocate memory - fork(2)

I restarted slapd and that apparently fixed it.

Mentioned in SAL [2016-05-02T09:58:44Z] <moritzm> uploaded openldap 2.4.41+wmf1 for jessie-wikimedia to carbon (T130593)

I have tested the 2.4.41 packages in vagrant with a syncrepl setup and it seems fine. The update will happen next week, not really something for a Friday...

Mentioned in SAL [2016-07-17T10:31:13Z] <godog> restart slapd on serpens - T130593

Mentioned in SAL [2016-07-19T09:10:29Z] <godog> upgrade slapd to 2.4.41+dfsg-1+wmf1 on serpens - T130593

freshly restarted slapd

openldap 16378  1.8  1.2 466556 52048 ?        Ssl  09:10   0:03 /usr/sbin/slapd -h ldap:/// ldaps:/// ldapi:/// -g openldap -u openldap -f /etc/ldap/slapd.conf

serpens still shows some memory growth, possibly not fixed yet
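To track this kind of growth without relying on the graphs, the RSS column from the ps output above can be sampled periodically. The following is only a rough sketch: the helper names `kib_to_mib` and `sample_slapd_rss` are made up for illustration and are not part of any existing tooling on these hosts.

```shell
#!/bin/sh
# Convert a resident set size in KiB (the unit ps reports) to MiB.
kib_to_mib() {
    awk -v kib="$1" 'BEGIN { printf "%.1f\n", kib / 1024 }'
}

# Print one timestamped sample: total RSS of all slapd processes in MiB.
# Reports 0.0 when no slapd process is running.
sample_slapd_rss() {
    rss_kib=$(ps -C slapd -o rss= | awk '{ sum += $1 } END { print sum + 0 }')
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(kib_to_mib "$rss_kib")"
}

sample_slapd_rss
```

Run from cron or a loop and appended to a file, this would give a simple time series; the freshly restarted value of 52048 KiB shown above corresponds to roughly 50.8 MiB.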

screenshot_kTxpIK.png (292×422 px, 22 KB)

I had to reboot seaborgium today as it froze up and took out ldap with it.

!log gnt-instance reboot seaborgium.wikimedia.org

I would say...definitely something is still up, and the really confusing thing is that the presence of serpens does not stop a sea of cascading failure.

indeed, looks like serpens started leaking memory even with an updated slapd, and as soon as it took over from seaborgium.

yuvipanda raised the priority of this task from Medium to High.

(moving to high since this caused a couple more outages)

Have we considered giving it more RAM?

Won't help much, only stretching the interval until it OOMs at some point again.

Change 300902 had a related patch set uploaded (by Dzahn):
labs: restart slapd once a week

https://gerrit.wikimedia.org/r/300902

Change 300902 merged by Andrew Bogott:
labs: restart slapd if it uses > 50% of memory

https://gerrit.wikimedia.org/r/300902

Cron jobs have been created on serpens and seaborgium. They check once an hour (at a random minute, so the two servers are never restarted at the same time) whether slapd is using more than 50% of memory, and restart the service with systemctl if that is the case.

At the time of merging, seaborgium was at about 46% and serpens at about 26%. I restarted slapd on seaborgium manually once and it went down to about 1% (sic!). I left serpens as it was, so it will reach the 50% threshold first.
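For readability, the hourly check can also be written out as a small standalone script. This is a sketch that mirrors the logic of the cron one-liner, not the deployed code; the function names `slapd_mem_percent` and `over_threshold` are illustrative only.

```shell
#!/bin/sh
# Sum the %MEM column over every process named slapd.
# Prints 0.0 if no slapd process is running.
slapd_mem_percent() {
    /bin/ps -C slapd -o pmem= | awk '{ sum += $1 } END { printf "%.1f\n", sum }'
}

# Succeed (exit status 0) only when usage is strictly above the limit.
# $1: usage in percent, $2: limit in percent
over_threshold() {
    awk -v used="$1" -v limit="$2" 'BEGIN { exit !(used > limit) }'
}

# Restart slapd only when it holds more than 50% of memory.
if over_threshold "$(slapd_mem_percent)" 50.0; then
    /bin/systemctl restart slapd
fi
```

As in the one-liner, exactly 50.0% does not trigger a restart, since the comparison is strict.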

[seaborgium:~] $ sudo crontab -u root -l | grep slap
# Puppet Name: restart_slapd
30 * * * * /bin/ps -C slapd -o pmem= | awk '{sum+=$1} END { if (sum <= 50.0) exit 1  }' && /bin/systemctl restart slapd >/dev/null 2>/dev/null

[serpens:~] $ sudo crontab -u root -l | grep slap
# Puppet Name: restart_slapd
49 * * * * /bin/ps -C slapd -o pmem= | awk '{sum+=$1} END { if (sum <= 50.0) exit 1  }' && /bin/systemctl restart slapd >/dev/null 2>/dev/null

[seaborgium:~] $ /bin/ps -C slapd -o pmem=
 3.5

[serpens:~] $ /bin/ps -C slapd -o pmem=
27.1

akosiaris claimed this task.
akosiaris added a subscriber: akosiaris.

I am guessing this works fine? Since then we've had no trouble, if I am not mistaken. I am resolving; feel free to reopen.

Yep, no problems in ages. Thanks for the bug cleanup.

I don't think this should be closed as long as this stuff exists:

modules/role/manifests/openldap/labs.pp

# restart slapd if it uses more than 50% of memory (T130593)
cron { 'restart_slapd':
    ensure  => present,
    minute  => fqdn_rand(60, $title),
    command => "/bin/ps -C slapd -o pmem= | awk '{sum+=\$1} END { if (sum <= 50.0) exit 1 }' \
    && /bin/systemctl restart slapd >/dev/null 2>/dev/null",
}

Yeah, that's correct, the underlying memory leak isn't fixed, only hidden by the restarts. This is likely still unfixed in stretch; there's nothing in the 2.4.41-2.4.44 changelog that points to a fix.

@GTirloni upgraded OpenLDAP on serpens to 2.4.47, but that doesn't change the memory leak.

@MoritzMuehlenhoff do you happen to know whether this is still happening, or can we close this?

akosiaris claimed this task.

I am just gonna re-resolve this. We haven't worked on it in 4 years, so it's clearly not a priority. Should we decide to revisit it, we can always reopen.