It looks like slapd / openldap regularly leaks memory on seaborgium / serpens, and this is one of the causes of T130446: Unable to SSH onto tools-login.wmflabs.org
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +7 -0 | labs: restart slapd if it uses > 50% of memory
Event Timeline
There's a similar report for openldap 2.4.40 at http://www.openldap.org/lists/openldap-technical/201504/msg00005.html
There are three memory-leak fixes related to syncrepl/syncprov in openldap 2.4.41 which might address this:
http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8035
http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8038
http://www.openldap.org/its/index.cgi/Software%20Bugs?id=8039
Seems like we had one again today. serpens and seaborgium were reported by Icinga as having various issues, then serpens recovered by itself. seaborgium showed puppet errors, so I ran puppet to check and got:
seaborgium puppet-agent[6631]: Could not run command from postrun_command: Cannot allocate memory - fork(2)
I restarted slapd and that apparently fixed it.
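For reference, a minimal way to check slapd's memory footprint and bounce the service by hand (a sketch, assuming systemd and the stock slapd unit name):

ps -C slapd -o pmem=,rss=,etime=   # percent of RAM, resident set size (KiB) and uptime of slapd
sudo systemctl restart slapd       # restart the service if the host is about to run out of memory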
Mentioned in SAL [2016-05-02T09:58:44Z] <moritzm> uploaded openldap 2.4.41+wmf1 for jessie-wikimedia to carbon (T130593)
I have tested the 2.4.41 packages in vagrant with a syncrepl setup and they seem fine. The update will happen next week, not really something for a Friday...
Mentioned in SAL [2016-07-19T09:10:29Z] <godog> upgrade slapd to 2.4.41+dfsg-1+wmf1 on serpens - T130593
freshly restarted slapd
openldap 16378 1.8 1.2 466556 52048 ? Ssl 09:10 0:03 /usr/sbin/slapd -h ldap:/// ldaps:/// ldapi:/// -g openldap -u openldap -f /etc/ldap/slapd.conf
I had to reboot seaborgium today as it froze up and took out ldap with it.
!log gnt-instance reboot seaborgium.wikimedia.org
I would say... definitely something is still up, and the really confusing thing is that the presence of serpens does not stop the cascading failures.
Indeed, it looks like serpens started leaking memory even with an updated slapd, as soon as it took over from seaborgium.
Change 300902 had a related patch set uploaded (by Dzahn):
labs: restart slapd once a week
Change 300902 merged by Andrew Bogott:
labs: restart slapd if it uses > 50% of memory
Cron jobs have been created on serpens and seaborgium. They check once an hour (at a random per-host minute, so the two hosts are never restarted at the same time) whether slapd uses more than 50% of memory, and restart the service with systemctl if it does.
At the time of merging, seaborgium was at about 46% and serpens at about 26%. I restarted slapd on seaborgium manually once and it went down to about 1% (!). serpens I left as it was, so it will hit the 50% threshold first.
[seaborgium:~] $ sudo crontab -u root -l | grep slap
# Puppet Name: restart_slapd
30 * * * * /bin/ps -C slapd -o pmem= | awk '{sum+=$1} END { if (sum <= 50.0) exit 1 }' && /bin/systemctl restart slapd >/dev/null 2>/dev/null

[serpens:~] $ sudo crontab -u root -l | grep slap
# Puppet Name: restart_slapd
49 * * * * /bin/ps -C slapd -o pmem= | awk '{sum+=$1} END { if (sum <= 50.0) exit 1 }' && /bin/systemctl restart slapd >/dev/null 2>/dev/null
[seaborgium:~] $ /bin/ps -C slapd -o pmem=
 3.5
[serpens:~] $ /bin/ps -C slapd -o pmem=
27.1
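For clarity, the check relies on awk's exit status: it sums the pmem column across all slapd processes and exits 1 when the total is at or below 50, so the `&& /bin/systemctl restart slapd` part only runs above the threshold. A quick sanity check of the logic (my own illustration, feeding fake values in place of the ps output):

echo "27.1" | awk '{sum+=$1} END { if (sum <= 50.0) exit 1 }' && echo restart || echo below-threshold   # prints "below-threshold"
echo "61.3" | awk '{sum+=$1} END { if (sum <= 50.0) exit 1 }' && echo restart || echo below-threshold   # prints "restart"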
I am guessing this works fine? Since then we've had no trouble, if I am not mistaken. I am resolving; feel free to reopen.
I don't think this should be closed as long as this stuff exists:
modules/role/manifests/openldap/labs.pp
# restart slapd if it uses more than 50% of memory (T130593)
cron { 'restart_slapd':
    ensure  => present,
    minute  => fqdn_rand(60, $title),
    command => "/bin/ps -C slapd -o pmem= | awk '{sum+=\$1} END { if (sum <= 50.0) exit 1 }' \
        && /bin/systemctl restart slapd >/dev/null 2>/dev/null",
}
Yeah, that's correct: the underlying memory leak isn't fixed, only hidden by the restarts. It is likely still unfixed in stretch; there's nothing in the 2.4.41-2.4.44 changelogs which points to a fix.
@GTirloni upgraded OpenLDAP on serpens to 2.4.47, but that doesn't fix the memory leak.
@MoritzMuehlenhoff do you happen to know whether this is still happening, or can we close this?
I am just gonna re-resolve this. We haven't worked on it in 4 years; it's clearly not a priority. Should we decide to revisit it, we can always reopen.