
LDAP server running out of memory frequently and disrupting Cloud VPS clients
Closed, ResolvedPublic

Description

The rate of memory leaking from T130593: investigate slapd memory leak seems to have increased to the point where the band-aid solution of cron'd restarts, implemented two years ago, is now causing noticeable client disruption.
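For context, the band-aid is essentially a scheduled restart; a minimal sketch of what such a cron'd check could look like (hypothetical threshold and schedule; the real Puppet-managed job may differ):

#!/bin/sh
# Hypothetical illustration only: restart slapd from cron before its resident
# memory exhausts the host. Threshold and schedule are assumptions.
RSS_KB=$(ps -o rss= -C slapd | awk '{s += $1} END {print s + 0}')
THRESHOLD_KB=8000000   # ~8 GB, an assumed value
if [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
    systemctl restart slapd
fi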


original bug report

I maintain the dexbot tool in Toolforge, and it contains lots of tools the community depends on. I received reports that they stopped working and that all of my services return 503 (for example: https://tools.wmflabs.org/dexbot/d.php). When I try to become dexbot to do the classic "turning the webservice off and on again", I get this:

ladsgroup@tools-sgebastion-07:~$ become dexbot
groups: cannot find name for group ID 50062

I was able to become the tool, though.
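The error means the GID-to-name lookup through NSS/LDAP failed at that moment; a quick way to check whether the lookup currently works on a bastion, using standard glibc tooling (nothing Toolforge-specific):

# Resolve the GID through the NSS stack (nscd/nslcd -> LDAP); empty output or
# a long delay indicates the same failure the `groups` error above reports.
getent group 50062

# Show the current user's group memberships, with names where resolvable.
id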

Related Objects

Event Timeline


Two new root emails for jobs that did not launch because of LDAP hiccups, and two more queues in a disabled state as a result:

scheduling info:            queue instance "task@tools-sgeexec-0925.tools.eqiad.wmflabs" dropped because it is temporarily not available
                            queue instance "task@tools-sgeexec-0909.tools.eqiad.wmflabs" dropped because it is temporarily not available
bd808 renamed this task from "groups: cannot find name for group ID" to "LDAP server running out of memory frequently and disrupting Cloud VPS clients". Mar 7 2019, 5:56 AM
bd808 updated the task description.

The issue is noticeable on the Jenkins job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ since it runs every 10 minutes or so. The failures are all lookups of GID 50062, which is the project-bastion group and is fairly large. I would assume the LDAP query times out somehow.

For the failures, on contint1001.wikimedia.org:

grep -l 'cannot find name for group' /srv/jenkins/builds/beta-scap-eqiad/*/log|xargs -n1 ls -la|awk '{ print $6 " " $7 " " $8 }'
Feb 28 11:01
Feb 28 11:50
Feb 28 21:30
Mar 1 00:02
Mar 1 11:05
Mar 1 23:25
Mar 2 01:17
Mar 2 07:30
Mar 2 09:15
Mar 2 15:42
Mar 2 23:07
Mar 3 06:55
Mar 3 07:15
Mar 3 07:35
Mar 3 16:20
Mar 3 17:20
Mar 4 06:05
Mar 4 11:45
Mar 4 12:05
Mar 4 17:30
Mar 4 17:45
Mar 5 04:45
Mar 5 21:31
Mar 6 11:18
Mar 6 11:58
Mar 6 15:36
Mar 6 16:48
Mar 6 17:31
Mar 6 18:50
Mar 7 06:00
Mar 7 06:25
Mar 7 07:47

So we still had group/LDAP lookup failures after the slapd upgrade and the vCPU increase.

The job runs on the WMCS instance deployment-deploy01.deployment-prep.eqiad.wmflabs. nslcd has connections established to both LDAP servers:

deployment-deploy01$ sudo netstat --wide -tp|grep :ldap
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:57762 serpens.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:50030 serpens.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:39256 seaborgium.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:55294 serpens.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:34932 seaborgium.wikimedia.org:ldap ESTABLISHED 565/nslcd

I see ldap:// connections timing out for both servers in /var/log/daemon.log.
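To get a feel for where the time goes, one can compare the NSS path (through nscd/nslcd) with a direct query against the servers; a rough sketch, assuming anonymous reads of the groups subtree are allowed (hostnames taken from the netstat output above):

# Time the lookup through the NSS stack (nslcd, possibly the nscd cache).
time getent group 50062

# Time the same lookup straight against one LDAP server, bypassing nslcd
# entirely (anonymous simple bind assumed).
time ldapsearch -x -LLL -H ldap://serpens.wikimedia.org \
    -b ou=groups,dc=wikimedia,dc=org '(gidNumber=50062)' cn gidNumber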

I had a closer look at the Grafana eqiad OpenLDAP threads graph. The metrics come from a Prometheus exporter, which exposes the metrics listed at https://github.com/jcollie/openldap_exporter#example-output

I could not find any upstream documentation about those metrics and ended up looking at the source code in servers/slapd/back-monitor/thread.c. The descriptions are not that helpful, but still:

| Threads metric | Description from source | Value | Note |
| -------------- | ------------------------ | ----- | ---- |
| Active | Number of active threads | 16 | Always 16! |
| Backload | Number of active plus pending threads | 363 | e.g. 347 pending + 16 active |
| Max | Maximum number of threads as configured | 16 | |
| Max Pending | Maximum number of pending threads | n/a | Not captured |
| Open | Number of open threads | 16 | Always 16 |
| Pending | Number of pending threads | 346 | Fluctuates and seems way too large |
| Starting | Number of threads being started | 0 | |
| State | Thread pool state | n/a | Not captured |
| Runqueue | Queue of running threads, besides those handling operations | n/a | Not captured |
| Tasklist | List of running plus standby threads, besides those handling operations | n/a | Not captured |

The Prometheus exporter does not capture Max Pending, State, Runqueue, or Tasklist; it might be worth filing a task upstream and proposing a patch for those.

Regardless: we have a pool of at most 16 threads, and they are always active (Active is always 16). The Pending count is rather high compared to the pool size (346 / 16 ≈ 21.6).

The monitoring metrics lack any measure of the velocity at which the queue is being processed; maybe we can speculate that processing takes long enough that eventually some requests time out? Notably, requests for project-bastion take a while and might block processing of other requests while clients are busy fetching the large response. I noticed that previously with Keyholder, which retrieves all groups using a syscall that is not cached by nscd (T204681#4598659); it took a good second to retrieve.

So in the worst-case scenario, if each thread takes one second to reply, then with a backlog ratio of ~21 a lot of replies would take more than 10 seconds, which might hit a timeout somewhere.

Anyway, all of the above leads me to believe that we do not process requests fast enough and we should raise the thread pool size. From slapd.conf(5):

> `threads <integer>`
>
> Specify the maximum size of the primary thread pool. The default is `16`; the minimum value is `2`.
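For illustration, in conf-file style configuration a raised pool is a single directive; the value below is a placeholder (the change proposed here derives it from the processor count instead, and on a cn=config deployment the equivalent attribute is olcThreads):

# slapd.conf -- illustrative value only; the compiled-in default is 16
threads 32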

Also from Debian package /usr/share/slapd/slapd.conf:

# The dbconfig settings are used to generate a DB_CONFIG file the first
# time slapd starts.  They do NOT override existing an existing DB_CONFIG
# file.  You should therefore change these settings in DB_CONFIG directly
# or remove DB_CONFIG and restart slapd for changes to take effect.

# For the Debian package we use 2MB as default but be sure to update this
# value if you have plenty of RAM
dbconfig set_cachesize 0 2097152 0

It might be worth raising that database cache to more than 2 MB, since we do have plenty of RAM. However, I don't have cache hit/miss metrics for it.
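If the backend is BDB/HDB, the missing cache hit/miss numbers can be read with db_stat from the Berkeley DB utilities; a sketch, assuming the default Debian database directory (with the mdb backend this does not apply, as mdb relies on the OS page cache rather than set_cachesize):

# Memory pool (cache) statistics for the slapd database environment:
# pages found in the cache vs. pages read in from disk.
sudo db_stat -m -h /var/lib/ldap | head -n 20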

Mentioned in SAL (#wikimedia-operations) [2019-03-07T11:28:30Z] <gtirloni> updated seaborgium to stretch (T217280)

Change 494911 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openldap: Set thread pool based on processor count

https://gerrit.wikimedia.org/r/494911

@hashar nice research, thank you! I've submitted a change to increase the number of threads. It does seem we have a backlog of threads periodically and this could help with that.

I haven't submitted a change to increase the cache size yet because that would only help with disk reads, and I barely see any disk I/O on these servers (which I take to mean the cache is working okay and slapd doesn't have to go to disk for much). My understanding was also that this cachesize is the BerkeleyDB in-memory cache size. If it's instead a cache that avoids network requests, then my reasoning doesn't apply.

I'm really missing a LDAP latency graph.

Next step is to try Moritz' suggestion and point Toolforge Stretch to a single LDAP server to see how load behaves.

Screenshot from 2019-03-07 09-57-20.png (269×1 px, 73 KB)

This is the impact of the proposed change to the thread count, manually applied to both LDAP servers 11:45-12:50 UTC (and then reverted). It doesn't help with all the peaks, which still need to be investigated.

Change 494922 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] ldap: increase group TTL from 60 to 3600 seconds in labs

https://gerrit.wikimedia.org/r/494922

Most active talkers to LDAP (in bytes, ~10min packet capture):

  1. 68M deployment-deploy01.deployment-prep.eqiad.wmflabs
  2. 61M tools-sgeexec-0914.tools.eqiad.wmflabs
  3. 56M tools-sgeexec-0935.tools.eqiad.wmflabs
  4. 43M soweego-2.soweego.eqiad.wmflabs
  5. 39M tools-sgeexec-0932.tools.eqiad.wmflabs
  6. 30M tools-sgeexec-0905.tools.eqiad.wmflabs
  7. 28M tools-sgeexec-0922.tools.eqiad.wmflabs
  8. 26M tools-sgeexec-0921.tools.eqiad.wmflabs
  9. 26M integration-slave-docker-1034.integration.eqiad.wmflabs
  10. 24M tools-sgeexec-0912.tools.eqiad.wmflabs
  11. 23M tools-sgeexec-0931.tools.eqiad.wmflabs
  12. 22M tools-sgeexec-0927.tools.eqiad.wmflabs
  13. 21M tools-mail-02.tools.eqiad.wmflabs

For perspective, the first Toolforge Trusty server appears at #56 (tools-bastion-02 with 1.2M)
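For reference, a list like the one above can be produced from a short capture by summing TCP payload bytes per peer; a rough sketch (interface name, duration and aggregation are assumptions, and the capture here may have been made differently):

# Capture ~10 minutes of LDAP traffic and sum payload bytes per source
# endpoint (address.port); collapsing ports into per-host totals and
# resolving names is left out of this sketch.
sudo timeout 600 tcpdump -i eth0 -nn -q 'tcp port 389' 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "tcp") bytes[$3] += $(i + 1) }
         END { for (h in bytes) printf "%12d %s\n", bytes[h], h }' \
  | sort -rn | head -n 15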

I added nslcd debug logging on deployment-deploy01:

/etc/nslcd.conf
log /var/log/nslcd.log debug

And disabled puppet. Will let it run for a while then process the results.

The nslcd debug log dumps each of the fields returned for a given query. The queries can be found by grepping for deref and grouping by the query summary (e.g. a group members lookup, group=50062):

$ grep deref nslcd.log|cut -d\  -f3|sort|uniq -c|sort -rn
1478050 <group=50062>
 470060 <group(all)>
  73374 <group=50380>
  13340 <group=50120>
   4347 <group=1003>
   1836 <group=50174>
    714 <group=53809>
    435 <group="project-deployment-prep">
    160 <group=52308>
    140 <group=53013>
    126 <group=50156>
    119 <group=52033>
     98 <group=50074>
     56 <group=52388>
     44 <group=500>
     20 <group=50258>
     14 <group=53839>
      8 <group=53985>
      7 <group=52803>
      6 <group=603>
      6 <group=51691>
      5 <group=53861>
      4 <group=52633>
      4 <group=52325>
      2 <group="mwdeploy">

I extracted group(all) for a single run:

$ grep 6d8d3c nslcd.log |grep -v deref
nslcd: [6d8d3c] DEBUG: connection from pid=382 uid=497 gid=53807
nslcd: [6d8d3c] <group(all)> DEBUG: myldap_search(base="ou=groups,dc=wikimedia,dc=org", filter="(objectClass=posixGroup)")
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=wikidev,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=wmf,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=ops,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=labsadminbots,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-bastion,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-gerrit,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-openstack,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-wikidata-dev,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-account-creation-assistance,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-analytics,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): ... 196 more results
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): end of results (206 total)
nslcd: [6d8d3c] <group(all)> DEBUG: myldap_search(base="ou=servicegroups,dc=wikimedia,dc=org", filter="(objectClass=posixGroup)")
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.wmib,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.cluebot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.czarchivebot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.rezabot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.asurabot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.stemmeberettigelse,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.logs,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.admin,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.coibot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.linkwatcher,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): ... 2250 more results
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): end of results (2260 total)

And eventually there are 12370 matches for deref for that single query, which are, hmm, the groups' members. If we are just looking for a list of available groups, we probably do not need all their members?
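For comparison, a query that only needs the list of groups could ask for just the naming attributes and leave the member lists out; a sketch against the same subtrees (anonymous bind assumed), though whether nslcd can be told to issue it in this form is a separate question:

# Fetch only cn and gidNumber for every posixGroup, skipping member/memberUid,
# which is where the bulk of those 12370 dereferenced values comes from.
ldapsearch -x -LLL -H ldap://serpens.wikimedia.org \
    -b ou=groups,dc=wikimedia,dc=org '(objectClass=posixGroup)' cn gidNumber
ldapsearch -x -LLL -H ldap://serpens.wikimedia.org \
    -b ou=servicegroups,dc=wikimedia,dc=org '(objectClass=posixGroup)' cn gidNumber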

Sorry, I mentioned it previously: the LDAP spam from deployment-deploy01 is due to Keyholder / T204681. The fix is pending the release of a Debian package and the switch from Puppet-based deployment to that package.

Change 494911 abandoned by GTirloni:
openldap: Set thread pool based on processor count

Reason:
Since we've identified that Toolforge Stretch is doing way more LDAP requests than Toolforge Trusty, it seems increasing slapd threads is not worth the risk of introducing a new variable into this puzzle. Thanks for all the feedback.

https://gerrit.wikimedia.org/r/494911

Mentioned in SAL (#wikimedia-cloud) [2019-03-08T17:49:27Z] <bd808> Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)

Change 495490 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openldap: Use newer slapd from stretch-backports

https://gerrit.wikimedia.org/r/495490

Change 494922 merged by GTirloni:
[operations/puppet@production] ldap: increase group TTL from 60 to 300 seconds in labs

https://gerrit.wikimedia.org/r/494922

Mentioned in SAL (#wikimedia-operations) [2019-03-10T22:35:47Z] <gtirloni> toolforge stretch: increased nscd group TTL from 60 to 300sec (T217280)

Mentioned in SAL (#wikimedia-cloud) [2019-03-11T00:53:53Z] <bd808> Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280)

The impact of this has been barely noticeable in the LDAP operations graph, which suggests we may indeed have a situation where nscd caching is bypassed in SGE.
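One way to check whether nscd is being consulted at all is to look at its own hit/miss counters on an exec node; a sketch:

# Print nscd configuration and statistics; the "group cache" section shows
# cache hits vs. misses. Low hit counts despite heavy group lookups would
# support the theory that SGE bypasses the nscd cache.
sudo nscd -g | grep -A 20 'group cache'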

Mentioned in SAL (#wikimedia-operations) [2019-03-11T12:23:27Z] <gtirloni> updated slapd to version 2.4.47 on serpens (T217280)

Change 495490 abandoned by GTirloni:
openldap: Use newer slapd from stretch-backports

https://gerrit.wikimedia.org/r/495490

Change 495861 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openldap: Use newer slapd from stretch-backports

https://gerrit.wikimedia.org/r/495861

Mentioned in SAL (#wikimedia-operations) [2019-03-12T10:18:58Z] <gtirloni> updated slapd to version 2.4.47 on seaborgium (T217280)

Change 496800 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] ldap: Increase nscd cache size

https://gerrit.wikimedia.org/r/496800

Change 496800 merged by GTirloni:
[operations/puppet@production] ldap: Increase nscd cache size

https://gerrit.wikimedia.org/r/496800

Mentioned in SAL (#wikimedia-cloud) [2019-03-15T16:00:38Z] <gtirloni> increased nscd cache size (T217280)

Change 496812 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] ldap: Increase nscd max file size

https://gerrit.wikimedia.org/r/496812

Change 496812 merged by GTirloni:
[operations/puppet@production] ldap: Increase nscd max file size

https://gerrit.wikimedia.org/r/496812

Mentioned in SAL (#wikimedia-cloud) [2019-03-15T21:08:39Z] <bstorm_> cleared error state on several queues T217280

Change 496991 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] ldap: disable group member list expansion on Stretch clients

https://gerrit.wikimedia.org/r/496991

I'm not seeing any meaningful impact on LDAP operations from our latest changes.

Screenshot from 2019-03-18 11-20-00.png (262×848 px, 61 KB)

Sorry, I lost track of this ticket after March 7th, but it seems it has had ample activity.

It is less of an issue today. Notably, the Jenkins job that updates the beta cluster is no longer failing, or at least not as often as it used to (that was duplicate task T217539).

I still have some group failures from time to time, e.g. on SSH login just now:

groups: cannot find name for group ID 50062
groups: cannot find name for group ID 50380

But overall that has largely improved!

Change 495861 abandoned by GTirloni:
openldap: Use newer slapd from stretch-backports

Reason:
This will need to be better coordinated among all LDAP deployments

https://gerrit.wikimedia.org/r/495861

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T21:26:00Z] <bstorm_> T217280 cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T21:51:04Z] <bstorm_> T217280 rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T21:53:31Z] <bstorm_> T217280 rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T23:28:12Z] <bstorm_> T217280 depooled, reloaded and repooled tools-sgeexec-0938

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T00:39:39Z] <bstorm_> T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0902

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T02:09:10Z] <bstorm_> T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0924

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T02:31:04Z] <bstorm_> T217280 depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening

Since I cannot tell whether these tools nodes are affected by NFS, LDAP or both, I'm still recording them here. Honestly, the effects the tools nodes are seeing (besides LDAP failures and login issues) are similar to T124133 on the NFS side.

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T03:09:55Z] <bstorm_> T217280 depooled and rebooted 15 other nodes. Entire stretch grid is in a good state for now.

After Toolforge Stretch was moved away from seaborgium/serpens, these two servers stopped exhibiting the memory leak symptoms:

Screenshot from 2019-03-23 14-26-41.png (265×874 px, 23 KB)

Screenshot from 2019-03-23 14-27-21.png (265×871 px, 30 KB)

And the new LDAP replicas (T46722) started exhibiting it:

Screenshot from 2019-03-23 14-28-43.png (265×887 px, 21 KB)

Screenshot from 2019-03-23 14-28-54.png (271×885 px, 21 KB)

The improvement so far seems to be that, because the traffic is being distributed by LVS to the 2 replicas instead of hitting a single LDAP server like before, the slapd restarts are happening at half the previous rate (from every ~12 hours to every ~24 hours now).

Moving to the new LDAP replicas caused traffic on seaborgium/serpens to drop by roughly half:

Screenshot from 2019-03-23 14-34-30.png (263×869 px, 79 KB)

Screenshot from 2019-03-23 14-34-55.png (260×866 px, 55 KB)

The thread peaks are gone too:

Screenshot from 2019-03-23 14-35-32.png (260×866 px, 31 KB)

Screenshot from 2019-03-23 14-35-45.png (260×867 px, 38 KB)

This seems to suggest:

  • The type of requests that the Grid does hits a particular code path in slapd that causes the memory leaks; or
  • Production traffic also causes memory leaks, but at a much (much!) smaller rate (traffic dropped by 50%, yet the memory leak situation on seaborgium/serpens seems to have improved much more than that);
  • As confirmed by Andrew, the Grid does a lot of LDAP requests when jobs are starting, which explains the almost hourly thread count peaks (when users are more likely to schedule cron jobs).

> ...; or
> Production traffic also causes memory leaks

Or non-production, non-Toolforge traffic :)

> Most active talkers to LDAP (in bytes, ~10min packet capture):
>
>   1. 68M deployment-deploy01.deployment-prep.eqiad.wmflabs
>
> ...

Faidon and Bryan have patched Keyholder to use a different system call when retrieving groups. The calls are now cacheable by nscd, so that machine should no longer show up in the top list.

Just drive-by checking in on UBN!s: is this task still UBN!?

bd808 lowered the priority of this task from Unbreak Now! to High. Apr 1 2019, 10:32 PM

> Just drive-by checking in on UBN!s: is this task still UBN!?

I think we can move it down to "High". It's still an active problem, but T46722: Add two read-only LDAP servers in eqiad gave us a bit more breathing room.

Is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/496991/ still on the table for merging, or do we assume it won't really fix anything?

I have lost track of all the efforts made, but roughly the summary would be:

  • upgrade of the LDAP servers to Stretch
  • tweaks to the nscd caches
  • raised the group lookup TTL from 60 to 300 seconds
  • two new LDAP replicas added
  • Toolforge Stretch moved away from seaborgium/serpens (apparently to the replicas)

seaborgium has shown much less CPU usage and memory leaking since 03/22 ~17:00 UTC, which matches the Tools Grid being moved to the replicas; that in turn seems to have solved the timeout issues for other WMCS instances.

@bd808's patch to disable group member list expansion on Stretch clients (https://gerrit.wikimedia.org/r/496991) is most probably worth trying, yes. That should drop the load on the LDAP replicas ldap-eqiad-replica01 and ldap-eqiad-replica02. It seems that is also tracked in subtask T217838.

I should abandon that, I think. We tried it with a cherry-pick and it really didn't seem to change things. The "real" fix we are working towards is changing the LDAP NSS integration to use sssd instead of nslcd. That has been partially done in Toolforge but is blocked for complete switchover by T221225: sssd integration needs to be updated to include sudo config from LDAP support, which @aborrero is still working on.
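For anyone following along, the nslcd-vs-sssd switch is visible in a client's NSS configuration; a sketch of what to look at (the exact module lists on Toolforge hosts may differ):

# nslcd-based hosts resolve users/groups through the "ldap" NSS module,
# sssd-based hosts through "sss" (with sssd's own caching in front of LDAP).
grep -E '^(passwd|group):' /etc/nsswitch.conf

# On an sssd host, the daemon itself can be inspected with:
systemctl status sssd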

Change 496991 abandoned by BryanDavis:
ldap: disable group member list expansion on Stretch clients

Reason:
sssd is a much better fix.

https://gerrit.wikimedia.org/r/496991

aborrero claimed this task.

It seems our long-term plan is to introduce sssd to all VMs in Cloud VPS. There is still some stuff to solve, like integration with Toolforge Kubernetes (the current Debian Jessie workers), but we will probably just wait until we have a new Kubernetes cluster in Toolforge based on a modern k8s version (and on Debian Stretch).

Closing the task now. Anybody, feel free to reopen if required.

+1 on the resolution. I haven't encountered any timeouts since the sprint of action in March 2019. Thank you!