
LDAP server running out of memory frequently and disrupting Cloud VPS clients
Closed, ResolvedPublic

Description

The rate of memory leaking from T130593: investigate slapd memory leak seems to have increased to the point where the band-aid solution of cron'd restarts, implemented two years ago, is now causing noticeable client disruption.
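For context, the band-aid is essentially a scheduled restart; a minimal sketch of what such a cron'd check could look like (hypothetical threshold and schedule; the real Puppet-managed job may differ):

#!/bin/sh
# Hypothetical illustration only: restart slapd from cron before its resident
# memory exhausts the host. Threshold and schedule are assumptions.
RSS_KB=$(ps -o rss= -C slapd | awk '{s += $1} END {print s + 0}')
THRESHOLD_KB=8000000   # ~8 GB, an assumed value
if [ "$RSS_KB" -gt "$THRESHOLD_KB" ]; then
    systemctl restart slapd
fi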


original bug report

I maintain the dexbot tool in Toolforge, and it contains lots of tools the community depends on. I received reports that they stopped working and that all of my services return 503 (for example: https://tools.wmflabs.org/dexbot/d.php). When I try to become dexbot to do the classic "turning the webservice off and on again", I get this:

ladsgroup@tools-sgebastion-07:~$ become dexbot
groups: cannot find name for group ID 50062

I was able to become the tool, though.
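The error means the GID-to-name lookup through NSS/LDAP failed at that moment; a quick way to check whether the lookup currently works on a bastion, using standard glibc tooling (nothing Toolforge-specific):

# Resolve the GID through the NSS stack (nscd/nslcd -> LDAP); empty output or
# a long delay indicates the same failure the `groups` error above reports.
getent group 50062

# Show the current user's group memberships, with names where resolvable.
id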

Related Objects

Event Timeline


Two new root emails for jobs that did not launch because of LDAP hiccups, and two more queues in a disabled state as a result:

scheduling info:            queue instance "task@tools-sgeexec-0925.tools.eqiad.wmflabs" dropped because it is temporarily not available
                            queue instance "task@tools-sgeexec-0909.tools.eqiad.wmflabs" dropped because it is temporarily not available
bd808 renamed this task from "groups: cannot find name for group ID" to "LDAP server running out of memory frequently and disrupting Cloud VPS clients". Mar 7 2019, 5:56 AM
bd808 updated the task description.

The issue is noticeable on the Jenkins job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ since it runs every 10 minutes or so. The failures are all lookups of GID 50062, which is the project-bastion group and is fairly large. I would assume the LDAP query times out somehow.

For the failures, on contint1001.wikimedia.org:

grep -l 'cannot find name for group' /srv/jenkins/builds/beta-scap-eqiad/*/log|xargs -n1 ls -la|awk '{ print $6 " " $7 " " $8 }'
Feb 28 11:01
Feb 28 11:50
Feb 28 21:30
Mar 1 00:02
Mar 1 11:05
Mar 1 23:25
Mar 2 01:17
Mar 2 07:30
Mar 2 09:15
Mar 2 15:42
Mar 2 23:07
Mar 3 06:55
Mar 3 07:15
Mar 3 07:35
Mar 3 16:20
Mar 3 17:20
Mar 4 06:05
Mar 4 11:45
Mar 4 12:05
Mar 4 17:30
Mar 4 17:45
Mar 5 04:45
Mar 5 21:31
Mar 6 11:18
Mar 6 11:58
Mar 6 15:36
Mar 6 16:48
Mar 6 17:31
Mar 6 18:50
Mar 7 06:00
Mar 7 06:25
Mar 7 07:47

So we still had group/LDAP lookup failures after the slapd upgrade and the vCPU increase.

The job runs on the WMCS instance deployment-deploy01.deployment-prep.eqiad.wmflabs. nslcd has connections established to both LDAP servers:

deployment-deploy01$ sudo netstat --wide -tp|grep :ldap
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:57762 serpens.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:50030 serpens.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:39256 seaborgium.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:55294 serpens.wikimedia.org:ldap ESTABLISHED 565/nslcd           
tcp        0      0 deployment-deploy01.deployment-prep.eqiad.wmflabs:34932 seaborgium.wikimedia.org:ldap ESTABLISHED 565/nslcd

I see ldap:// connections timing out for both servers in /var/log/daemon.log.
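To get a feel for where the time goes, one can compare the NSS path (through nscd/nslcd) with a direct query against the servers; a rough sketch, assuming anonymous reads of the groups subtree are allowed (hostnames taken from the netstat output above):

# Time the lookup through the NSS stack (nslcd, possibly the nscd cache).
time getent group 50062

# Time the same lookup straight against one LDAP server, bypassing nslcd
# entirely (anonymous simple bind assumed).
time ldapsearch -x -LLL -H ldap://serpens.wikimedia.org \
    -b ou=groups,dc=wikimedia,dc=org '(gidNumber=50062)' cn gidNumber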

I had a closer look at the Grafana eqiad OpenLDAP threads graph. The metrics come from a Prometheus exporter, which exposes the metrics listed at https://github.com/jcollie/openldap_exporter#example-output

I could not find any upstream documentation about those metrics and ended up looking at the source code in servers/slapd/back-monitor/thread.c. The descriptions are not that helpful, but still:

| Threads metric | Description from source | Value | Note |
| -------------- | ------------------------ | ----- | ---- |
| Active | Number of active threads | 16 | Always 16! |
| Backload | Number of active plus pending threads | 363 | e.g. 347 pending + 16 active |
| Max | Maximum number of threads as configured | 16 | |
| Max Pending | Maximum number of pending threads | n/a | Not captured |
| Open | Number of open threads | 16 | Always 16 |
| Pending | Number of pending threads | 346 | Fluctuates and seems way too large |
| Starting | Number of threads being started | 0 | |
| State | Thread pool state | n/a | Not captured |
| Runqueue | Queue of running threads, besides those handling operations | n/a | Not captured |
| Tasklist | List of running plus standby threads, besides those handling operations | n/a | Not captured |

The Prometheus exporter does not capture Max Pending, State, Runqueue, or Tasklist; it might be worth filing a task upstream and proposing a patch for those.

Regardless: we have a pool of at most 16 threads, and they are always active (Active is always 16). The Pending count is rather high compared to the pool size (346 / 16 ≈ 21.6).

The monitoring metrics lack any measure of the velocity at which the queue is being processed; maybe we can speculate that processing takes long enough that eventually some requests time out? Notably, requests for project-bastion take a while and might block processing of other requests while clients are busy fetching the large response. I noticed that previously with Keyholder, which retrieves all groups using a syscall that is not cached by nscd (T204681#4598659); it took a good second to retrieve.

So in the worst-case scenario, if each thread takes one second to reply, then with a backlog ratio of ~21 a lot of replies would take more than 10 seconds, which might hit a timeout somewhere.

Anyway, all of the above leads me to believe that we do not process requests fast enough and we should raise the thread pool size. From slapd.conf(5):

> `threads <integer>`
>
> Specify the maximum size of the primary thread pool. The default is `16`; the minimum value is `2`.
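For illustration, in conf-file style configuration a raised pool is a single directive; the value below is a placeholder (the change proposed here derives it from the processor count instead, and on a cn=config deployment the equivalent attribute is olcThreads):

# slapd.conf -- illustrative value only; the compiled-in default is 16
threads 32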

Also from Debian package /usr/share/slapd/slapd.conf:

# The dbconfig settings are used to generate a DB_CONFIG file the first
# time slapd starts.  They do NOT override existing an existing DB_CONFIG
# file.  You should therefore change these settings in DB_CONFIG directly
# or remove DB_CONFIG and restart slapd for changes to take effect.

# For the Debian package we use 2MB as default but be sure to update this
# value if you have plenty of RAM
dbconfig set_cachesize 0 2097152 0

It might be worth raising that database cache to more than 2 MB, since we do have plenty of RAM. However, I don't have cache hit/miss metrics for it.
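If the backend is BDB/HDB, the missing cache hit/miss numbers can be read with db_stat from the Berkeley DB utilities; a sketch, assuming the default Debian database directory (with the mdb backend this does not apply, as mdb relies on the OS page cache rather than set_cachesize):

# Memory pool (cache) statistics for the slapd database environment:
# pages found in the cache vs. pages read in from disk.
sudo db_stat -m -h /var/lib/ldap | head -n 20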

Mentioned in SAL (#wikimedia-operations) [2019-03-07T11:28:30Z] <gtirloni> updated seaborgium to stretch (T217280)

Change 494911 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openldap: Set thread pool based on processor count

https://gerrit.wikimedia.org/r/494911

@hashar nice research, thank you! I've submitted a change to increase the number of threads. It does seem we have a backlog of threads periodically and this could help with that.

I haven't submitted a change to increase the cache size yet because that would only help with disk reads, and I barely see any disk I/O on these servers (which I take to mean the cache is working okay and slapd doesn't have to go to disk for much). My understanding was also that this cachesize is the BerkeleyDB in-memory cache size. If it's instead a cache that avoids network requests, then my reasoning doesn't apply.

I'm really missing a LDAP latency graph.

Next step is to try Moritz' suggestion and point Toolforge Stretch to a single LDAP server to see how load behaves.

Screenshot from 2019-03-07 09-57-20.png (269×1 px, 73 KB)

This is the impact of the proposed change to the thread count, manually applied to both LDAP servers 11:45-12:50 UTC (and then reverted). It doesn't help with all the peaks, which still need to be investigated.

Change 494922 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] ldap: increase group TTL from 60 to 3600 seconds in labs

https://gerrit.wikimedia.org/r/494922

Most active talkers to LDAP (in bytes, ~10min packet capture):

  1. 68M deployment-deploy01.deployment-prep.eqiad.wmflabs
  2. 61M tools-sgeexec-0914.tools.eqiad.wmflabs
  3. 56M tools-sgeexec-0935.tools.eqiad.wmflabs
  4. 43M soweego-2.soweego.eqiad.wmflabs
  5. 39M tools-sgeexec-0932.tools.eqiad.wmflabs
  6. 30M tools-sgeexec-0905.tools.eqiad.wmflabs
  7. 28M tools-sgeexec-0922.tools.eqiad.wmflabs
  8. 26M tools-sgeexec-0921.tools.eqiad.wmflabs
  9. 26M integration-slave-docker-1034.integration.eqiad.wmflabs
  10. 24M tools-sgeexec-0912.tools.eqiad.wmflabs
  11. 23M tools-sgeexec-0931.tools.eqiad.wmflabs
  12. 22M tools-sgeexec-0927.tools.eqiad.wmflabs
  13. 21M tools-mail-02.tools.eqiad.wmflabs

For perspective, the first Toolforge Trusty server appears at #56 (tools-bastion-02 with 1.2M)
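For reference, a list like the one above can be produced from a short capture by summing TCP payload bytes per peer; a rough sketch (interface name, duration and aggregation are assumptions, and the capture here may have been made differently):

# Capture ~10 minutes of LDAP traffic and sum payload bytes per source
# endpoint (address.port); collapsing ports into per-host totals and
# resolving names is left out of this sketch.
sudo timeout 600 tcpdump -i eth0 -nn -q 'tcp port 389' 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++) if ($i == "tcp") bytes[$3] += $(i + 1) }
         END { for (h in bytes) printf "%12d %s\n", bytes[h], h }' \
  | sort -rn | head -n 15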

I added nslcd debug logging on deployment-deploy01:

/etc/nslcd.conf
log /var/log/nslcd.log debug

And disabled puppet. Will let it run for a while then process the results.

The nslcd debug log dumps each of the fields returned for a given query. The queries can be found by grepping for deref and grouping by the query summary (e.g. a group members lookup, group=50062):

$ grep deref nslcd.log|cut -d\  -f3|sort|uniq -c|sort -rn
1478050 <group=50062>
 470060 <group(all)>
  73374 <group=50380>
  13340 <group=50120>
   4347 <group=1003>
   1836 <group=50174>
    714 <group=53809>
    435 <group="project-deployment-prep">
    160 <group=52308>
    140 <group=53013>
    126 <group=50156>
    119 <group=52033>
     98 <group=50074>
     56 <group=52388>
     44 <group=500>
     20 <group=50258>
     14 <group=53839>
      8 <group=53985>
      7 <group=52803>
      6 <group=603>
      6 <group=51691>
      5 <group=53861>
      4 <group=52633>
      4 <group=52325>
      2 <group="mwdeploy">

I extracted group(all) for a single run:

$ grep 6d8d3c nslcd.log |grep -v deref
nslcd: [6d8d3c] DEBUG: connection from pid=382 uid=497 gid=53807
nslcd: [6d8d3c] <group(all)> DEBUG: myldap_search(base="ou=groups,dc=wikimedia,dc=org", filter="(objectClass=posixGroup)")
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=wikidev,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=wmf,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=ops,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=labsadminbots,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-bastion,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-gerrit,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-openstack,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-wikidata-dev,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-account-creation-assistance,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=project-analytics,ou=groups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): ... 196 more results
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): end of results (206 total)
nslcd: [6d8d3c] <group(all)> DEBUG: myldap_search(base="ou=servicegroups,dc=wikimedia,dc=org", filter="(objectClass=posixGroup)")
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.wmib,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.cluebot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.czarchivebot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.rezabot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.asurabot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.stemmeberettigelse,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.logs,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.admin,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.coibot,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): cn=bots.linkwatcher,ou=servicegroups,dc=wikimedia,dc=org
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): ... 2250 more results
nslcd: [6d8d3c] <group(all)> DEBUG: ldap_result(): end of results (2260 total)

And eventually there are 12370 matches for deref for that single query, which are, hmm, the groups' members. If we are just looking for a list of available groups, we probably do not need all their members?
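For comparison, a query that only needs the list of groups could ask for just the naming attributes and leave the member lists out; a sketch against the same subtrees (anonymous bind assumed), though whether nslcd can be told to issue it in this form is a separate question:

# Fetch only cn and gidNumber for every posixGroup, skipping member/memberUid,
# which is where the bulk of those 12370 dereferenced values comes from.
ldapsearch -x -LLL -H ldap://serpens.wikimedia.org \
    -b ou=groups,dc=wikimedia,dc=org '(objectClass=posixGroup)' cn gidNumber
ldapsearch -x -LLL -H ldap://serpens.wikimedia.org \
    -b ou=servicegroups,dc=wikimedia,dc=org '(objectClass=posixGroup)' cn gidNumber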

Sorry, I mentioned it previously: the LDAP spam from deployment-deploy01 is due to Keyholder / T204681. The fix is pending the release of a Debian package and the switch from Puppet-based deployment to that package.

Change 494911 abandoned by GTirloni:
openldap: Set thread pool based on processor count

Reason:
Since we've identified that Toolforge Stretch is doing way more LDAP requests than Toolforge Trusty, it seems increasing slapd threads is not worth the risk of introducing a new variable into this puzzle. Thanks for all the feedback.

https://gerrit.wikimedia.org/r/494911

Mentioned in SAL (#wikimedia-cloud) [2019-03-08T17:49:27Z] <bd808> Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)

Change 495490 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openldap: Use newer slapd from stretch-backports

https://gerrit.wikimedia.org/r/495490

Change 494922 merged by GTirloni:
[operations/puppet@production] ldap: increase group TTL from 60 to 300 seconds in labs

https://gerrit.wikimedia.org/r/494922

Mentioned in SAL (#wikimedia-operations) [2019-03-10T22:35:47Z] <gtirloni> toolforge stretch: increased nscd group TTL from 60 to 300sec (T217280)

Mentioned in SAL (#wikimedia-cloud) [2019-03-11T00:53:53Z] <bd808> Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280)

The impact of this has been barely noticeable in the LDAP operations graph, which suggests we may indeed have a situation where nscd caching is bypassed in SGE.
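One way to check whether nscd is being consulted at all is to look at its own hit/miss counters on an exec node; a sketch:

# Print nscd configuration and statistics; the "group cache" section shows
# cache hits vs. misses. Low hit counts despite heavy group lookups would
# support the theory that SGE bypasses the nscd cache.
sudo nscd -g | grep -A 20 'group cache'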

Mentioned in SAL (#wikimedia-operations) [2019-03-11T12:23:27Z] <gtirloni> updated slapd to version 2.4.47 on serpens (T217280)

Change 495490 abandoned by GTirloni:
openldap: Use newer slapd from stretch-backports

https://gerrit.wikimedia.org/r/495490

Change 495861 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openldap: Use newer slapd from stretch-backports

https://gerrit.wikimedia.org/r/495861

Mentioned in SAL (#wikimedia-operations) [2019-03-12T10:18:58Z] <gtirloni> updated slapd to version 2.4.47 on seaborgium (T217280)

Change 496800 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] ldap: Increase nscd cache size

https://gerrit.wikimedia.org/r/496800

Change 496800 merged by GTirloni:
[operations/puppet@production] ldap: Increase nscd cache size

https://gerrit.wikimedia.org/r/496800

Mentioned in SAL (#wikimedia-cloud) [2019-03-15T16:00:38Z] <gtirloni> increased nscd cache size (T217280)

Change 496812 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] ldap: Increase nscd max file size

https://gerrit.wikimedia.org/r/496812

Change 496812 merged by GTirloni:
[operations/puppet@production] ldap: Increase nscd max file size

https://gerrit.wikimedia.org/r/496812

Mentioned in SAL (#wikimedia-cloud) [2019-03-15T21:08:39Z] <bstorm_> cleared error state on several queues T217280

Change 496991 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] ldap: disable group member list expansion on Stretch clients

https://gerrit.wikimedia.org/r/496991

I'm not seeing any meaningful impact on LDAP operations from our latest changes.

Screenshot from 2019-03-18 11-20-00.png (262×848 px, 61 KB)

Sorry, I lost track of this ticket after March 7th, but it seems it has had ample activity.

It is less of an issue today. Notably, the Jenkins job that updates the beta cluster is no longer failing, or at least not as often as it used to (that was duplicate task T217539).

I still have some group failures from time to time, e.g. on SSH login just now:

groups: cannot find name for group ID 50062
groups: cannot find name for group ID 50380

But overall that has largely improved!

Change 495861 abandoned by GTirloni:
openldap: Use newer slapd from stretch-backports

Reason:
This will need to be better coordinated among all LDAP deployments

https://gerrit.wikimedia.org/r/495861

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T21:26:00Z] <bstorm_> T217280 cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T21:51:04Z] <bstorm_> T217280 rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T21:53:31Z] <bstorm_> T217280 rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T23:28:12Z] <bstorm_> T217280 depooled, reloaded and repooled tools-sgeexec-0938

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T00:39:39Z] <bstorm_> T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0902

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T02:09:10Z] <bstorm_> T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0924

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T02:31:04Z] <bstorm_> T217280 depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening

Since I cannot tell whether these tools nodes are affected by NFS, LDAP or both, I'm still recording them here. Honestly, the effects the tools nodes are seeing (besides LDAP failures and login issues) are similar to T124133 on the NFS side.

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T03:09:55Z] <bstorm_> T217280 depooled and rebooted 15 other nodes. Entire stretch grid is in a good state for now.

After Toolforge Stretch was moved away from seaborgium/serpens, these two servers stopped exhibiting the memory leak symptoms:

Screenshot from 2019-03-23 14-26-41.png (265×874 px, 23 KB)

Screenshot from 2019-03-23 14-27-21.png (265×871 px, 30 KB)

And the new LDAP replicas (T46722) started exhibiting it:

Screenshot from 2019-03-23 14-28-43.png (265×887 px, 21 KB)

Screenshot from 2019-03-23 14-28-54.png (271×885 px, 21 KB)

The improvement so far seems to be that, because the traffic is being distributed by LVS to the 2 replicas instead of hitting a single LDAP server like before, the slapd restarts are happening at half the previous rate (from every ~12 hours to every ~24 hours now).

Moving to the new LDAP replicas caused traffic on seaborgium/serpens to drop by roughly half:

Screenshot from 2019-03-23 14-34-30.png (263×869 px, 79 KB)

Screenshot from 2019-03-23 14-34-55.png (260×866 px, 55 KB)

The thread peaks are gone too:

Screenshot from 2019-03-23 14-35-32.png (260×866 px, 31 KB)

Screenshot from 2019-03-23 14-35-45.png (260×867 px, 38 KB)

This seems to suggest:

  • The type of requests that the Grid does hits a particular code path in slapd that causes the memory leaks; or
  • Production traffic also causes memory leaks, but at a much (much!) smaller rate (traffic dropped by 50%, yet the memory leak situation on seaborgium/serpens seems to have improved much more than that);
  • As confirmed by Andrew, the Grid does a lot of LDAP requests when jobs are starting, which explains the almost hourly thread count peaks (when users are more likely to schedule cron jobs).

> ...; or
> Production traffic also causes memory leaks

Or non-production, non-Toolforge traffic :)

> Most active talkers to LDAP (in bytes, ~10min packet capture):
>
>   1. 68M deployment-deploy01.deployment-prep.eqiad.wmflabs
>
> ...

Faidon and Bryan have patched Keyholder to use a different system call when retrieving groups. The calls are now cacheable by nscd, so that machine should no longer show up in the top list.

Just drive-by checking in on UBN!s: is this task still UBN!?

bd808 lowered the priority of this task from Unbreak Now! to High. Apr 1 2019, 10:32 PM

> Just drive-by checking in on UBN!s: is this task still UBN!?

I think we can move it down to "High". It's still an active problem, but T46722: Add two read-only LDAP servers in eqiad gave us a bit more breathing room.

Is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/496991/ still on the table for merging, or do we assume it won't really fix anything?

I have lost track of all the efforts made, but roughly the summary would be:

  • upgrade of the LDAP servers to Stretch
  • tweaks to the nscd caches
  • raised the group lookup TTL from 60 to 300 seconds
  • two new LDAP replicas added
  • Toolforge Stretch moved away from seaborgium/serpens (apparently to the replicas)

seaborgium has shown much less CPU usage and memory leaking since 03/22 ~17:00 UTC, which matches the Tools Grid being moved to the replicas; that in turn seems to have solved the timeout issues for other WMCS instances.

@bd808's patch to disable group member list expansion on Stretch clients (https://gerrit.wikimedia.org/r/496991) is most probably worth trying, yes. That should drop the load on the LDAP replicas ldap-eqiad-replica01 and ldap-eqiad-replica02. It seems that is also tracked in subtask T217838.

I should abandon that, I think. We tried it with a cherry-pick and it really didn't seem to change things. The "real" fix we are working towards is changing the LDAP NSS integration to use sssd instead of nslcd. That has been partially done in Toolforge but is blocked for complete switchover by T221225: sssd integration needs to be updated to include sudo config from LDAP support, which @aborrero is still working on.
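For anyone following along, the nslcd-vs-sssd switch is visible in a client's NSS configuration; a sketch of what to look at (the exact module lists on Toolforge hosts may differ):

# nslcd-based hosts resolve users/groups through the "ldap" NSS module,
# sssd-based hosts through "sss" (with sssd's own caching in front of LDAP).
grep -E '^(passwd|group):' /etc/nsswitch.conf

# On an sssd host, the daemon itself can be inspected with:
systemctl status sssd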

Change 496991 abandoned by BryanDavis:
ldap: disable group member list expansion on Stretch clients

Reason:
sssd is a much better fix.

https://gerrit.wikimedia.org/r/496991

aborrero claimed this task.

It seems our long-term plan is to introduce sssd to all VMs in Cloud VPS. There is still some stuff to solve, like integration with Toolforge Kubernetes (the current Debian Jessie workers), but we will probably just wait until we have a new Kubernetes cluster in Toolforge based on a modern k8s version (and on Debian Stretch).

Closing the task now. Anybody, feel free to reopen if required.

+1 on the resolution. I haven't encountered any timeouts since the sprint of action in March 2019. Thank you!