Page MenuHomePhabricator

Keyholder retrieves all groups on each connection which is slow with LDAP
Closed, ResolvedPublic

Description

From integration-cumin.integration.eqiad.wmflabs, cumin takes a while to execute even the most simple command. The labs project has 35 instances, a little try with the date command:

$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:38:33 UTC 2018
35 hosts will be targeted:
<snip>
real	0m36.435s
user	0m1.056s
sys	0m0.328s

Some hosts replies after just 6 seconds, but most takes 30+ seconds. To the best of my knowledges they are jessie hosts.

Stopping nslcd resolves that, from T204681#4593029:

sudo systemctl stop nslcd
$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:59:15 UTC 2018
35 hosts will be targeted:
<snip>
real	0m2.889s
user	0m1.024s
sys	0m0.332s

I track it to keyholder ssh-agent-proxy, upon connection by a user, the proxy gets the user/group and retrieve all groups with grp.getgrall(), that requests all groups from the labs LDAP and it takes a while.

modules/keyholder/files/ssh-agent-proxy
class SshAgentProxyHandler(socketserver.BaseRequestHandler):
    """This class is responsible for handling an individual connection
    to an SshAgentProxyServer."""

    def get_peer_credentials(self, sock):
        """Return the user and group name of the peer of a UNIX socket."""
        ucred = sock.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, s_ucred.size)
        _, uid, gid = s_ucred.unpack(ucred)
        user = pwd.getpwuid(uid).pw_name
        groups = {grp.getgrgid(gid).gr_name}
        groups.update(g.gr_name for g in grp.getgrall() if user in g.gr_mem)
        return user, groups

On labs:

time python3 -c 'import grp; grp.getgrall()'

real	0m0.939s
user	0m0.040s
sys	0m0.004s

The multiple connection requests to the keyholder end up saturating the local nslcd. The ssh agent proxy thus takes a while to relay them.

Event Timeline

Tailling auth.log from integration-publishing instance while running cumin (from 10.68.18.238):

# Cumin started at Tue Sep 18 10:47:53 UTC 2018

Sep 18 10:47:55 integration-publishing sshd[24623]: Set /proc/self/oom_score_adj to 0
Sep 18 10:47:55 integration-publishing sshd[24623]: Connection from 10.68.18.238 port 59384 on 10.68.23.254 port 22

Sep 18 10:48:00 integration-publishing sshd[24623]: Postponed publickey for root from 10.68.18.238 port 59384 ssh2 [preauth]

Sep 18 10:48:29 integration-publishing sshd[24623]: Accepted publickey for root from 10.68.18.238 port 59384 ssh2: ED25519 xx
Sep 18 10:48:29 integration-publishing sshd[24623]: pam_unix(sshd:session): session opened for user root by (uid=0)
Sep 18 10:48:29 integration-publishing sshd[24623]: Starting session: command for root from 10.68.18.238 port 59384
Sep 18 10:48:29 integration-publishing sshd[24623]: Received disconnect from 10.68.18.238: 11: disconnected by user
Sep 18 10:48:29 integration-publishing sshd[24623]: pam_unix(sshd:session): session closed for user root

Though if I select just that host with 'name:integration-publishing' it is faster:

# Cumin started at Tue Sep 18 10:50:04 UTC 2018

Sep 18 10:50:06 integration-publishing sshd[24652]: Connection from 10.68.18.238 port 59452 on 10.68.23.254 port 22
Sep 18 10:50:07 integration-publishing sshd[24652]: Postponed publickey for root from 10.68.18.238 port 59452 ssh2 [preauth]
Sep 18 10:50:08 integration-publishing sshd[24652]: Accepted publickey for root from 10.68.18.238 port 59452 ssh2: ED25519 xx
Sep 18 10:50:08 integration-publishing sshd[24652]: pam_unix(sshd:session): session opened for user root by (uid=0)
Sep 18 10:50:08 integration-publishing sshd[24652]: Starting session: command for root from 10.68.18.238 port 59452
Sep 18 10:50:08 integration-publishing sshd[24652]: Received disconnect from 10.68.18.238: 11: disconnected by user
Sep 18 10:50:08 integration-publishing sshd[24652]: pam_unix(sshd:session): session closed for user root

Not sure while it would take longer for sshd to accept the publickey, specially knowing it is directly on the target hosts under /etc/ssh/userkeys/root.d/cumin.

What I noted though, is when cumin is running, there is a CPU spike caused by /usr/sbin/nslcd

Watching nslcd in debug

sudo su -
systemctl stop nslcd
nslcd --debug

That shows request(s?) for <group(all)> which dump every single groups or service-groups.

I then killed nslcd and tried again cumin on all hosts. It takes just a few seconds:

$ date; time sudo cumin --debug --force '*' 'date'
Tue Sep 18 10:59:15 UTC 2018
35 hosts will be targeted:
castor02.integration.eqiad.wmflabs,integration-cumin.integration.eqiad.wmflabs,integration-publishing.integration.eqiad.wmflabs,integration-puppetmaster01.integration.eqiad.wmflabs,integration-r-lang-01.integration.eqiad.wmflabs,integration-slave-docker-[1005-1015,1017,1021-1024,1030-1031,1033-1034].integration.eqiad.wmflabs,integration-slave-jessie-[1001-1004].integration.eqiad.wmflabs,integration-slave-jessie-android.integration.eqiad.wmflabs,jenkinstest.integration.eqiad.wmflabs,saucelabs-[01-03].integration.eqiad.wmflabs,webperformance.integration.eqiad.wmflabs
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                                                                                              
(19) integration-publishing.integration.eqiad.wmflabs,integration-puppetmaster01.integration.eqiad.wmflabs,integration-slave-docker-[1005-1007,1014-1015,1017,1021-1024,1030-1031,1033-1034].integration.eqiad.wmflabs,integration-slave-jessie-android.integration.eqiad.wmflabs,saucelabs-[02-03].integration.eqiad.wmflabs
----- OUTPUT of 'date' -----                                                                                                                        
Tue Sep 18 10:59:17 UTC 2018                                                                                                                        
===== NODE GROUP =====                                                                                                                              
(16) castor02.integration.eqiad.wmflabs,integration-cumin.integration.eqiad.wmflabs,integration-r-lang-01.integration.eqiad.wmflabs,integration-slave-docker-[1008-1013].integration.eqiad.wmflabs,integration-slave-jessie-[1001-1004].integration.eqiad.wmflabs,jenkinstest.integration.eqiad.wmflabs,saucelabs-01.integration.eqiad.wmflabs,webperformance.integration.eqiad.wmflabs
----- OUTPUT of 'date' -----                                                                                                                        
Tue Sep 18 10:59:18 UTC 2018                                                                                                                        
================                                                                                                                                    
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████| 100% (35/35) [00:01<00:00,  5.58hosts/s]     
FAIL:  |                                                                                                      |   0% (0/35) [00:01<?, ?hosts/s]     
100.0% (35/35) success ratio (>= 100.0% threshold) for command: 'date'.
100.0% (35/35) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

real	0m2.889s
user	0m1.024s
sys	0m0.332s
hashar renamed this task from [cumin] cumin is quite slow on labs project to [cumin] on labs project, nslcd causes cumin to be very slow.Sep 18 2018, 11:01 AM
hashar updated the task description. (Show Details)

This has nothing to do with cumin itself, just how SSH works in the WMCS environment:

root@integration-cumin:~# time SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l root integration-publishing.integration.eqiad.wmflabs "date"
Tue Sep 18 11:01:13 UTC 2018

real	0m1.320s
user	0m0.016s
sys	0m0.004s

And this is because of the sshd configuration that contacts first LDAP. From man sshd_config:

If a key supplied by AuthorizedKeysCommand does not successfully authenticate and authorize the user then public key authentication continues using the usual AuthorizedKeysFile files.

And from /etc/ssh/sshd_config:

AuthorizedKeysFile      /etc/ssh/userkeys/%u /etc/ssh/userkeys/%u.d/cumin
AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup
AuthorizedKeysCommandUser ssh-key-ldap-lookup

I'm marking it as invalid for cumin, but feel free to re-open it as a WMCS task ;)

Other things to keep in mind is that if you run sudo for the first time in a WMCS instance that's usually pretty slow and adds additional seconds before you get the root shell.

AuthorizedKeysCommand is for sshd so that would indeed trigger a run of /usr/sbin/ssh-key-ldap-lookup on each of the client. But the nslcd CPU usage is on the cumin instance integration-cumin, will dig in to figure out what is causing all those ldap queries.

Edit: the ssh command you mentioned does trigger the nslcd ldap lookup:

sudo SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l root integration-publishing.integration.eqiad.wmflabs "date"
hashar added a subscriber: thcipriani.

That is from keyholder apparently. At some point it retrieves all groups with grp.getgrall(), that requests everything from LDAP.

modules/keyholder/files/ssh-agent-proxy
class SshAgentProxyHandler(socketserver.BaseRequestHandler):
    """This class is responsible for handling an individual connection
    to an SshAgentProxyServer."""

    def get_peer_credentials(self, sock):
        """Return the user and group name of the peer of a UNIX socket."""
        ucred = sock.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, s_ucred.size)
        _, uid, gid = s_ucred.unpack(ucred)
        user = pwd.getpwuid(uid).pw_name
        groups = {grp.getgrgid(gid).gr_name}
        groups.update(g.gr_name for g in grp.getgrall() if user in g.gr_mem)
        return user, groups
hashar renamed this task from [cumin] on labs project, nslcd causes cumin to be very slow to Keyholder retrives all groups on each connection which is slow with LDAP.Sep 19 2018, 5:13 PM
hashar renamed this task from Keyholder retrives all groups on each connection which is slow with LDAP to Keyholder retrieves all groups on each connection which is slow with LDAP.
hashar updated the task description. (Show Details)

TLDR: Python grp.getgrall uses POSIX getgrent which in libc has not been made aware of nscd. Hence there is no cache and a LDAP query is made each time....

Long story

Eventually I have looked at cpython 3.4 code. The implementation for
grp.getgrall is in C in a function named grp_getgrall_impl:

Modules/grpmodule.c
/*[clinic input]
grp.getgrall
Return a list of all available group entries, in arbitrary order.
An entry whose name starts with '+' or '-' represents an instruction
to use YP/NIS and may not be accessible via getgrnam or getgrgid.
[clinic start generated code]*/

static PyObject *
grp_getgrall_impl(PyObject *module)
/*[clinic end generated code: output=585dad35e2e763d7 input=d7df76c825c367df]*/
{
    PyObject *d;
    struct group *p;

    if ((d = PyList_New(0)) == NULL)
        return NULL;
    setgrent();
    while ((p = getgrent()) != NULL) {
        PyObject *v = mkgrent(p);
        if (v == NULL || PyList_Append(d, v) != 0) {
            Py_XDECREF(v);
            Py_DECREF(d);
            endgrent();
            return NULL;
        }
        Py_DECREF(v);
    }
    endgrent();
    return d;
}

It uses POSIX getgrent(). A decade or so ago, someone enquired about having nscd to cache getpwent() and getgrent(), a reply by Nicholas Miell on http://libc-alpha.sourceware.narkive.com/ls4gDv6H/nscd-caching-getpwent-and-getgrent#post4

Subject: Re: nscd caching getpwent() and getgrent()

On Thu, 03 Aug 2006 09:11:30 +0200, Petter Reinholdtsen wrote:

> [Ulrich Drepper]
>> Don't bother.  That is one of the most stupid ideas I've read in a
>> while and it won't be added.
> 
> Can you explain why?  Caching a bit more in nscd do not appear like
> one of the stupid ideas I've read, so I wonder why it appear so stupid
> to you.

If I had to guess, I'd say that getpwent() and getgrent() are terribly
designed interfaces that aren't threadsafe, aren't generally usable for
systems with more than a handful of accounts accessed via nss_files, and
are only included because POSIX requires them and ugly old code expects
them to exist. Furthermore, they aren't ever (or certainly shouldn't be)
used in performance critical situations.

Adding caching support to nscd increases complexity with
basically no gain, especially considering that nscd is designed for fast
random read-only access to specific datums, and not writable linear
traversals over an entire dataset.

For the kinds of things that getpwent() and getgrent() are used for,
you're better off using your directory service's native interface (LDAP,
Kerberos, BigTable?, etc.) or something generic like libuser (assuming
that has a sensible interface).

And there is some code about using getgrent_r , though that is not POSIX / is a GNU extension. https://github.com/dun/munge/issues/2#issuecomment-102537422

Change 476424 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/software/keyholder@master] Use os.getgrouplist() for listing a user's groups

https://gerrit.wikimedia.org/r/476424

Change 476424 merged by jenkins-bot:
[operations/software/keyholder@master] Use os.getgrouplist() for listing a user's groups

https://gerrit.wikimedia.org/r/476424

Considering the issues with LDAP, it would be really nice to have this new version of keyholder deployed.

Change 496823 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] keyholder: use os.getgrouplist() for listing a user's groups

https://gerrit.wikimedia.org/r/496823

Change 496823 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] keyholder: use os.getgrouplist() for listing a user's groups

https://gerrit.wikimedia.org/r/496823

hashar claimed this task.
hashar added subscribers: faidon, bd808.

After my lengthy root cause tracking summarized at T204681#4603457, the solution is to migrate from grp.getrall() to os.getgrouplist(). That got fixed via:

WhatWhoPatch
Debian package@faidonhttps://gerrit.wikimedia.org/r/476424
Puppet version@bd808https://gerrit.wikimedia.org/r/496823 (backport of above)

Thank you both, and indeed cumin is way faster now.

Also from T217838, nscd group cache TTL has been bumped from 60 to 300 seconds for positive matches.