We agreed on testing how sssd works with our LDAP servers. This is especially interesting given the current problems with load and memory leaks in the servers.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | yuvipanda | T130446 Unable to SSH onto tools-login.wmflabs.org
Resolved | | akosiaris | T130593 investigate slapd memory leak
Resolved | | aborrero | T217280 LDAP server running out of memory frequently and disrupting Cloud VPS clients
Resolved | | aborrero | T218126 LDAP: try how sssd works with our servers
Mentioned in SAL (#wikimedia-cloud) [2019-03-15T10:23:32Z] <arturo> create VM arturo-bastion-sssd-test (T218126)
I started with this basic config:
```
[sssd]
domains = wikimedia_ldap
services = nss, pam
config_file_version = 2

[nss]
filter_groups = root
filter_users = root

[pam]

[domain/wikimedia_ldap]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap-labs.eqiad.wikimedia.org
ldap_search_base = ou=people,dc=wikimedia,dc=org
ldap_tls_reqcert = demand
cache_credentials = true
min_id = 10000
max_id = 20000
enumerate = False
ldap_id_use_start_tls = False
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_schema = rfc2307bis
ldap_auth_disable_tls_never_use_in_production = true
use_fully_qualified_names = True
```
also, in nsswitch.conf:
```
passwd: files sss
group:  files sss
shadow: files sss
```
and removing nscd/nslcd from the testing system. The setup doesn't work yet though.
Ok, I have a working setup.
Install required packages:
```
apt-get install libpam-sss libnss-sss sssd
```
Purge nscd, nslcd:
```
apt-get purge nscd nslcd
```
Configure stuff:
In `/etc/sssd/sssd.conf`:

```
[sssd]
domains = wikimedia.org
default_domain_suffix = wikimedia.org
services = nss, pam, ssh
config_file_version = 2

[nss]
filter_groups = root
filter_users = root

[pam]

[ssh]

[domain/wikimedia.org]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap-eqiad-replica01.wikimedia.org
ldap_default_bind_dn = cn=proxyagent,ou=profile,dc=wikimedia,dc=org
ldap_default_authtok = NotSureIfThisIsMeantToBePublic
ldap_search_base = ou=people,dc=wikimedia,dc=org
ldap_group_search_base = ou=groups,dc=wikimedia,dc=org
ldap_tls_reqcert = demand
cache_credentials = False
enumerate = False
ldap_id_use_start_tls = False
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_schema = rfc2307bis
use_fully_qualified_names = True
```
In `/etc/nsswitch.conf`:

```
[...]
passwd: files sss
group:  files sss
shadow: files sss
[...]
```
In `/etc/ssh/sshd_config`:

```
[...]
# disable this
#AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup
#AuthorizedKeysCommandUser ssh-key-ldap-lookup

# write this
AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
AuthorizedKeysCommandUser nobody
[...]
```
You can even drop the /etc/ldap/ldap.conf and /etc/ldap.conf files.
Tests:
```
$ /usr/bin/sss_ssh_authorizedkeys aborrero
ssh-rsa AAAAB3xxxxxx arturo@example.com
```
```
$ id aborrero
uid=18194(aborrero@wikimedia.org) gid=500(wikidev@wikimedia.org) [...] (very long list)
```
```
$ ssh arturo-bastion-sssd-test.eqiad.wmflabs
[...] just works [...]
```
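The checks above can be re-run in one go. This is a hedged convenience sketch, not part of the original setup: the username `aborrero` is the example user from this thread, and `/usr/bin/sss_ssh_authorizedkeys` only exists once the sssd packages are installed, so failing checks just print a marker instead of aborting.

```shell
# Convenience re-run of the sanity checks above for a single user.
# "aborrero" is the example user from this thread; substitute your own.
user="aborrero"
for cmd in "id $user" "getent passwd $user" "/usr/bin/sss_ssh_authorizedkeys $user"; do
    echo "== $cmd"
    $cmd || echo "FAILED: $cmd"
done
```

If any line prints `FAILED`, that particular lookup path (NSS or the SSH key helper) is not wired up yet.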
This setup doesn't have any specific fine-tuning for our environment, though. I disabled caching and enumeration.
If we are interested I could try cooking a puppet patch to integrate all of this. Or do even more testing on stuff.
Is that NotSureIfThisIsMeantToBePublic https://github.com/wikimedia/labs-private/blob/d21fc1a3a126ca0cc0d0b239e67e5bf7aab8384f/hieradata/labs.yaml#L4 by any chance?
You can even drop the /etc/ldap/ldap.conf and /etc/ldap.conf files.
I imagine that would break a few scripts that expect /etc/ldap.conf
$ id aborrero
uid=18194(aborrero@wikimedia.org) gid=500(wikidev@wikimedia.org) [...] (very long list)
I wonder how well this @wikimedia.org suffix stuff will work with keyholder etc. which care about groups.
That bit looks configurable (I haven't found @aborrero's test machine yet to play with it): https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-user-ids
That could be causing trouble with getent etc as well.
That worked:
```
bstorm@arturo-bastion-sssd-test:~$ id
uid=18713(bstorm) gid=500(wikidev) groups=500(wikidev),etc...
```
sudo was broken before I did that. It works now. I suspect it's just because our sudo setup expected a bare username. This just changes display, not the name as sssd understands it.
I don't remember. I'm always really nervous about pasting passwords or other sensitive information into public sites.
Needs discussion: do we want to keep investigating further? Shall I write a puppet patch and try to integrate this into the Toolforge setup?
I think we need to confirm that caching works. My understanding is that nscd/nslcd is working fine, except for caching things.
A good way to test this at the network level is to run `tcpdump -n -i eth0 port 389` on the client machine and do things like open a new SSH session or run sudo repeatedly, then see whether the machine makes requests to LDAP, what they are, etc.
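To make that check a bit more quantitative, the capture can be post-processed. The sketch below assumes default one-line tcpdump output; the sample lines and addresses are made up for illustration (they are not our real LDAP servers), since in practice you would feed it the file written by `tee`.

```shell
# Build a tiny sample capture in the shape of default tcpdump output.
# The addresses below are made-up placeholders for illustration only.
cat > /tmp/ldap-traffic.txt <<'EOF'
10:00:01.000000 IP 172.16.0.5.34567 > 192.0.2.10.389: Flags [S], seq 1, length 0
10:00:01.000100 IP 192.0.2.10.389 > 172.16.0.5.34567: Flags [S.], seq 1, ack 2, length 0
10:00:02.000000 IP 172.16.0.5.34568 > 192.0.2.10.389: Flags [S], seq 5, length 0
EOF

# Total packet lines captured (the raw number compared between stacks):
wc -l < /tmp/ldap-traffic.txt

# Distinct endpoints involved, stripping the trailing port field:
awk '{print $3}' /tmp/ldap-traffic.txt | sed 's/\.[0-9]*$//' | sort -u
```

Comparing the line counts (and, if needed, the distinct peers) before and after a change gives a rough but repeatable measure of LDAP chattiness.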
We would have to define what a good caching policy is for us. @bd808 explicitly mentioned that we aren't interested in credential caching or user enumeration. But caching just usernames/group names may be good. I don't know; I'm not fully aware of the original problem statement.
Do we want to introduce a branch in the manifests for nslcd vs. sssd so toolforge can have a non-standard one? Or should this be considered for all Cloud VPS instances?
After discussion in the team:
Change 498359 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: introduce sssd, replacing nscd/nslcd
Mentioned in SAL (#wikimedia-cloud) [2019-03-22T13:59:03Z] <arturo> create VMs arturo-sgeexec-sssd-test-[12] for testing T218126
I have a patch ready: https://gerrit.wikimedia.org/r/c/operations/puppet/+/498359
This is my proposed plan:
That plan sounds good to me! I'd start by doing before/after tests with starting jobs on a tools-sgeexec node.
@Bstorm for easier testing, I have this idea:
Could you please advise whether this is even possible? I know how to do most steps in this plan myself, except the ones related to the SGE queues, and I will need a bit of help with those.
The problem is that I'm unsure how best to control which node a given job will run on, and we need to measure LDAP traffic before/after on that node.
(partially a note to self) Per the team meeting, I'll be creating a test queue to add a server to so that we can do testing without any impact to toolforge workloads.
I have the queue config file ready @aborrero. When you have a server with a novel name created and the exec node role applied to it, I can add the host and add it to the queue manually. A re-run of the grid-configurator script might remove it, but that requires manual actions (and adding the host and queue back is trivial once I've got the config files in my home dir 😁).
Mentioned in SAL (#wikimedia-cloud) [2019-03-26T17:31:37Z] <arturo> T218126 create VM instances tools-sssd-sgeexec-test-[12]
```
test@tools-sssd-sgeexec-test-1 BI 0/0/50 -NA- -NA- au
---------------------------------------------------------------------------------
test@tools-sssd-sgeexec-test-2 BI 0/0/50 -NA- -NA- au
```
The queues are ready, though the hosts report exec procs offline at the moment
Note: the seqnum on these queues is 3, so they should not be used by anything unless specifically requested.
Change 498359 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] WMCS: introduce sssd, replacing nscd/nslcd
FYI, the queue name is `test`. So submit with `jsub -q test` or similar. If you use `qsub`, you have to specify `-l v_mem=<something>`.
Got a couple of emails in rootspam today similar to this:
```
Job 1191607 caused action: Job 1191607 set to ERROR
User       = tools.random-featured
Queue      = test@tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs
Start Time = <unknown>
End Time   = <unknown>
failed assumedly before job: can't get password entry for user "tools.random-featured". Either user does not exist or error with NIS/LDAP etc.
```
Not sure if this means the queue is open for general usage by mistake?
This email is from a tool I just created:
```
Job 1193713 caused action: Job 1193713 set to ERROR
User       = tools.arturo-test-tool
Queue      = test@tools-sssd-sgeexec-test-2.tools.eqiad.wmflabs
Start Time = <unknown>
End Time   = <unknown>
failed assumedly before job: can't get password entry for user "tools.arturo-test-tool". Either user does not exist or error with NIS/LDAP etc.
```
It seems sssd is failing to provide some info? Perhaps I forgot some config bit, but this is weird because:
```
aborrero@tools-sssd-sgeexec-test-2:~$ getent passwd tools.arturo-test-tool
tools.arturo-test-tool:*:54005:54005:tools.arturo-test-tool:/data/project/arturo-test-tool:/bin/bash
aborrero@tools-sssd-sgeexec-test-2:~$ getent group tools.arturo-test-tool
tools.arturo-test-tool:*:54005:aborrero
```
Definitely, there is some random stuff running in the new queue:
```
Job 1194187 caused action: Job 1194187 set to ERROR
User       = tools.dewikinews-rss
Queue      = test@tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs
Start Time = <unknown>
End Time   = <unknown>
failed assumedly before job: can't get password entry for user "tools.dewikinews-rss". Either user does not exist or error with NIS/LDAP etc.
```
I would try disabling the queue now to avoid random stuff running here.
Mentioned in SAL (#wikimedia-cloud) [2019-03-27T12:15:20Z] <arturo> T218126 aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2' (and 1)
@aborrero: that suggests that those tools are submitting jobs in a way that should really be stopped. I'll look into it. We don't want web queues necessarily running cron jobs (which they would preferentially before the test queues). Basically, a misconfigured job that has no queue requested can run on any queue available. That's why they were running.
I'm going to find out why those jobs ran there. They should not (though grid engine can be kind of stupid unless you specify a requirement is "hard"). It is not open for general usage per se. It's just that a job that is misconfigured can run there. I can try to lock it down more.
Yup, that's why. We have some users who don't specify a queue, which allows the job to run on any queue available. I verified this based on the jobs launched in your comments above. Honestly, this might be a great honeypot to find all of them so we can direct them to add `-q task` to their cron jobs that use qsub. As it is, their jobs will be running on web queues, continuous queues, etc. They also would have been able to use the restricted "single tool queues", by virtue of the fact that we don't have a highly restrictive setup around the queue names without using jsub/jstart or webservice.
If we can tolerate the spam, I'd almost rather leave the queues up so we can email all the Toolforge users who need to add `-q task` to their cron jobs. There likely aren't very many of them. With the queues disabled, it will cut down on the spam. The queues/hosts are configured properly; the jobs that run there are misconfigured.
I have an idea as well, I can set your account or your tool as the owner of the queue. Then it will be less schedulable.
I went ahead and fully locked down the queue. arturo-test-tool now owns it fully. This is exactly how dedicated queues are set up...if anything runs on them now, we have something we should fix about our setup.
I re-enabled them to see if anything runs. If it does, we seriously need to fix those jobs.
Not sure if really important, but there is this puppet agent error in both testing machines:
```
aborrero@tools-sssd-sgeexec-test-1:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1553774615'
Notice: openstack::clientpackages::mitaka::stretch: no special configuration yet
Notice: /Stage[main]/Openstack::Clientpackages::Mitaka::Stretch/Notify[openstack::clientpackages::mitaka::stretch: no special configuration yet]/message: defined 'message' as 'openstack::clientpackages::mitaka::stretch: no special configuration yet'
Notice: The LDAP client stack for this host is: sssd
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd'
Error: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock at /etc/puppet/modules/sonofgridengine/manifests/join.pp:9
Error: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock at /etc/puppet/modules/sonofgridengine/manifests/join.pp:9
Wrapped exception: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock
Error: /Stage[main]/Profile::Toolforge::Grid::Node::Compute::Dedicated/Sonofgridengine::Join[hostgroups-tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs]/File[/hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs]/ensure: change from absent to file failed: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock at /etc/puppet/modules/sonofgridengine/manifests/join.pp:9
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 42.87 seconds
```
Some measurements finally.
TL;DR: sssd apparently generated about 100x less LDAP traffic than the nscd/nslcd stack. More tests and measurements may be appropriate.
servers:
sge queue:
```
aborrero@tools-sgegrid-master:~$ sudo qstat -f
[...]
---------------------------------------------------------------------------------
test@tools-sssd-sgeexec-test-1 BI 0/0/50 0.02 lx-amd64
---------------------------------------------------------------------------------
test@tools-sssd-sgeexec-test-2 BI 0/0/50 0.03 lx-amd64
```
testing tool (a simple script trying to generate some LDAP traffic):
```
#!/bin/bash
echo
echo "#####"
echo
sleep 5
echo "$0 running on $(hostname). PWD = $(pwd)"
echo "Trying to generate some LDAP traffic"
ls -l >/dev/null
sleep 1
ls -l /data/project >/dev/null
sleep 1
getent initgroups tools.arturo-test-tool
getent passwd tools.arturo-test-tool
getent group tools.arturo-test-tool
sleep 1
getent group aborrero
getent passwd aborrero
echo "Sleeping for 1 minute"
sleep 60
echo "Done sleeping for 1 minute"
# done
exit 0
```
launching the tool jobs (4 instances):
```
tools.arturo-test-tool@tools-sgebastion-07:~$ jsub -q test arturo-test-script.sh ; jsub -q test arturo-test-script.sh ; jsub -q test arturo-test-script.sh ; jsub -q test arturo-test-script.sh
```
This is the default config. Hiera config in Horizon (each VM instance):

```
profile::ldap::client::labs::client_stack: classic
```
Ensure no sssd daemon is running. Reboot server to ensure clean nscd/nslcd caches.
After the servers have started and done their first checks against the LDAP server (about 5 to 10 seconds), you can listen for the LDAP traffic generated by the tool:
```
aborrero@tools-sssd-sgeexec-test-1:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-2:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-2.txt
```
Stop tcpdump after the sge jobs are done. This resulted in:
```
aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-1.txt
10135 ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-2.txt
10907 ldap-traffic-2.txt
```
Hiera config in Horizon (each VM instance):

```
profile::ldap::client::labs::client_stack: sssd
```
A complete server reboot was required after deleting nscd/nslcd, because sssd was complaining that the nscd socket was still around even though the software was no longer on the server.
Ensure sssd caches are empty (a complete stop and then a start).
```
aborrero@tools-sssd-sgeexec-test-1:~$ sudo systemctl stop sssd && sudo systemctl start sssd
aborrero@tools-sssd-sgeexec-test-2:~$ sudo systemctl stop sssd && sudo systemctl start sssd
```
After the daemon has started and done its first checks against the LDAP server (about 5 to 10 seconds), you can listen for the LDAP traffic generated by the tool:
```
aborrero@tools-sssd-sgeexec-test-1:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-2:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-2.txt
```
Stop tcpdump after the sge jobs are done. This resulted in:
```
aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-1.txt
41 ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-2.txt
61 ldap-traffic-2.txt
```
The numbers in the nscd/nslcd case are very high compared to those from sssd: higher by 100x or more.
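For reference, plugging the line counts from the two runs into a quick shell calculation (integer arithmetic, counts copied from the transcripts above) supports the "100x or more" estimate:

```shell
# Packet-line counts from the classic (nscd/nslcd) and sssd runs above.
classic=$((10135 + 10907))
sssd=$((41 + 61))
echo "classic=$classic sssd=$sssd ratio=$((classic / sssd))"  # roughly 200x
```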
I don't know why this is, but there are several options:
How could testing be improved? In general I think the measurements I did were very rough, with many loose ends, like: when do I start tcpdump?
A more proper measurement setup would be to:
I seem to have resolved this on one of the test hosts with:
```
root@tools-sssd-sgeexec-test-1:~# mkdir /hostgroups
```
Seems like probably a puppet issue that @Bstorm might want to look at. I don't know if it's worth re-running your tests with those lockfiles working.
This looks great, as far as the test goes. With that volume of difference, errors in testing are less likely to make a big difference. I'll peek at the puppet thing today.
If we trust the numbers and want to move forward, I suggest we choose a type of server (exec nodes?), switch them all, and see what happens.
I'm very curious about why the join puppet class wasn't working (not like it should even be in the general hostgroup...that's a leftover from older code). It was missing the sourcedir param to the class somehow, so it ended up thinking the hostgroup was at /. That makes no sense. Trying to see why.
Found it! I never fixed all of the old code in the dedicated class. Putting up a patch to fix it...just because. I think we should try rolling this out. Exec nodes would be a very big change, but we'd see if it was broken very quickly (and should see a difference on the ldap server). Sounds great to me to start there.
The puppet issue will not affect any of the actually active nodes.
Change 499813 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: fix the dedicated class up a bit
Change 499813 merged by Bstorm:
[operations/puppet@production] gridengine: fix the dedicated class up a bit
Needs discussion in the WMCS team meeting:
result of the meeting:
Change 502519 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] ldap: include sssd cleanup in the classic stack
Change 502519 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] ldap: include sssd cleanup in the classic stack
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:27:47Z] <arturo> T218126 hard reboot tools-sgeexec-0932
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:40:59Z] <arturo> T218126 hard reboot tools-sgeexec-0918
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:56:49Z] <arturo> force deleted job 853139 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:57:40Z] <arturo> force deleted job 853968 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:58:08Z] <arturo> force deleted job 871945 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:01:59Z] <arturo> T218126 hard reboot tools-sgeexec-0907
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:15:43Z] <arturo> force deleted job 1044941 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:16:39Z] <arturo> force deleted job 1045059 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:19:15Z] <arturo> T218126 hard reboot tools-sgeexec-0914
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:26:13Z] <arturo> force deleted job 1739135 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:26:54Z] <arturo> force deleted job 1739159 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:27:03Z] <arturo> T218126 hard reboot tools-sgeexec-0935
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:42:10Z] <arturo> force deleted job 853604 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:43:57Z] <arturo> T218126 hard reboot tools-sgeexec-0915
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:49:14Z] <arturo> T218126 hard reboot tools-sgeexec-0923
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:53:42Z] <arturo> force deleted job 1264656 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:03:31Z] <arturo> T218126 hard reboot tools-sgeexec-0928
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:23:46Z] <arturo> T218126 hard reboot tools-sgeexec-0940
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:42:35Z] <arturo> force deleted job 1262132 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:47:35Z] <arturo> T218126 hard reboot tools-sgeexec-0921
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:55:56Z] <arturo> T218126 hard reboot tools-sgeexec-0924
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:02:16Z] <arturo> force deleted job 1044978 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:06:05Z] <arturo> T218126 hard reboot tools-sgeexec-0901
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:27:26Z] <arturo> T218126 hard reboot tools-sgeexec-0925
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:31:47Z] <arturo> T218126 hard reboot tools-sgeexec-0926
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:42:09Z] <arturo> force deleted job 1045055 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:53:09Z] <arturo> force deleted job 1011036 because it was stucked (trying to depool exec node for T218126)
Mentioned in SAL (#wikimedia-cloud) [2019-04-10T13:06:37Z] <arturo> T218126 hard reboot tools-sgeexec-0906
Status update:
Closing this task as resolved. The testing was completed. I will follow up in T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes and others.