
LDAP: try how sssd works with our servers
Closed, ResolvedPublic

Description

We agreed on testing and trying how sssd works with our LDAP servers.

This is especially interesting given the current problems with load and memory leaks on the servers.

Event Timeline


Mentioned in SAL (#wikimedia-cloud) [2019-03-15T10:23:32Z] <arturo> create VM arturo-bastion-sssd-test (T218126)

aborrero moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

I started with this basic config:

[sssd]
domains = wikimedia_ldap
services = nss, pam
config_file_version = 2

[nss]
filter_groups = root
filter_users = root

[pam]

[domain/wikimedia_ldap]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap-labs.eqiad.wikimedia.org
ldap_search_base = ou=people,dc=wikimedia,dc=org
ldap_tls_reqcert = demand
cache_credentials = true
min_id = 10000
max_id = 20000
enumerate = False
ldap_id_use_start_tls = False
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_schema = rfc2307bis
ldap_auth_disable_tls_never_use_in_production = true
use_fully_qualified_names = True

also, in nsswitch.conf:

passwd:         files sss
group:          files sss
shadow:         files sss

and removing nscd/nslcd from the testing system. The setup doesn't work yet though.
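
A generic debugging aside (not from the original report): sssd can be run in the foreground with a raised debug level, and the LDAP server/base DN from the config above can be checked independently with ldapsearch. These are standard sssd/OpenLDAP tools; the filter value is just an example:

# run sssd in the foreground with debug output (-i = interactive, -d = debug level)
sudo systemctl stop sssd
sudo sssd -i -d 6

# verify that the server and search base answer at all, outside of sssd
ldapsearch -x -H ldap://ldap-labs.eqiad.wikimedia.org \
    -b ou=people,dc=wikimedia,dc=org uid=aborrero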

Ok, I have a working setup.

Install required packages:

apt-get install libpam-sss libnss-sss sssd

Purge nscd, nslcd:

apt-get purge nscd nslcd

Configure stuff:

/etc/sssd/sssd.conf
[sssd]
domains = wikimedia.org
default_domain_suffix = wikimedia.org
services = nss, pam, ssh
config_file_version = 2

[nss]
filter_groups = root
filter_users = root

[pam]

[ssh]

[domain/wikimedia.org]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://ldap-eqiad-replica01.wikimedia.org
ldap_default_bind_dn = cn=proxyagent,ou=profile,dc=wikimedia,dc=org
ldap_default_authtok = NotSureIfThisIsMeantToBePublic
ldap_search_base = ou=people,dc=wikimedia,dc=org
ldap_group_search_base = ou=groups,dc=wikimedia,dc=org
ldap_tls_reqcert = demand
cache_credentials = False
enumerate = False
ldap_id_use_start_tls = False
ldap_tls_cacertdir = /etc/openldap/cacerts
ldap_schema = rfc2307bis
use_fully_qualified_names = True
/etc/nsswitch.conf
[...]
passwd:         files sss
group:          files sss
shadow:         files sss
[...]
/etc/ssh/sshd_config
[...]
# disable this
#AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup
#AuthorizedKeysCommandUser ssh-key-ldap-lookup

# write this
AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
AuthorizedKeysCommandUser nobody
[...]

You can even drop the /etc/ldap/ldap.conf and /etc/ldap.conf files.

Tests:

$ /usr/bin/sss_ssh_authorizedkeys aborrero
ssh-rsa AAAAB3xxxxxx arturo@example.com
$ id aborrero
uid=18194(aborrero@wikimedia.org) gid=500(wikidev@wikimedia.org) [...] (very long list)
$ ssh arturo-bastion-sssd-test.eqiad.wmflabs
[...] just works [...]

This setup doesn't have any specific fine-tuning for our environment, though. I disabled caching and enumeration.
If we are interested I could try cooking a puppet patch to integrate all of this, or do some more testing.

> You can even drop the /etc/ldap/ldap.conf and /etc/ldap.conf files.

I imagine that would break a few scripts that expect /etc/ldap.conf

> $ id aborrero
> uid=18194(aborrero@wikimedia.org) gid=500(wikidev@wikimedia.org) [...] (very long list)

I wonder how well this @wikimedia.org suffix stuff will work with keyholder etc. which care about groups.

That bit looks configurable (haven't found @aborrero's test machine yet to play with it): https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-user-ids

That could be causing trouble with getent etc as well.
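
For reference, these seem to be the sssd.conf knobs that control the suffix (a sketch based on the sssd.conf man page, not necessarily the exact change made on the test host):

[sssd]
# dropping default_domain_suffix lets bare names resolve without @wikimedia.org
# full_name_format controls how names are printed; %1$s is just the username
full_name_format = %1$s

[domain/wikimedia.org]
# stop printing/requiring the domain-qualified form
use_fully_qualified_names = False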

That worked:

bstorm@arturo-bastion-sssd-test:~$ id
uid=18713(bstorm) gid=500(wikidev) groups=500(wikidev),etc...

sudo was broken before I did that. It works now. I suspect it's just because our sudo setup expected a bare username. This just changes display, not the name as sssd understands it.

I don't remember. I'm always really nervous about pasting passwords or other sensitive information into public sites.

> That worked:
>
> bstorm@arturo-bastion-sssd-test:~$ id
> uid=18713(bstorm) gid=500(wikidev) groups=500(wikidev),etc...
>
> sudo was broken before I did that. It works now. I suspect it's just because our sudo setup expected a bare username. This just changes display, not the name as sssd understands it.

Cool, thanks! :-)

Needs discussion: do we want to keep investigating further? Shall I write a puppet patch and try to integrate this into the Toolforge setup?

I think we need to confirm that caching works. My understanding is that nscd/nslcd is working fine, except for caching things.

A good way to test this at the network level is to run tcpdump -n -i eth0 port 389 on the client machine and do things like, open a new SSH session, or run sudo repeatedly and see if the server makes requests to LDAP, what they are, etc.
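
As a concrete sketch of that check (commands assumed, not from the task): capture LDAP traffic while repeatedly triggering the same lookups; with working caching, only the first round should generate traffic.

# capture only LDAP traffic to a file
sudo tcpdump -n -i eth0 tcp port 389 -w ldap-test.pcap &

# trigger the same lookups a few times; later rounds should be served from cache
for i in 1 2 3; do id aborrero >/dev/null; getent group wikidev >/dev/null; sleep 2; done

# stop the capture and count what was actually sent to the LDAP servers
sudo kill %1
sudo tcpdump -r ldap-test.pcap | wc -l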

> I think we need to confirm that caching works. My understanding is that nscd/nslcd is working fine, except for caching things.
>
> A good way to test this at the network level is to run tcpdump -n -i eth0 port 389 on the client machine and do things like, open a new SSH session, or run sudo repeatedly and see if the server makes requests to LDAP, what they are, etc.

We would have to define what a good caching policy is for us. @bd808 explicitly mentioned that we aren't interested in credential caching or user enumeration. But caching just usernames/group names may be good. I don't know; I'm not fully aware of the original problem statement.
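
If we go that way, the relevant knobs would look roughly like this (option names from the sssd.conf man page; the timeout values are placeholders, not a recommendation):

[domain/wikimedia.org]
# no credential caching: offline logins with cached passwords stay disabled
cache_credentials = False
# identity caching (users/groups) is separate from credential caching;
# entries are answered from the local cache for this many seconds
entry_cache_timeout = 600
# group entries can be given their own, longer timeout
entry_cache_group_timeout = 3600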

> Needs discussion: do we want to keep investigating further? Shall I write a puppet patch and try to integrate this into the Toolforge setup?

Do we want to introduce a branch in the manifests for nslcd vs. sssd so toolforge can have a non-standard one? Or should this be considered for all Cloud VPS instances?

After discussion in the team:

  • A useful but annoying test would be to manually hack up a Stretch grid node to use sssd and then see how the LDAP traffic looks.
  • No credential caching is needed. Group membership caching is indeed interesting.
  • Arturo will do some more testing to measure the traffic impact on the LDAP servers.

Change 498359 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: introduce sssd, replacing nscd/nslcd

https://gerrit.wikimedia.org/r/498359

Mentioned in SAL (#wikimedia-cloud) [2019-03-22T13:59:03Z] <arturo> create VMs arturo-sgeexec-sssd-test-[12] for testing T218126

I have a patch ready: https://gerrit.wikimedia.org/r/c/operations/puppet/+/498359

This is my proposed plan:

  • merge the patch into ops/puppet.git (or a local puppetmaster); the patch has no effect by itself since it requires an explicit hiera key
  • select N hosts and introduce profile::ldap::client::labs::client_stack: sssd into their hiera config (see the example after this list)
  • observe LDAP servers and client behavior, do some testing, etc.
  • we can revert the patch if we aren't convinced, or make it the default.
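
For clarity, the opt-in itself would just be one hiera key, set in Horizon for the selected instances or prefix (the value names match what is used elsewhere in this task):

# Horizon hiera config for the selected instances/prefix
profile::ldap::client::labs::client_stack: sssd
# everything else keeps the default ("classic", i.e. nscd/nslcd)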

That plan sounds good to me! I'd start by doing before/after tests of starting jobs on a tools-sgeexec node.

@Bstorm for easier testing, I have this idea:

  • let's depool a random sge-exec node so no jobs are running there
  • let's create a new sge queue specifically for sssd testing. All jobs in this queue will run on the empty sge-exec node from the previous step. I'm not sure if this is even possible or if it fits our current config layout.
  • then, let's schedule some random real-world tool (chosen by us, some testing tool or whatever) in the new sssd queue. Measure LDAP traffic with nscd/nslcd and without sssd. Clean the queue.
  • then, switch the sge-exec node to sssd (my patch). Schedule the same tool again in this queue. Measure LDAP traffic without nscd/nslcd and with sssd.
  • repeat for other tools/other exec nodes as required.

Could you please advise whether this is even possible? I know how to do most steps in this plan myself, except the ones related to the sge queues, and I will need a bit of help with those.

The problem is that I'm unsure how to control which node a given job will run on, and we need to measure LDAP traffic before/after on that node.

(partially a note to self) Per the team meeting, I'll be creating a test queue to add a server to, so that we can do testing without any impact on Toolforge workloads.

I have the queue config file ready @aborrero. When you have a server with a novel name created and the exec node role applied to it, I can add the host and add it to the queue manually. A re-run of the grid-configurator script might remove it, but that requires manual actions (and adding the host and queue back is trivial once I've got the config files in my home dir 😁).

Mentioned in SAL (#wikimedia-cloud) [2019-03-26T17:31:37Z] <arturo> T218126 create VM instances tools-sssd-sgeexec-test-[12]

test@tools-sssd-sgeexec-test-1 BI    0/0/50         -NA-     -NA-          au
---------------------------------------------------------------------------------
test@tools-sssd-sgeexec-test-2 BI    0/0/50         -NA-     -NA-          au

The queues are ready, though the hosts report exec procs offline at the moment

Note: the seqnum on these queues is 3, so they should not be used by anything unless specifically requested.

Change 498359 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] WMCS: introduce sssd, replacing nscd/nslcd

https://gerrit.wikimedia.org/r/498359

FYI, the queue name is test. So submit with jsub -q test or similar. If you use qsub, you have to specify -l v_mem=<something>.

Got a couple of emails in rootspam today similar to this:

Job 1191607 caused action: Job 1191607 set to ERROR
 User        = tools.random-featured
 Queue       = test@tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job: can't get password entry for user "tools.random-featured". Either user does not exist or error with NIS/LDAP etc.

Not sure if this means the queue is open for general usage by mistake?

This email is from a tool I just created:

Job 1193713 caused action: Job 1193713 set to ERROR
 User        = tools.arturo-test-tool
 Queue       = test@tools-sssd-sgeexec-test-2.tools.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job: can't get password entry for user "tools.arturo-test-tool". Either user does not exist or error with NIS/LDAP etc.

It seems sssd is failing to provide some info? Perhaps I forgot some config bit, but this is weird because:

aborrero@tools-sssd-sgeexec-test-2:~$ getent passwd tools.arturo-test-tool
tools.arturo-test-tool:*:54005:54005:tools.arturo-test-tool:/data/project/arturo-test-tool:/bin/bash

aborrero@tools-sssd-sgeexec-test-2:~$ getent group tools.arturo-test-tool
tools.arturo-test-tool:*:54005:aborrero

Definitely, there is some random stuff running in the new queue:

Job 1194187 caused action: Job 1194187 set to ERROR
 User        = tools.dewikinews-rss
 Queue       = test@tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed assumedly before job: can't get password entry for user "tools.dewikinews-rss". Either user does not exist or error with NIS/LDAP etc.

I would try disabling the queue now to avoid random stuff running here.

Mentioned in SAL (#wikimedia-cloud) [2019-03-27T12:15:20Z] <arturo> T218126 aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2' (and 1)

@aborrero: that suggests that those tools are submitting jobs in a way that should really be stopped. I'll look into it. We don't want web queues running cron jobs either (which they would, preferentially, before the test queues). Basically, a misconfigured job that has no queue requested can run on any available queue. That's why they were running.

I'm going to find out why those jobs ran there. They should not have (though grid engine can be kind of stupid unless you specify that a requirement is "hard"). It is not open for general usage per se; it's just that a misconfigured job can run there. I can try to lock it down more.

Yup, that's why. We have some users who don't specify a queue, which allows the job to run on any available queue. I verified this based on the jobs launched in your comments above. Honestly, this might be a great honeypot to find all of them so we can direct them to add -q task to the cron jobs that use qsub. As it is, their jobs will be running on web queues, continuous queues, etc. They would also have been able to use the restricted "single tool queues", since we don't have a highly restrictive setup around queue names for jobs submitted without jsub/jstart or webservice.
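
For the affected tools, the fix would be along these lines (a hypothetical crontab entry; the tool path and script name are made up):

# before: no queue requested, so the job can land on any available queue
#*/10 * * * * /usr/bin/jsub /data/project/mytool/update.sh
# after: pin cron jobs to the task queue
*/10 * * * * /usr/bin/jsub -q task /data/project/mytool/update.sh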

If we can tolerate the spam, I'd almost rather leave the queues up so we can email all the Toolforge users who need to add -q task to their cron jobs. There likely aren't very many of them. Disabling the queues will cut down on the spam. The queues/hosts are configured properly; the jobs that run there are misconfigured.

I have an idea as well: I can set your account or your tool as the owner of the queue. Then it will be less schedulable.

I went ahead and fully locked down the queue. arturo-test-tool now owns it fully. This is exactly how dedicated queues are set up...if anything runs on them now, we have something we should fix about our setup.

I re-enabled them to see if anything runs. If it does, we seriously need to fix those jobs.

Thanks! I will follow up with my testing.

Not sure if really important, but there is this puppet agent error in both testing machines:

aborrero@tools-sssd-sgeexec-test-1:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1553774615'
Notice: openstack::clientpackages::mitaka::stretch: no special configuration yet
Notice: /Stage[main]/Openstack::Clientpackages::Mitaka::Stretch/Notify[openstack::clientpackages::mitaka::stretch: no special configuration yet]/message: defined 'message' as 'openstack::clientpackages::mitaka::stretch: no special configuration yet'
Notice: The LDAP client stack for this host is: sssd
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd'
Error: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock at /etc/puppet/modules/sonofgridengine/manifests/join.pp:9
Error: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock at /etc/puppet/modules/sonofgridengine/manifests/join.pp:9
Wrapped exception:
No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock
Error: /Stage[main]/Profile::Toolforge::Grid::Node::Compute::Dedicated/Sonofgridengine::Join[hostgroups-tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs]/File[/hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs]/ensure: change from absent to file failed: Could not set 'file' on ensure: No such file or directory @ dir_s_mkdir - /hostgroups/tools-sssd-sgeexec-test-1.tools.eqiad.wmflabs20190328-7859-gkcgxq.lock at /etc/puppet/modules/sonofgridengine/manifests/join.pp:9
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 42.87 seconds

Some measurements finally.

TL;DR: the nscd/nslcd stack apparently generated about 100x more LDAP traffic than sssd. More tests and measurements may be appropriate.

setup

servers:

  • tools-sssd-sgeexec-test-1.eqiad.wmflabs
  • tools-sssd-sgeexec-test-2.eqiad.wmflabs

sge queue:

aborrero@tools-sgegrid-master:~$ sudo qstat -f
[...]
---------------------------------------------------------------------------------
test@tools-sssd-sgeexec-test-1 BI    0/0/50         0.02     lx-amd64      
---------------------------------------------------------------------------------
test@tools-sssd-sgeexec-test-2 BI    0/0/50         0.03     lx-amd64

testing tool (a simple script trying to generate some LDAP traffic):

#!/bin/bash

echo
echo "#####"
echo
sleep 5
echo "$0 running on $(hostname). PWD = $(pwd)"

echo "Trying to generate some LDAP traffic"
ls -l >/dev/null
sleep 1
ls -l /data/project >/dev/null
sleep 1
getent initgroups tools.arturo-test-tool
getent passwd tools.arturo-test-tool
getent group tools.arturo-test-tool
sleep 1
getent group aborrero
getent passwd aborrero

echo "Sleeping for 1 minute"
sleep 60
echo "Done sleeping for 1 minute"

# done
exit 0

launching the tool jobs (4 instances):

tools.arturo-test-tool@tools-sgebastion-07:~$ jsub -q test arturo-test-script.sh ; jsub -q test arturo-test-script.sh ; jsub -q test arturo-test-script.sh ; jsub -q test arturo-test-script.sh

using nscd/nslcd

This is the default config. hiera config in horizon (each VM instance): profile::ldap::client::labs::client_stack: classic

Ensure no sssd daemon is running. Reboot the server to ensure clean nscd/nslcd caches.
After the server has started and done its first checks against the LDAP server (about 5 to 10 seconds), start listening; the LDAP traffic captured from then on is the traffic produced by the tool:

aborrero@tools-sssd-sgeexec-test-1:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-2:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-2.txt

Stop tcpdump after the sge jobs are done. This resulted in:

aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-1.txt 
10135 ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-2.txt 
10907 ldap-traffic-2.txt

using sssd

hiera config in horizon (each VM instance): profile::ldap::client::labs::client_stack: sssd

A complete server reboot was required after deleting nscd/nslcd, because sssd was complaining that the nscd socket was still around even though the software was no longer on the server.
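
(An aside, assuming the complaint was only about the leftover socket file: removing the stale socket at its standard path and restarting sssd might avoid the full reboot.)

sudo rm -f /var/run/nscd/socket
sudo systemctl restart sssd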

Ensure sssd caches are empty (a complete stop and then a start).

aborrero@tools-sssd-sgeexec-test-1:~$ sudo systemctl stop sssd && sudo systemctl start sssd
aborrero@tools-sssd-sgeexec-test-2:~$ sudo systemctl stop sssd && sudo systemctl start sssd
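
Note for reproducibility (an aside, not part of the original procedure): sssd also keeps an on-disk cache under /var/lib/sss/db that survives restarts, so a fuller wipe would be something like:

sudo systemctl stop sssd
sudo sss_cache -E || true          # invalidate all cached entries
sudo rm -f /var/lib/sss/db/*.ldb   # or remove the cache databases entirely
sudo systemctl start sssd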

After the daemon has started and done its first checks against the LDAP server (about 5 to 10 seconds), start listening; the LDAP traffic captured from then on is the traffic produced by the tool:

aborrero@tools-sssd-sgeexec-test-1:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-2:~$ sudo tcpdump -i eth0 tcp port 389 | tee ldap-traffic-2.txt

Stop tcpdump after the sge jobs are done. This resulted in:

aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-1.txt 
41 ldap-traffic-1.txt
aborrero@tools-sssd-sgeexec-test-1:~$ wc -l ldap-traffic-2.txt 
61 ldap-traffic-2.txt

conclusions

The numbers in the nscd/nslcd case are very high compared to those from sssd: higher by 100x or more (10135 vs. 41 and 10907 vs. 61 captured lines).

I don't know why this is, but there are several possibilities:

  • I didn't measure nscd/nslcd correctly, because I didn't wait long enough before capturing my tool's traffic and so also captured traffic related to the stack startup. This is very unlikely, though.
  • nscd/nslcd queries LDAP for more stuff than sssd because of our configuration. After all, the sssd config tested here is rather simple (i.e. not a fair comparison between the two stacks).
  • nscd/nslcd has a different (worse) traffic pattern that requires more back and forth with the LDAP server (i.e. the point we were originally trying to prove).

How could testing be improved? In general I think the measurements I did were very rough, with many loose ends, such as: when exactly should tcpdump be started?
A more proper measurement setup would be to:

  • add network accounting on the LDAP servers for these specific test hosts (running sssd) plus a couple of other sge exec nodes (running nscd/nslcd)
  • have them deal with standard SGE load over a period of 24h
  • compare traffic metrics, ideally with some grafana graphs

> Not sure if really important, but there is this puppet agent error in both testing machines:
> <snip>

I seem to have resolved this on one of the test hosts with:

root@tools-sssd-sgeexec-test-1:~# mkdir /hostgroups

Seems like probably a puppet issue that @Bstorm might want to look at. I don't know if it's worth re-running your tests with those lockfiles working.

This looks great, as far as the test goes. With that volume of difference, errors in testing are less likely to make a big difference. I'll peek at the puppet thing today.

If we trust the numbers and we want to move forward, I suggest we choose a class of servers (exec nodes?), switch them all, and see what happens.

I'm very curious about why the join puppet class wasn't working (not like it should even be in the general hostgroup...that's a leftover from older code). It was missing the sourcedir param to the class somehow, so it ended up thinking the hostgroup was at /. That makes no sense. Trying to see why.

Found it! I never fixed all of the old code in the dedicated class. Putting up a patch to fix it...just because. I think we should try rolling this out. Exec nodes would be a very big change, but we'd see if it was broken very quickly (and should see a difference on the ldap server). Sounds great to me to start there.

The puppet issue will not affect any of the actually active nodes.

Change 499813 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: fix the dedicated class up a bit

https://gerrit.wikimedia.org/r/499813

Change 499813 merged by Bstorm:
[operations/puppet@production] gridengine: fix the dedicated class up a bit

https://gerrit.wikimedia.org/r/499813

Needs discussion in the WMCS team meeting:

  • it seems we accept that the numbers were better with sssd than with the classic stack
  • Do we want to deploy sssd into Toolforge? We should decide how to move forward with this.
  • some ideas: roll out sssd to a whole class of servers (tools-sgeexec, tools-sgewebgrid-generic, tools-sgewebgrid-lighttpd, any others) and, if things go well, move on to the next

result of the meeting:

  • add sssd -> classic stack cleanup code to puppet
  • switch the sge-exec nodes one day during my morning (arturo); a rough rollout sketch follows below
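
A rough per-node sequence for that switch might look like this (a sketch under assumptions: the node name is illustrative, the hiera change happens in Horizon, and draining/repooling uses plain gridengine commands):

# on the grid master: disable all queue instances on the node and wait for jobs to drain
sudo qmod -d '*@tools-sgeexec-09XX'
sudo qstat -f | grep tools-sgeexec-09XX     # repeat until no jobs remain

# in Horizon: set profile::ldap::client::labs::client_stack: sssd for the instance

# on the node: apply puppet (installs sssd, removes nscd/nslcd) and reboot
sudo puppet agent -tv
sudo reboot

# on the grid master: re-enable the node
sudo qmod -e '*@tools-sgeexec-09XX'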

Change 502519 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] ldap: include sssd cleanup in the classic stack

https://gerrit.wikimedia.org/r/502519

Change 502519 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] ldap: include sssd cleanup in the classic stack

https://gerrit.wikimedia.org/r/502519

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:27:47Z] <arturo> T218126 hard reboot tools-sgeexec-0932

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:40:59Z] <arturo> T218126 hard reboot tools-sgeexec-0918

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:56:49Z] <arturo> force deleted job 853139 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:57:40Z] <arturo> force deleted job 853968 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T09:58:08Z] <arturo> force deleted job 871945 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:01:59Z] <arturo> T218126 hard reboot tools-sgeexec-0907

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:15:43Z] <arturo> force deleted job 1044941 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:16:39Z] <arturo> force deleted job 1045059 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:19:15Z] <arturo> T218126 hard reboot tools-sgeexec-0914

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:26:13Z] <arturo> force deleted job 1739135 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:26:54Z] <arturo> force deleted job 1739159 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:27:03Z] <arturo> T218126 hard reboot tools-sgeexec-0935

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:42:10Z] <arturo> force deleted job 853604 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:43:57Z] <arturo> T218126 hard reboot tools-sgeexec-0915

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:49:14Z] <arturo> T218126 hard reboot tools-sgeexec-0923

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T10:53:42Z] <arturo> force deleted job 1264656 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:03:31Z] <arturo> T218126 hard reboot tools-sgeexec-0928

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:23:46Z] <arturo> T218126 hard reboot tools-sgeexec-0940

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:42:35Z] <arturo> force deleted job 1262132 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:47:35Z] <arturo> T218126 hard reboot tools-sgeexec-0921

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T11:55:56Z] <arturo> T218126 hard reboot tools-sgeexec-0924

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:02:16Z] <arturo> force deleted job 1044978 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:06:05Z] <arturo> T218126 hard reboot tools-sgeexec-0901

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:27:26Z] <arturo> T218126 hard reboot tools-sgeexec-0925

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:31:47Z] <arturo> T218126 hard reboot tools-sgeexec-0926

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:42:09Z] <arturo> force deleted job 1045055 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T12:53:09Z] <arturo> force deleted job 1011036 because it was stucked (trying to depool exec node for T218126)

Mentioned in SAL (#wikimedia-cloud) [2019-04-10T13:06:37Z] <arturo> T218126 hard reboot tools-sgeexec-0906

Status update:

  • sssd is now running on all tools-sgeexec nodes in Toolforge (40 of them)
  • approx. half (~20) required a hard reboot (instead of a soft reboot), probably because of stuck processes related to NFS
  • the script I used can be found here: https://wikitech.wikimedia.org/wiki/User:Arturo_Borrero_Gonzalez#sssd_rollout.sh I think it can be reused for the next server class
  • everything seems OK; all went well in general.

Closing this task as resolved. The trial/testing is complete. I will follow up on T221205: Toolforge: deploy sssd to tools-sgewebgrid* nodes and others.