Page MenuHomePhabricator

sudo randomly prompts for passwords in cloud instances when LDAP lookups fail
Closed, ResolvedPublic

Description

Example run:

17:09:35 17:09:35 Started scap: beta-scap-eqiad (build #223032)
17:09:39 17:09:39 Copying to deployment-deploy01.deployment-prep.eqiad.wmflabs from deployment-deploy01.deployment-prep.eqiad.wmflabs
17:09:39 17:09:39 Started rsync common
17:09:44 17:09:44 Finished rsync common (duration: 00m 05s)
17:09:44 17:09:44 Started cache_git_info
17:22:09 17:22:09 Finished cache_git_info (duration: 12m 24s)
17:22:10 17:22:10 Started l10n-update
17:22:10 17:22:10 Updating ExtensionMessages-master.php
17:22:12 17:22:12 Updating LocalisationCache for master using 6 thread(s)
17:23:04 17:23:04 Last output:
17:23:04 sudo: a password is required
17:23:04 17:23:04 Last output:
17:23:04 sudo: a password is required
17:23:04 17:23:04 Finished l10n-update (duration: 00m 54s)
17:23:04 17:23:04 Unhandled error:
17:23:04 Traceback (most recent call last):
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 342, in run
17:23:04     exit_status = app.main(app.extra_arguments)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/main.py", line 552, in main
17:23:04     return super(Scap, self).main(*extra_args)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/main.py", line 70, in main
17:23:04     self._after_sync_common()
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/main.py", line 575, in _after_sync_common
17:23:04     version, wikidb, self.verbose, self.config)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
17:23:04     return func(*args, **kwargs)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 637, in update_localization_cache
17:23:04     wikidb, cache_dir, use_cores, force=force_rebuild, quiet=quiet_rebuild)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 551, in _call_rebuildLocalisationCache
17:23:04     'out_dir': out_dir
17:23:04   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
17:23:04     self.gen.throw(type, value, traceback)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 541, in sudo_temp_dir
17:23:04     sudo_check_call(owner, 'find "%s" -maxdepth 1 -delete' % dirname)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
17:23:04     return func(*args, **kwargs)
17:23:04   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 497, in sudo_check_call
17:23:04     raise subprocess.CalledProcessError(proc.returncode, cmd)
17:23:04 CalledProcessError: Command 'find "/tmp/scap_l10n_3445030935" -maxdepth 1 -delete' returned non-zero exit status 1
17:23:04 17:23:04 scap failed: CalledProcessError Command 'find "/tmp/scap_l10n_3445030935" -maxdepth 1 -delete' returned non-zero exit status 1 (duration: 13m 29s)

Not sure if ldap is somehow flapping, but this has been self-recovering for a while now. This has not, as far as I know, happened in production so it is very likely something to do with the beta-cluster infrastructure and not scap itself.

The code that is failing is part of sudo_temp_dir a utility contextlib method that's only use is in rebuildLocalisationCache: https://phabricator.wikimedia.org/source/scap/browse/master/scap/tasks.py$522

Event Timeline

The other day I briefly observed that sudo had stopped functioning for me (on a random deployment-prep box, may have been deploy01) too. It resumed working soon after.

Krenair renamed this task from Scap in beta fails occasionally due to permissions to sudo randomly prompts for passwords sometimes in labs instances.Oct 8 2018, 9:26 PM

Mentioned in SAL (#wikimedia-operations) [2018-10-15T15:31:16Z] <andrewbogott> restarting slapd on seaborgium as a test for T205463

sudo is managed by sudo-ldap. The config files it uses are /etc/sudo-ldap.conf and /etc/ldap.conf.

In both cases we specify redundant ldap servers, so I wouldn't expect an ldap flap to cause trouble (and, indeed, I just tried one of the ldap servers to see if that broke anything and it didn't.)

I suspect that the most likely explanation for this is a temporary dns failure. If we can get specific time/date/location of the next time this failure occurs we can dig in some logs and see if there was a corresponding dns issue.

aborrero renamed this task from sudo randomly prompts for passwords sometimes in labs instances to sudo randomly prompts for passwords sometimes in cloud instances.Oct 18 2018, 10:01 AM

The latest is 18-Oct-2018 08:22:57: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/225127/console
location is deployment-deploy01.deployment-prep.eqiad.wmflabs in the main deployment

Also: 17-Oct-2018 18:10:08 https://integration.wikimedia.org/ci/job/beta-scap-eqiad/225074/console

TL;DR; I think we're hitting a bug in nslcd where it tries to reconnect to LDAP after some transient issue and uses the wrong credentials to authenticate for some reason (but I could be missing some details about the internals of nslcd/LDAP).

TL;DR; I cannot reproduce this issue with a different user than jenkins-deploy, pointing to something in the sudoers configuration maybe.

So I broke the CI job a few times while trying to reproduce the issue today, sorry about that:

https://integration.wikimedia.org/ci/job/beta-scap-eqiad/225127/console
https://integration.wikimedia.org/ci/job/beta-scap-eqiad/225139/console

05:22:57 08:22:57 Last output:
05:22:57 sudo: a password is required
05:22:57 08:22:57 Last output:
05:22:57 sudo: a password is required
05:22:57 08:22:57 Finished l10n-update (duration: 01m 09s)
05:22:57 08:22:57 Unhandled error:
05:22:57 Traceback (most recent call last):
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 342, in run
05:22:57     exit_status = app.main(app.extra_arguments)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/main.py", line 584, in main
05:22:57     return super(Scap, self).main(*extra_args)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/main.py", line 83, in main
05:22:57     self._after_sync_common()
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/main.py", line 607, in _after_sync_common
05:22:57     version, wikidb, self.verbose, self.config)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
05:22:57     return func(*args, **kwargs)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 637, in update_localization_cache
05:22:57     wikidb, cache_dir, use_cores, force=force_rebuild, quiet=quiet_rebuild)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 551, in _call_rebuildLocalisationCache
05:22:57     'out_dir': out_dir
05:22:57   File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
05:22:57     self.gen.throw(type, value, traceback)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 541, in sudo_temp_dir
05:22:57     sudo_check_call(owner, 'find "%s" -maxdepth 1 -delete' % dirname)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
05:22:57     return func(*args, **kwargs)
05:22:57   File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 497, in sudo_check_call
05:22:57     raise subprocess.CalledProcessError(proc.returncode, cmd)
05:22:57 CalledProcessError: Command 'find "/tmp/scap_l10n_166353577" -maxdepth 1 -delete' returned non-zero exit status 1
05:22:57 08:22:57 scap failed: CalledProcessError Command 'find "/tmp/scap_l10n_166353577" -maxdepth 1 -delete' returned non-zero exit status 1 (duration: 14m 06s)

/var/log/auth.log:

Oct 18 08:23:12 deployment-deploy01 sudo: pam_unix(sudo:auth): conversation failed
Oct 18 08:23:12 deployment-deploy01 sudo: pam_unix(sudo:auth): auth could not identify password for [jenkins-deploy]
Oct 18 08:23:12 deployment-deploy01 sudo: pam_ldap(sudo:auth): failed to get password: Authentication failure

Similar issue happens *sometimes* when nslcd is stopped and started (that's how I reproduced it). After starting, it will for some time receive requests but one thread be unable to fulfill them because the connection to the LDAP server failed.

Here are some logs from when I could reproduce the issue and had nslcd in debug mode.

jenkins-deploy@deployment-deploy01:~$ sudo ls

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for jenkins-deploy:

(Note: sometimes logging out of jenkins-deploy and back in made it worked automatically while other sessions I had open continued to fail)

/var/log/auth.log:

Oct 18 12:31:35 deployment-deploy01 sudo: pam_unix(sudo:auth): conversation failed
Oct 18 12:31:35 deployment-deploy01 sudo: pam_unix(sudo:auth): auth could not identify password for [jenkins-deploy]
Oct 18 12:31:35 deployment-deploy01 sudo: pam_ldap(sudo:auth): failed to get password: Authentication failure

/var/log/syslog:

Oct 18 12:26:55 deployment-deploy01 nslcd[26430]: Starting LDAP connection daemon: nslcd.
Oct 18 12:26:55 deployment-deploy01 nslcd[26439]: accepting connections
...
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_start_tls_s()
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_sasl_bind("uid=jenkins-deploy,ou=people,dc=wikimedia,dc=org","***") (uri="ldap://ldap-labs.eqiad.wikimedia.org:389") (ppolicy=yes)
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: got LDAP_CONTROL_PASSWORDPOLICYRESPONSE (No error)
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_parse_result() result: Invalid credentials
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: failed to bind to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389: Invalid credentials
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_unbind()
...
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_start_tls_s()
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_sasl_bind("uid=jenkins-deploy,ou=people,dc=wikimedia,dc=org","***") (uri="ldap://ldap-labs.codfw.wikimedia.org:389") (ppolicy=yes)
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: got LDAP_CONTROL_PASSWORDPOLICYRESPONSE (No error)
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_parse_result() result: Invalid credentials
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: failed to bind to LDAP server ldap://ldap-labs.codfw.wikimedia.org:389: Invalid credentials
Oct 18 12:30:12 deployment-deploy01 nslcd[29172]: [1da317] <authc="jenkins-deploy"> DEBUG: ldap_unbind()

We can see that nslcd fails to connect to ldap-labs.eqiad and then gets the same error from ldap-labs.codfw.

Before these errors ocurred, nslcd had been able to bind to the LDAP server 8 times. Just a second later after the ldap-labs.codfw failure, it was able to bind just fine:

Oct 18 12:30:13 deployment-deploy01 nslcd[29172]: [43a858] <group=50062> DEBUG: ldap_simple_bind_s("cn=proxyagent,ou=profile,dc=wikimedia,dc=org","***") (uri="ldap://ldap-labs.eqiad.wikimedia.org:389")
Oct 18 12:30:13 deployment-deploy01 nslcd[29172]: [43a858] <group=50062> DEBUG: ldap_result(): cn=project-bastion,ou=groups,dc=wikimedia,dc=org
Oct 18 12:30:13 deployment-deploy01 nslcd[29172]: [43a858] <group=50062> DEBUG: deref cn=project-bastion,ou=groups,dc=wikimedia,dc=org member=uid=andrp,ou=people,dc=wikimedia,dc=org -> uid=andrp
Oct 18 12:30:13 deployment-deploy01 nslcd[29172]: [43a858] <group=50062> DEBUG: deref cn=project-bastion,ou=groups,dc=wikimedia,dc=org member=uid=elaragon,ou=people,dc=wikimedia,dc=org -> uid=elaragon

The difference here is that, when it works, nslcd is using ldap_simple_bind_s("cn=proxyagent...") and when it fails, it's using ldap_sasl_bind("uid=jenkins-deploy...").

/etc/nslcd.conf:

# The DN to bind with for normal lookups.
binddn cn=proxyagent,ou=profile,dc=wikimedia,dc=org

and /etc/sudo-ldap.conf:

BINDDN          cn=proxyagent,ou=profile,dc=wikimedia,dc=org

I don't know why it's trying to use the regular user credentials instead of proxyagent's. My understanding is that it should bind use proxyagent and then lookup jenkins-deploy's credentials... not try to use those credentials to connect to LDAP (since it doesn't even have a password from us anyway). In any case, the sudo config in LDAP for "ou=sudoers,cn=deployment-prep,ou=projects,dc=wikimedia,dc=org" should suffice for not requesting a password.

There isn't a whole lot of information from the LDAP servers. These message happen sporadically buy they were recorded right when nslcd failed, which is odd:

Oct 18 12:30:12 serpens slapd[27658]: slap_queue_csn: queueing 0x7f1e97ffdc80 20181018123012.746321Z#000000#002#000000
Oct 18 12:30:12 serpens slapd[27658]: slap_graduate_commit_csn: removing 0x7f1e8caab180 20181018123012.746321Z#000000#002#000000
Oct 18 12:30:12 seaborgium slapd[17122]: slap_queue_csn: queueing 0x7f28ccffac80 20181018123012.563336Z#000000#001#000000
Oct 18 12:30:12 seaborgium slapd[17122]: slap_graduate_commit_csn: removing 0x7f28b1701b50 20181018123012.563336Z#000000#001#000000

It could be related to the replication updating something about the failure that just occurred but I'm not sure.

In any case, there doesn't seem to be anything wrong with the LDAP servers at the time I was able to trigger the failure:
https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=seaborgium&var-datasource=eqiad%20prometheus%2Fops
https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=serpens&var-datasource=codfw%20prometheus%2Fops

I also experimented with using gdb to go into the nslcd's process and killing the file descriptors associated with the LDAP connections. However, nslcd properly recovered and re-opened the connections as expected.

No silver bullet, sorry, but hopefully this provides more information about what might be happening.

bd808 renamed this task from sudo randomly prompts for passwords sometimes in cloud instances to sudo randomly prompts for passwords in cloud instances when LDAP lookups fail.Nov 10 2019, 11:46 PM

I wonder if there is any way to set the equivalent of the -n, --non-interactive as the global default for sudo in all Cloud VPS instances? We could set up a global alias sudo=sudo -n which might help, but it would be more ideal if something can be added to /etc/sudoers that would make it just happen. The -n flag does not stop the "a password is required" error message, but it does disable the password prompt itself.

$ $ sudo -u tools.ato ls

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for bd808:
^C
$ sudo -n -u tools.ato ls
sudo: a password is required
Bstorm triaged this task as Medium priority.Feb 11 2020, 4:58 PM

I wonder if there is any way to set the equivalent of the -n, --non-interactive as the global default for sudo in all Cloud VPS instances?

`man sudoers` exerpt
passwd_tries      The number of tries a user gets to enter his/her pass‐
                  word before sudo logs the failure and exits.  The de‐
                  fault is 3.

Repeating the sudo -u tools.ato ls test from T205463#5799393 after adding Defaults passwd_tries=0 to /etc/sudoers:

$ sudo -u tools.ato ls

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

Sorry, user bd808 is not allowed to execute '/usr/bin/ls' as tools.ato on striker-docker-01.striker.eqiad1.wikimedia.cloud.

The "Sorry, ..." message shows immediately and no prompt is given.

`man sudoers` exerpt
lecture       This option controls when a short lecture will be printed
              along with the password prompt.  It has the following pos‐
              sible values:

              always  Always lecture the user.

              never   Never lecture the user.

              once    Only lecture the user the first time they run sudo.

              If no value is specified, a value of once is implied.
              Negating the option results in a value of never being used.
              The default value is once.

With Defaults lecture="never" also added to /etc/sudoers, things look even a little bit nicer:

$ sudo -u tools.ato ls
Sorry, user bd808 is not allowed to execute '/usr/bin/ls' as tools.ato on striker-docker-01.striker.eqiad1.wikimedia.cloud.

Change 932309 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords

https://gerrit.wikimedia.org/r/932309

Change 932309 merged by Andrew Bogott:

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords

https://gerrit.wikimedia.org/r/932309

Change 932309 merged by Andrew Bogott:

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords

https://gerrit.wikimedia.org/r/932309

Reverted by https://gerrit.wikimedia.org/r/c/operations/puppet/+/932269

Notes from Gerrit:

The validation step fails on Buster. Works on Bullseye for some reason.

The file itself passes validation; the issue seems to be that it's trying to validate the file before it's present:

Error: /Stage[main]/Profile::Wmcs::Instance/File[/etc/sudoers.d/T205463-disable-sudo-password-prompts]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/sbin/visudo -cqf /etc/sudoers.d/T205463-disable-sudo-password-prompts20230622-8203-clq8jl

The validation step fails on Buster. Works on Bullseye for some reason.

The file itself passes validation; the issue seems to be that it's trying to validate the file before it's present:

Error: /Stage[main]/Profile::Wmcs::Instance/File[/etc/sudoers.d/T205463-disable-sudo-password-prompts]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/sbin/visudo -cqf /etc/sudoers.d/T205463-disable-sudo-password-prompts20230622-8203-clq8jl

Per https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-validate_cmd the validation should be done on a temporary file as shown in the log message. It would be nice to have that file to manually check. I wonder if there is something missing like a newline to end the file that upsets visudo on Buster?

Per https://www.puppet.com/docs/puppet/5.5/types/file.html#file-attribute-validate_cmd the validation should be done on a temporary file as shown in the log message. It would be nice to have that file to manually check. I wonder if there is something missing like a newline to end the file that upsets visudo on Buster?

Seems like a trailing newline is the needed magic:

$ lsb_release -d
Description:    Debian GNU/Linux 10 (buster)
$ echo -n 'Defaults passwd_tries=0,lecture="never"' > T205463-disable-sudo-password-prompts20230622-8203-clq8jl
$ sudo /usr/sbin/visudo -cf T205463-disable-sudo-password-prompts20230622-8203-clq8jl
>>> T205463-disable-sudo-password-prompts20230622-8203-clq8jl: syntax error near line 1 <<<
parse error in T205463-disable-sudo-password-prompts20230622-8203-clq8jl near line 1
$ echo 'Defaults passwd_tries=0,lecture="never"' > T205463-disable-sudo-password-prompts20230622-8203-clq8jl
$ sudo /usr/sbin/visudo -cf T205463-disable-sudo-password-prompts20230622-8203-clq8jl
T205463-disable-sudo-password-prompts20230622-8203-clq8jl: parsed OK

Change 932330 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords (take 2)

https://gerrit.wikimedia.org/r/932330

Change 932330 merged by Andrew Bogott:

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords (take 2)

https://gerrit.wikimedia.org/r/932330

Latest patch produces:

Error: Execution of '/usr/sbin/visudo -cqf /etc/sudoers.d/T205463-disable-sudo-password-prompts20230622-16121-1gvjlkk' returned 1: 
Error: /Stage[main]/Profile::Wmcs::Instance/File[/etc/sudoers.d/T205463-disable-sudo-password-prompts]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/sbin/visudo -cqf /etc/sudoers.d/T205463-disable-sudo-password-prompts20230622-16121-1gvjlkk' returned 1:  (corrective)

Latest patch produces:

Error: Execution of '/usr/sbin/visudo -cqf /etc/sudoers.d/T205463-disable-sudo-password-prompts20230622-16121-1gvjlkk' returned 1: 
Error: /Stage[main]/Profile::Wmcs::Instance/File[/etc/sudoers.d/T205463-disable-sudo-password-prompts]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/sbin/visudo -cqf /etc/sudoers.d/T205463-disable-sudo-password-prompts20230622-16121-1gvjlkk' returned 1:  (corrective)

This time the problem was my use of single quotes rather than double quotes in content => 'Defaults passwd_tries=0,lecture="never"\n',. I had forgotten that Puppet treats '\n' as the literal characters \ and n rather than a newline. This meant that Buster did not see the trailing newline it wanted and all platforms saw two junk characters at the end of the config stanza.

Change 932344 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords (take 3)

https://gerrit.wikimedia.org/r/932344

Change 932344 merged by Andrew Bogott:

[operations/puppet@production] wmcs: Configure sudo not to prompt for passwords (take 3)

https://gerrit.wikimedia.org/r/932344

bd808 claimed this task.

I'm going to call this "done" for two different reasons: