Prometheus crontab installations via Puppet failing on Toolforge bastion hosts due to remote crontab wrapper script
Closed, Resolved, Public

Description

Today I ran clush -w @all 'sudo puppet agent --test' from the host tools-clushmaster-01.eqiad.wmflabs and the output showed some issues.

In the case of tools-bastion-05.tools.eqiad.wmflabs, this was:

[...]
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
tools-webgrid-generic-1401.tools.eqiad.wmflabs: Notice: Finished catalog run in 73.50 seconds
tools-exec-1417.tools.eqiad.wmflabs: Info: Retrieving pluginfacts
tools-bastion-05.tools.eqiad.wmflabs: Notice: /Stage[main]/Diamond/Service[diamond]: Triggered 'refresh' from 1 events
tools-exec-1417.tools.eqiad.wmflabs: Info: Retrieving plugin
tools-exec-1409.tools.eqiad.wmflabs: Notice: Finished catalog run in 95.50 seconds
tools-bastion-05.tools.eqiad.wmflabs: Notice: /Stage[main]/Prometheus::Node_puppet_agent/Cron[prometheus_puppet_agent_stats]/ensure: created
tools-bastion-05.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
tools-bastion-05.tools.eqiad.wmflabs:
tools-bastion-05.tools.eqiad.wmflabs: NOTE: some crontab entries have been modified to grid submissions.
tools-bastion-05.tools.eqiad.wmflabs: You may want to examine the result with 'crontab -e'.
tools-bastion-05.tools.eqiad.wmflabs:
tools-bastion-05.tools.eqiad.wmflabs: Notice: Finished catalog run in 116.87 seconds
tools-exec-1417.tools.eqiad.wmflabs: Info: Loading facts in /var/lib/puppet/lib/facter/interface_primary.rb
tools-exec-1417.tools.eqiad.wmflabs: Info: Loading facts in /var/lib/puppet/lib/facter/puppet_settings.rb
[...]

There seems to be some issue with ssh keys or host authorization.

Related to this issue (same clush run): T179388, T179387

Event Timeline

chasemp edited projects, added Toolforge; removed Tools.

NOTE: some crontab entries have been modified to grid submissions.

This comes from our /usr/local/bin/crontab wrapper script. It sounds like Puppet runs may be picking up that path instead of calling /usr/bin/crontab directly. If the caller's uid is < 499 then /usr/local/bin/crontab is supposed to short-circuit and just call /usr/bin/crontab without touching anything. It sounds like that is not working as intended on this Puppet run?

$ file /usr/local/bin/crontab
/usr/local/bin/crontab: Perl script, UTF-8 Unicode text executable
$ file /usr/bin/crontab
/usr/bin/crontab: setgid ELF 64-bit LSB  executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, BuildID[sha1]=afd15b9fb429c0b04c546b0c45ae9825435ee8d9, stripped
tools-bastion-02.tools:~/projects/toolforge-survey

See also: T156174: Rewrite /usr/local/bin/crontab in python; fix bugs

We have almost ported the Perl script to Python 3. There are some small bugs left to squash there. I need to review https://gerrit.wikimedia.org/r/#/c/383770/ for @zhuyifei1999 and see if that makes things work well enough to finish the transition.

The Permission denied (publickey,hostbased). error is 100% related to the NOTE: some crontab entries have been modified to grid submissions. message. When /usr/local/bin/crontab is used instead of /usr/bin/crontab then the wrapper script uses ssh to connect to tools-cron-01.tools.eqiad.wmflabs to actually store the crontab contents instead of storing them on the localhost.
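
A minimal sketch of the remote-storage step described above, assuming the wrapper simply shells out to ssh. The helper names (remote_crontab_argv, ssh_transport_failed) and the exact remote command are illustrative assumptions; the real wrapper's invocation is not shown in this task. The only outside fact relied on is that ssh exits with status 255 when ssh itself fails, for example on an authentication error.

```python
from typing import List

# Cron host named in this task.
CRON_HOST = "tools-cron-01.tools.eqiad.wmflabs"


def remote_crontab_argv(args: List[str], host: str = CRON_HOST) -> List[str]:
    """Build an argv that runs crontab on the cron host over ssh.

    Hypothetical helper: the actual wrapper may pass different options
    or run a different remote command.
    """
    return ["ssh", host, "/usr/bin/crontab", *args]


def ssh_transport_failed(returncode: int) -> bool:
    """Return True when ssh itself failed rather than the remote command.

    ssh exits with the remote command's status, or with 255 if an error
    occurred in ssh (such as "Permission denied (publickey,hostbased).").
    """
    return returncode == 255


argv = remote_crontab_argv(["-l"])
```

If hostbased authentication is refused for the calling user, nothing is stored on the cron host, which matches the "unable to execute remote crontab command" message seen later in this task.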

This is probably related to T170178: Update wikitech Titleblacklist and specifically T170178#3436626 where we had a problem with the username prometheus already being in use in our LDAP directory by a Cloud VPS user and the Prometheus package happily installed anyway.

tools-bastion-02.tools:~/projects/toolforge-survey
bd808$ id prometheus
uid=14736(prometheus) gid=500(wikidev) groups=500(wikidev)
tools-bastion-02.tools:~/projects/toolforge-survey
bd808$ ldap uid=prometheus
dn: uid=prometheus,ou=people,dc=wikimedia,dc=org
objectClass: inetOrgPerson
objectClass: posixAccount
uidNumber: 14736
gidNumber: 500
homeDirectory: /var/lib/prometheus
loginShell: /bin/false
uid: prometheus
cn: Prometheus daemon
sn: Prometheus daemon
description: Hack to clean up T170178
bd808 renamed this task from puppet agent issue with tools-bastion-05.tools.eqiad.wmflabs to Prometheus crontab installations via Puppet failing on Toolforge bastion hosts due to remote crontab wrapper script. Oct 31 2017, 3:12 PM

The resource in the Puppet manifest is:

cron { 'prometheus_puppet_agent_stats':
    ensure  => $ensure,
    user    => 'prometheus',
    command => "/usr/local/bin/prometheus-puppet-agent-stats --outfile ${outfile}",
}

I'm not sure if under the hood Puppet is doing the equivalent of sudo crontab -u prometheus ... or sudo -u prometheus crontab .... Either way this is running into a bug because /usr/local/bin/crontab is being run instead of /usr/bin/crontab. It doesn't look like there is a way to tell Puppet to use a different $PATH for the execs that happen behind the scenes when a cron resource is applied.
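
To illustrate the shadowing, here is a small sketch of how a PATH lookup picks the first matching executable. It uses temporary directories as stand-ins for /usr/local/bin and /usr/bin, and it assumes (this task suggests it but does not show it) that /usr/local/bin comes first in the PATH that Puppet's cron provider ends up using. The helper names are hypothetical.

```python
import os
import shutil
import stat
import tempfile


def make_fake_executable(directory: str, name: str) -> str:
    """Create a tiny executable file so shutil.which can find it."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def resolve(command: str, search_dirs: list):
    """Return the first match for command, searching the dirs in order."""
    return shutil.which(command, path=os.pathsep.join(search_dirs))


with tempfile.TemporaryDirectory() as root:
    # Stand-ins for /usr/local/bin and /usr/bin (temporary directories).
    local_bin = os.path.join(root, "usr_local_bin")
    usr_bin = os.path.join(root, "usr_bin")
    os.makedirs(local_bin)
    os.makedirs(usr_bin)
    wrapper = make_fake_executable(local_bin, "crontab")
    real = make_fake_executable(usr_bin, "crontab")

    # With the wrapper's directory first, the wrapper shadows the real binary.
    found_default = resolve("crontab", [local_bin, usr_bin])
    # With the order reversed, the lookup reaches the real binary instead.
    found_reversed = resolve("crontab", [usr_bin, local_bin])
```

Since the search order cannot be changed from the cron resource, the remaining option is to make the wrapper itself hand off to /usr/bin/crontab in the right cases.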

This seems unresolved as of today. I'll try to take a look as soon as possible.

Today I did a puppet run in the fleet and I saw:

tools-bastion-02.tools.eqiad.wmflabs: 
tools-bastion-02.tools.eqiad.wmflabs: NOTE: some crontab entries have been modified to grid submissions.
tools-bastion-02.tools.eqiad.wmflabs:       You may want to examine the result with 'crontab -e'.
tools-bastion-02.tools.eqiad.wmflabs: 
tools-bastion-02.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
tools-bastion-02.tools.eqiad.wmflabs: /usr/local/bin/crontab: unable to execute remote crontab command
tools-bastion-05.tools.eqiad.wmflabs: 
tools-bastion-05.tools.eqiad.wmflabs: NOTE: some crontab entries have been modified to grid submissions.
tools-bastion-05.tools.eqiad.wmflabs:       You may want to examine the result with 'crontab -e'.
tools-bastion-05.tools.eqiad.wmflabs: 
tools-bastion-05.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
tools-bastion-05.tools.eqiad.wmflabs: /usr/local/bin/crontab: unable to execute remote crontab command
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-cloud) [2018-01-19T12:56:55Z] <arturo> the puppet status across the fleet seems good, only minor things like T185314 , T179388 and T179386

In tools-cron-01 I can see a lot of failed logins in /var/log/auth.log, such as:

Jan 19 12:40:04 tools-cron-01 sshd[31117]: Failed hostbased for root from 10.68.16.44 port 58230 ssh2: ECDSA SHA256:OfgR6GTw8ObBQ1LbS+6NBVik1eEXrpSUvRkKOueUnQc, client user "root", client host "tools-bastion-02.tools.eqiad.wmflabs"
Jan 19 12:40:04 tools-cron-01 sshd[31117]: Failed hostbased for root from 10.68.16.44 port 58230 ssh2: ED25519 SHA256:EXZBzUKRvNm6Rl6KXBXuJA2HCE+AnMvUyu6KTuT6mf4, client user "root", client host "tools-bastion-02.tools.eqiad.wmflabs"
Jan 19 12:40:04 tools-cron-01 sshd[31117]: Failed hostbased for root from 10.68.16.44 port 58230 ssh2: RSA SHA256:sK5IPOpWhcdkRiBNg5TiZNPKRrKTn76eVBt+6M7AJdg, client user "root", client host "tools-bastion-02.tools.eqiad.wmflabs"
Jan 19 12:40:04 tools-cron-01 sshd[31117]: Failed hostbased for root from 10.68.16.44 port 58230 ssh2: DSA SHA256:KxiwiLzTyt5OhiVNtUBmj94z4bFnZAsj4tGV3+w1DGo, client user "root", client host "tools-bastion-02.tools.eqiad.wmflabs"

and

Jan 19 12:40:04 tools-cron-01 sshd[31119]: Failed hostbased for root from 10.68.23.74 port 47716 ssh2: ECDSA SHA256:OfgR6GTw8ObBQ1LbS+6NBVik1eEXrpSUvRkKOueUnQc, client user "root", client host "tools-bastion-05.tools.eqiad.wmflabs"
Jan 19 12:40:04 tools-cron-01 sshd[31119]: Failed hostbased for root from 10.68.23.74 port 47716 ssh2: ED25519 SHA256:EXZBzUKRvNm6Rl6KXBXuJA2HCE+AnMvUyu6KTuT6mf4, client user "root", client host "tools-bastion-05.tools.eqiad.wmflabs"
Jan 19 12:40:04 tools-cron-01 sshd[31119]: Failed hostbased for root from 10.68.23.74 port 47716 ssh2: RSA SHA256:sK5IPOpWhcdkRiBNg5TiZNPKRrKTn76eVBt+6M7AJdg, client user "root", client host "tools-bastion-05.tools.eqiad.wmflabs"
Jan 19 12:40:04 tools-cron-01 sshd[31119]: Failed hostbased for root from 10.68.23.74 port 47716 ssh2: DSA SHA256:KxiwiLzTyt5OhiVNtUBmj94z4bFnZAsj4tGV3+w1DGo, client user "root", client host "tools-bastion-05.tools.eqiad.wmflabs"
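
For triage, the failures above can be tallied per client host. This is a rough sketch that assumes the log lines keep exactly the format pasted here; the regular expression and the count_failures helper are illustrative, not an existing tool.

```python
import re
from collections import Counter

# Matches the "Failed hostbased" lines in the format pasted above.
FAILED_HOSTBASED = re.compile(
    r'Failed hostbased for (?P<user>\S+) from (?P<ip>\S+) port \d+ ssh2: '
    r'(?P<keytype>\S+) \S+, client user "(?P<client_user>[^"]+)", '
    r'client host "(?P<client_host>[^"]+)"'
)


def count_failures(lines):
    """Count failed hostbased logins per (client host, client user)."""
    counts = Counter()
    for line in lines:
        match = FAILED_HOSTBASED.search(line)
        if match:
            counts[(match["client_host"], match["client_user"])] += 1
    return counts


# Three lines copied from the auth.log excerpt in this task.
SAMPLE_LINES = [
    'Jan 19 12:40:04 tools-cron-01 sshd[31117]: Failed hostbased for root from 10.68.16.44 port 58230 ssh2: ECDSA SHA256:OfgR6GTw8ObBQ1LbS+6NBVik1eEXrpSUvRkKOueUnQc, client user "root", client host "tools-bastion-02.tools.eqiad.wmflabs"',
    'Jan 19 12:40:04 tools-cron-01 sshd[31117]: Failed hostbased for root from 10.68.16.44 port 58230 ssh2: RSA SHA256:sK5IPOpWhcdkRiBNg5TiZNPKRrKTn76eVBt+6M7AJdg, client user "root", client host "tools-bastion-02.tools.eqiad.wmflabs"',
    'Jan 19 12:40:04 tools-cron-01 sshd[31119]: Failed hostbased for root from 10.68.23.74 port 47716 ssh2: ECDSA SHA256:OfgR6GTw8ObBQ1LbS+6NBVik1eEXrpSUvRkKOueUnQc, client user "root", client host "tools-bastion-05.tools.eqiad.wmflabs"',
]

failures = count_failures(SAMPLE_LINES)
```

Every failure in the excerpt has client user "root", which is consistent with the wrapper being invoked by root during the Puppet run.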

I will investigate how ssh keys are supposed to work here. Should I simply exchange public keys?

I think what is potentially happening here is that crontab in root's path is resolving to /usr/local/bin/crontab, which tries to do a remote crontab, but HBA (host-based authentication) potentially doesn't work in this circumstance. Should our crontab wrapper do a better job of handling the root user?

Yeah, reading up, this seems to say the same:

The Permission denied (publickey,hostbased). error is 100% related to the NOTE: some crontab entries have been modified to grid submissions. message. When /usr/local/bin/crontab is used instead of /usr/bin/crontab then the wrapper script uses ssh to connect to tools-cron-01.tools.eqiad.wmflabs to actually store the crontab contents instead of storing them on the localhost.

This is probably related to T170178: Update wikitech Titleblacklist and specifically T170178#3436626 where we had a problem with the username prometheus already being in use in our LDAP directory by a Cloud VPS user and the Prometheus package happily installed anyway.

tools-bastion-02.tools:~/projects/toolforge-survey
bd808$ id prometheus
uid=14736(prometheus) gid=500(wikidev) groups=500(wikidev)
tools-bastion-02.tools:~/projects/toolforge-survey
bd808$ ldap uid=prometheus
dn: uid=prometheus,ou=people,dc=wikimedia,dc=org
objectClass: inetOrgPerson
objectClass: posixAccount
uidNumber: 14736
gidNumber: 500
homeDirectory: /var/lib/prometheus
loginShell: /bin/false
uid: prometheus
cn: Prometheus daemon
sn: Prometheus daemon
description: Hack to clean up T170178

The wrapper currently identifies system users by uid < 500, because of T45795, and executes the local crontab via execv if that is the case. Some ways to fix or work around this:

  • Give prometheus a uid < 500. This would be messy: adduser, then find & chown on all cloud instances? If this could be done without any outages it would be awesome IMO.
  • Whitelist id -u prometheus => 14736 in the crontab command so that it considers prometheus to be a system user. This would be the easiest short-term solution, but I don't know whether future scripts will also have to distinguish between system users and cloud users and encounter the same issue.
  • Make root always access the local crontab instead of the remote one. The remote ssh may not work anyway, given the Permission denied (publickey,hostbased) error above, but ideally if root wants to read a normal cloud user's crontab they should not have to manually ssh into the cron host; the crontab command should do it for root.
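
The second and third options above can be sketched as a single decision function. This is only an illustration under stated assumptions: the names use_local_crontab, allowlisted_uids and exec_local_crontab are hypothetical, the function is not the code from the Python port, and only the 500 threshold and the execv hand-off come from this task.

```python
import os

# Threshold mentioned in this task for telling system users from cloud users.
SYSTEM_UID_LIMIT = 500
REAL_CRONTAB = "/usr/bin/crontab"


def use_local_crontab(caller_uid, allowlisted_uids=frozenset(), limit=SYSTEM_UID_LIMIT):
    """Decide whether to hand off to the local crontab binary.

    System callers (uid below the limit, which includes root at uid 0)
    and explicitly allowlisted uids stay local; everyone else goes remote.
    """
    return caller_uid < limit or caller_uid in allowlisted_uids


def exec_local_crontab(args):
    """Replace the current process with the real crontab (never returns)."""
    os.execv(REAL_CRONTAB, ["crontab", *args])


# root (uid 0) always stays local under this rule.
root_is_local = use_local_crontab(0)
# The LDAP-backed prometheus account (uid 14736) would go remote...
prometheus_is_local = use_local_crontab(14736)
# ...unless it is allowlisted, as in the second option.
prometheus_allowlisted = use_local_crontab(14736, allowlisted_uids={14736})
```

Keying the decision on the caller's uid covers the sudo crontab -u prometheus case, because the caller there is root; the allowlist would only matter for sudo -u prometheus crontab.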

In any case, it should be documented that before installing packages that may add users, the username should be checked in LDAP to see if LDAP will mask the username. We can rename the user in LDAP before the install, not after, which may be much more complicated.
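
A pre-install check along these lines could look like the sketch below. It assumes the host resolves LDAP accounts through the normal passwd lookup, as the id prometheus output in this task suggests, and it reuses the 500 threshold. The function name find_masking_account and the fake lookup table are hypothetical and exist only for illustration.

```python
import pwd
from collections import namedtuple


def find_masking_account(username, lookup=pwd.getpwnam, limit=500):
    """Return the uid of an existing non-system account with this name.

    Returns None when the name is unused or already belongs to a system
    account (uid below the limit). pwd.getpwnam raises KeyError when the
    name is unknown.
    """
    try:
        entry = lookup(username)
    except KeyError:
        return None
    return entry.pw_uid if entry.pw_uid >= limit else None


# Fake lookup standing in for a host where LDAP already defines prometheus.
FakeEntry = namedtuple("FakeEntry", "pw_uid")
FAKE_PASSWD = {"prometheus": FakeEntry(14736), "daemon": FakeEntry(1)}


def fake_lookup(name):
    """Mimic pwd.getpwnam: raise KeyError for unknown names."""
    return FAKE_PASSWD[name]


conflict_uid = find_masking_account("prometheus", lookup=fake_lookup)
no_conflict = find_masking_account("nosuchuser", lookup=fake_lookup)
```

On a real host the default lookup would be used, for example find_masking_account("prometheus"), before installing a package that creates that user.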

I'm not sure if under the hood Puppet is doing the equivalent of sudo crontab -u prometheus ... or sudo -u prometheus crontab ...

Looks like the former.

In any case, it should be documented that before installing packages that may add users, the username should be checked in LDAP to see if LDAP will mask the username. We can rename the user in LDAP before the install, not after, which may be much more complicated.

We can document this all we want, but it's not likely to actually happen for each and every Debian package introduced. I do agree, however, that it would have been easier to rename away the LDAP conflict before the local files owned by the package-generated user were set up as belonging to the existing LDAP user. It's kind of interesting to me that useradd -r prometheus from the deb didn't fail when a non-system UID user was found to already exist.

Change 405830 had a related patch set uploaded (by Zhuyifei1999; owner: Zhuyifei1999):
[labs/toollabs@master] crontab: Remove -u parameter, and make uid < 500 always edit local

https://gerrit.wikimedia.org/r/405830

Change 405830 merged by jenkins-bot:
[labs/toollabs@master] crontab: Remove -u parameter, and make uid < 500 always edit local

https://gerrit.wikimedia.org/r/405830

Mentioned in SAL (#wikimedia-cloud) [2018-01-25T05:25:38Z] <arturo> deploying misctools and jobutils 1.29 for T179386

Mentioned in SAL (#wikimedia-cloud) [2018-01-25T23:20:03Z] <arturo> T179386 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'

This bug seems fixed per last run. Closing it now :-) Thanks @bd808 and @zhuyifei1999