
Rancid on netmon1003 unable to login to network devices
Closed, Resolved. Public

Description

For some reason rancid has been unable to grab router/switch diffs in the last run.

Logs are full of this kind of thing:

cr2-eqsin.wikimedia.org jlogin error: Error: Check your password for cr2-eqsin.wikimedia.org

May be a filesystem permission issue, will investigate further

root@netmon1003:/etc/rancid# sudo -u rancid jlogin -c "show version" cr1-eqiad.wikimedia.org
cr1-eqiad.wikimedia.org
spawn ssh -x -l rancid cr1-eqiad.wikimedia.org
The authenticity of host 'cr1-eqiad.wikimedia.org (2620:0:861:ffff::1)' can't be established.
ECDSA key fingerprint is SHA256:bp2/Rq5JsmtMa8gdzRb0Mr8ZbHpOvaKbdE+6lK/0pJI.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Host cr1-eqiad.wikimedia.org added to the list of known hosts.
yes
Could not create directory '/var/lib/rancid/.ssh' (Permission denied).

Error: Check your password for cr1-eqiad.wikimedia.org

Event Timeline

cmooney changed the task status from Open to In Progress. Aug 10 2022, 1:31 PM
cmooney triaged this task as Medium priority.
cmooney created this task.

@cmooney Found an interesting behavior regarding the 'rancid' user:

topranks
The systemd file for rancid exports it as an environment var I think
topranks
Environment="SSH_AUTH_SOCK=/run/keyholder/proxy.sock"
topranks
Ok.. so interesting result
topranks
If I get a shell as the "rancid" user
topranks
Then export the above environment variable
topranks
Then SSH to one of the routers manually - I get logged in without any password
topranks
If I do the same but don't set the env variable then I get prompted for a password
topranks
When it works it seems to use the key from "/etc/keyholder.d/rancid"
topranks
And when I look in /var/lib/rancid/core/configs I see it has managed to pull the configs!!
topranks
Bunch of them were made there at 21:12 UTC (so like 4 mins back)
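
The manual test described in the chat above can be approximated with a short Python sketch. It only builds the environment and argument list for an ssh call that goes through the keyholder agent socket. The socket path and key path are the ones quoted in this task; the helper names agent_env and ssh_argv are hypothetical, and nothing here reflects how rancid itself launches ssh.

```python
import os

KEYHOLDER_SOCK = "/run/keyholder/proxy.sock"  # value quoted from the systemd unit
KEYHOLDER_KEY = "/etc/keyholder.d/rancid"     # key path mentioned in this task


def agent_env(base_env=None, sock=KEYHOLDER_SOCK):
    """Return a copy of the environment with SSH_AUTH_SOCK pointing at the agent."""
    env = dict(os.environ if base_env is None else base_env)
    env["SSH_AUTH_SOCK"] = sock
    return env


def ssh_argv(host, user="rancid", key=KEYHOLDER_KEY):
    """Build the ssh argument list for a manual login check (hypothetical helper)."""
    return [
        "ssh",
        "-o", "IdentitiesOnly=yes",
        "-o", f"IdentityFile={key}",
        "-l", user,
        host,
    ]
```

Run as the rancid user, something like subprocess.run(ssh_argv("cr1-eqiad.wikimedia.org"), env=agent_env()) should then behave like the manual test: with the variable set the agent offers the key, without it ssh falls back to prompting.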

Definitely an odd issue.

For comparison's sake we can see that netmon1002 was also trying to save the host key, but it continued after the failure, whereas netmon1003 was bombing out.

root@netmon1002:/var/lib/rancid/bin# sudo -u rancid SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/rancid  cr1-eqiad.wikimedia.org
Could not create directory '/var/lib/rancid/.ssh'.
The authenticity of host 'cr1-eqiad.wikimedia.org (2620:0:861:ffff::1)' can't be established.
ECDSA key fingerprint is SHA256:bp2/Rq5JsmtMa8gdzRb0Mr8ZbHpOvaKbdE+6lK/0pJI.
Are you sure you want to continue connecting (yes/no)? yes
Failed to add the host to the list of known hosts (/var/lib/rancid/.ssh/known_hosts).
Last login: Wed Aug 10 21:22:42 2022 from 2620:0:861:2:208:80:154:141
--- JUNOS 17.3R3-S8.1 Kernel 64-bit  JNPR-10.3-20200425.36498d6_buil
{master}
rancid@re0.cr1-eqiad>

There was some discussion on IRC, with interesting observations from Daniel about changes to OpenSSH between buster and bullseye which might account for the different behavior when creating the ".ssh" directory fails:

ssh 7.9:
        if (mkdir(buf, 0700) < 0)
                error("Could not create directory '%.200s'.",
                    buf);
ssh 8.4:
        if (mkdir(dotsshdir, 0700) == -1)
                error("Could not create directory '%.200s' (%s).",
                    dotsshdir, strerror(errno));
(git clone https://salsa.debian.org/ssh-team/openssh ; git checkout bullseye    vs   git checkout buster)
previously this was in ssh.c  and now it's in hostfile.c

Not 100% sure whether the change above is the cause. But it seems likely that a change in OpenSSH now causes the process to exit if it can't create the ".ssh" dir for the known_hosts file, whereas previously the failure happened but the exchange continued.

I believe @andrea.denisse is going to make changes to puppet to ensure that /var/lib/rancid is owned by the rancid user, which will prevent this.

@ayounsi do you anticipate any fallout from this? Previously device host keys were not getting saved to known_hosts by rancid. So if we replaced a device and it came back with the same name, rancid would still work. This change might cause the connection to fail in such a circumstance. I reckon it's likely ok: we will get an email and can manually delete the known_hosts file or entry. Alternatively we could change ssh_config to not check host keys, but it's probably best to avoid that I reckon.
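
For the "manually delete the entry" case, a minimal Python sketch of the clean-up call might look like the following. It relies on ssh-keygen's -R option, which removes all keys belonging to a hostname from a known_hosts file, and -f to select the file. The helper name forget_host_argv is hypothetical, and the default path is the known_hosts location shown in the ssh output earlier in this task.

```python
RANCID_KNOWN_HOSTS = "/var/lib/rancid/.ssh/known_hosts"  # path taken from the ssh output in this task


def forget_host_argv(hostname, known_hosts=RANCID_KNOWN_HOSTS):
    """Build an ssh-keygen call that drops every key stored for hostname."""
    # Guard against empty names or values that ssh-keygen would parse as options.
    if not hostname or hostname.startswith("-"):
        raise ValueError(f"refusing suspicious hostname: {hostname!r}")
    return ["ssh-keygen", "-R", hostname, "-f", known_hosts]
```

The resulting list can be passed to subprocess.run as the rancid user after a device has been replaced. Using ssh-keygen rather than editing the file by hand also copes with hashed known_hosts entries.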

Confirmed this. The behaviour changed in the newer openssh version in bullseye it seems.

On buster we have 7.9, on bullseye we have 8.4

In buster we have in ssh.c

if (mkdir(buf, 0700) < 0)
        error("Could not create directory '%.200s'.",
            buf);

in bullseye we have, in hostfile.c:

if (mkdir(dotsshdir, 0700) == -1)
        error("Could not create directory '%.200s' (%s).",
            dotsshdir, strerror(errno));

(Checked with: git clone https://salsa.debian.org/ssh-team/openssh ; git checkout bullseye vs git checkout buster)
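
As quoted, the two snippets differ in the error text: 8.4 appends the strerror string, which matches the "(Permission denied)" seen on netmon1003. The Python sketch below only mirrors that 8.4 message format so the failure can be reproduced outside ssh. The function name try_create_dotssh is hypothetical, treating an already existing directory as success is an assumption, and the sketch does not model what ssh does after the failure.

```python
import os


def try_create_dotssh(dotsshdir):
    """Attempt mkdir(dotsshdir, 0700); return None on success or an error string."""
    try:
        os.mkdir(dotsshdir, 0o700)
    except FileExistsError:
        # Assumption: an existing directory is not an error worth reporting.
        return None
    except OSError as exc:
        # Same shape as the ssh 8.4 message, including the strerror text.
        return "Could not create directory '%.200s' (%s)." % (dotsshdir, exc.strerror)
    return None
```

Calling try_create_dotssh("/var/lib/rancid/.ssh") as the rancid user on a host where /var/lib/rancid is not writable by that user would be expected to return the same kind of message as in the jlogin output.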

Change 822196 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Create the OpenSSH directory inside the rancid home directory

https://gerrit.wikimedia.org/r/822196

Hello team, after further testing, the least disruptive and simplest approach is to create the .ssh directory using Puppet.

It needs '700' permissions and user:group rancid:rancid. Once the directory is created, both the rancid-differ service and a manual invocation (sudo -u rancid SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/rancid cr1-eqiad.wikimedia.org) work as expected.
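
A quick way to verify those requirements on a host could be a check like the Python sketch below. It is not part of the Puppet patch; the function name dotssh_problems and its messages are hypothetical, and the expected uid/gid are passed in by the caller.

```python
import os
import stat


def dotssh_problems(path, expected_uid, expected_gid, expected_mode=0o700):
    """Return a list of problems; an empty list means the directory looks right."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]
    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path} is not a directory")
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if mode != expected_mode:
        problems.append(f"mode is {mode:o}, expected {expected_mode:o}")
    if st.st_uid != expected_uid or st.st_gid != expected_gid:
        problems.append(
            f"owner is {st.st_uid}:{st.st_gid}, expected {expected_uid}:{expected_gid}"
        )
    return problems
```

On netmon1003 the ids could be looked up with pwd.getpwnam("rancid").pw_uid and grp.getgrnam("rancid").gr_gid and the path would be /var/lib/rancid/.ssh.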

Special thanks to @Dzahn and @cmooney for their help with troubleshooting the issue.

The patch #822196 fixes the issue.

@ayounsi do you anticipate any fallout from this?

I agree that it's better to check host keys, so +1 as long as:

  • there is some kind of alerting/notification, which I think is already there as Rancid will email if it can't reach a device
  • the doc explains that behavior (and how to fix it)

There are some very long term ideas of fetching and centrally managing network devices host keys, but we're not there yet :)

Agreed on the short term fix to create the .ssh directory. However, if we were not checking host keys to begin with, I think we should keep not checking them until we can do so reliably. If it is easy to specify ssh options for rancid then we could add -o UserKnownHostsFile=/dev/null which AFAIK does the right thing.

@fgiunchedi yeah that may be an option. I'm not sure how easy it is to change Rancid to add that to the command when running ssh, but I'm sure we could add UserKnownHostsFile /dev/null to /var/lib/rancid/.ssh/config, which ought to have the same effect.

Another oddity here with rancid from netmon1003.

The permission change has removed the problem for most of our estate (all the Juniper devices). However for some reason it is still reporting an issue for our OpenGear console devices:

-----Original Message-----
From: rancid@netmon1003.wikimedia.org <rancid@netmon1003.wikimedia.org> 
Sent: Thursday 11 August 2022 12:41
To: rancid-admin-core@wikimedia.org
Subject: config fetcher problems - core

The following routers have not been successfully contacted for more than 24 hours.
-rw-r--r-- 1 rancid rancid 10556 Aug  9 12:17 scs-eqsin.mgmt.eqsin.wmnet
-rw-r--r-- 1 rancid rancid 10950 Aug  9 12:17 scs-oe16-esams.mgmt.esams.wmnet
-rw-r--r-- 1 rancid rancid 11298 Aug  9 12:17 scs-ulsfo.mgmt.ulsfo.wmnet
-rw-r--r-- 1 rancid rancid 23184 Aug  9 12:17 scs-a8-eqiad.mgmt.eqiad.wmnet
-rw-r--r-- 1 rancid rancid 24948 Aug  9 12:17 scs-a1-codfw.mgmt.codfw.wmnet
-rw-r--r-- 1 rancid rancid 26295 Aug  9 12:17 scs-c1-eqiad.mgmt.eqiad.wmnet
-rw-r--r-- 1 rancid rancid 27487 Aug  9 12:17 scs-c1-codfw.mgmt.codfw.wmnet
-rw-r--r-- 1 rancid rancid 9043 Aug  9 12:17 scs-drmrs.mgmt.drmrs.wmnet

This does not seem to be due to a problem logging on via SSH:

cmooney@netmon1003:~$ sudo -u rancid SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/rancid scs-c1-eqiad.mgmt.eqiad.wmnet
$
$ uname -a
Linux scs-c1-eqiad.mgmt.eqiad.wmnet 3.10.0-uc0 #1 Wed May 26 07:08:02 UTC 2021 armv7l unknown
$
$ exit
logout
Connection to scs-c1-eqiad.mgmt.eqiad.wmnet closed.

So I'm unsure what is happening, but it clearly points to some quirk with the OpenGears.

Logs suggest a timeout:

scs-oe16-esams.mgmt.esams.wmnet oglogin error: Error: TIMEOUT reached
scs-oe16-esams.mgmt.esams.wmnet: missed cmd(s): config -g config,cat /proc/bus/usb/devices,cat /etc/version,cat /etc/config/shadow
scs-oe16-esams.mgmt.esams.wmnet: End of run not found

But this seems to work ok manually:

cmooney@netmon1003:/$ sudo -u rancid bash
rancid@netmon1003:/$ export SSH_AUTH_SOCK=/run/keyholder/proxy.sock
rancid@netmon1003:/$ /var/lib/rancid/bin/oglogin -c "cat /etc/version" scs-a8-eqiad.mgmt.eqiad.wmnet
scs-a8-eqiad.mgmt.eqiad.wmnet
spawn ssh -c aes256-ctr -x -l rancid scs-a8-eqiad.mgmt.eqiad.wmnet
$
$  cat /etc/version
OpenGear/CM41xx Version 3.16.6u4 9077fc8f --  Fri Mar 10 08:38:27 EST 2017
$exit
logout
Connection to scs-a8-eqiad.mgmt.eqiad.wmnet closed.
rancid@netmon1003:/$

I believe the issue is that the expect script Rancid is running for these is not saying "yes" to accept the host key. This did not happen in my above check as I'd manually ssh'd to the device and accepted it myself prior to running oglogin. Trying that for another SCS (host key not saved yet) reveals what's going on:

rancid@netmon1003:/$ /var/lib/rancid/bin/oglogin -t 30 -c "config -g config;cat /proc/bus/usb/devices;cat /etc/config/shadow;cat /etc/version" scs-c1-codfw.mgmt.codfw.wmnet
scs-c1-codfw.mgmt.codfw.wmnet
spawn ssh -c aes256-ctr -x -l rancid scs-c1-codfw.mgmt.codfw.wmnet
The authenticity of host 'scs-c1-codfw.mgmt.codfw.wmnet (10.193.0.15)' can't be established.
ECDSA key fingerprint is SHA256:qnnBsM9OKwY9QJJcKqZhYiQbQeWI0y/zuKrfNIJPrXU.
Are you sure you want to continue connecting (yes/no/[fingerprint])?
Error: TIMEOUT reached
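
To spot this failure mode in captured output without reading every transcript by hand, a small Python sketch like the one below could be used. It simply looks for the OpenSSH host-key prompt followed by the timeout error seen above. The function name stalled_on_host_key is hypothetical, and the matched strings are the ones that appear in this task.

```python
import re

# Matches both prompt variants seen in this task: "(yes/no)" and "(yes/no/[fingerprint])".
HOST_KEY_PROMPT = re.compile(
    r"Are you sure you want to continue connecting \(yes/no(?:/\[fingerprint\])?\)\?"
)


def stalled_on_host_key(transcript):
    """True if the session shows the host-key prompt followed by a timeout."""
    prompt = HOST_KEY_PROMPT.search(transcript)
    if prompt is None:
        return False
    # Only text after the prompt counts, so an earlier unrelated timeout is ignored.
    return "Error: TIMEOUT reached" in transcript[prompt.end():]
```

Fed the oglogin output above it returns True, while a session where the prompt was answered and the login proceeded returns False.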

Creating an SSH config file for the rancid user as follows:

rancid@netmon1003:~$ cat /var/lib/rancid/.ssh/config
Host *
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null

Causes it to work:

rancid@netmon1003:~$ /var/lib/rancid/bin/oglogin -t 30 -c "config -g config;cat /proc/bus/usb/devices;cat /etc/config/shadow;cat /etc/version" scs-c1-codfw.mgmt.codfw.wmnet
scs-c1-codfw.mgmt.codfw.wmnet
spawn ssh -c aes256-ctr -x -l rancid scs-c1-codfw.mgmt.codfw.wmnet
Warning: Permanently added 'scs-c1-codfw.mgmt.codfw.wmnet,10.193.0.15' (ECDSA) to the list of known hosts.
$
$  config -g config
config.alerts.migrated on
config.auth.extendedsessionids on
config.auth.type Local
<-- output cut--->

So I think possibly @fgiunchedi's suggestion to leave host key checking off might be the best way forward, and have puppet create an SSH config file similar to the one I made manually?
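
As an illustration of what a managed file would need to contain, here is a minimal Python sketch that renders the same three lines as the manually created config. It is not the Puppet change itself, whose content is not shown in this task; the function name render_ssh_config is hypothetical, and the default options are the two from the manual file.

```python
def render_ssh_config(host_pattern="*", options=None):
    """Render a single-Host ssh_config stanza as text."""
    if options is None:
        # Defaults copied from the manually created /var/lib/rancid/.ssh/config.
        options = {
            "StrictHostKeyChecking": "no",
            "UserKnownHostsFile": "/dev/null",
        }
    lines = [f"Host {host_pattern}"]
    lines += [f"  {key} {value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"
```

Note the trade-off discussed earlier in the thread: with these options ssh no longer verifies device host keys at all, which keeps rancid working across device replacements at the cost of that check.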

cmooney renamed this task from Rancid unable to login to network devices to Rancid on netmon1003 unable to login to network devices. Aug 11 2022, 4:22 PM

Change 822196 merged by Andrea Denisse:

[operations/puppet@production] netmon: Create the OpenSSH directory inside the rancid home directory

https://gerrit.wikimedia.org/r/822196

@andrea.denisse Hey, does your patch correct the other problem I observed above, where the prompt for accepting the host key causes oglogin not to work for SCS devices (Opengear serial console servers)?

It's fixed manually so netmon1003 is working, but the problem will happen again if we create another one I think. The fix I'd suggest is in T314936#8145564

@cmooney Thanks for the heads-up, I missed that part, my bad.

Change 824299 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add the OpenSSH configuration file inside the rancid home directory

https://gerrit.wikimedia.org/r/824299

Change 824417 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:rancid: Drop unneeded dependencies

https://gerrit.wikimedia.org/r/824417

Change 824299 merged by Andrea Denisse:

[operations/puppet@production] netmon: Add the OpenSSH configuration file inside the rancid home directory

https://gerrit.wikimedia.org/r/824299

Change 824417 merged by Andrea Denisse:

[operations/puppet@production] C:rancid: Drop unneeded dependencies

https://gerrit.wikimedia.org/r/824417