
deployment-tin ssh: Connection closed by UNKNOWN
Closed, Resolved · Public

Description

I cannot log into deployment-tin anymore; it drops the ssh connection with:

Connection closed by UNKNOWN

Same message as on T107403 (tools labs).

Impact: Jenkins is no longer able to add the instance as a slave, and the jobs that update the beta cluster no longer run.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. May 9 2016, 3:59 PM

A line in pam access.conf has some extra whitespace. Maybe that is a cause?

root@deployment-salt:~ # salt -v 'deployment-tin*' cmd.run 'cat /etc/security/access.conf'
Executing job with jid 20160509160005881034
-------------------------------------------

deployment-tin.deployment-prep.eqiad.wmflabs:
    +:ALL:LOCAL
    + : deploy-service mwdeploy : 10.68.17.240
    -:ALL EXCEPT (project-deployment-prep) root:ALL

Krenair added a subscriber: Krenair. May 9 2016, 4:01 PM

This has been causing issues for me on and off for the past week (?) or so.
It works if you connect as root directly.

I've gotten these intermittently too.

I think this is a race condition caused by my patch (cherry-picked on beta)

I'll just abandon it.

Mattflaschen-WMF triaged this task as High priority. May 13 2016, 10:53 PM

I removed the cherry-picked patch on Friday. Is anyone still experiencing this? If so, then it wasn't caused by my patch after all.

It's happening now.

A line in pam access.conf has some extra whitespace. Maybe that is a cause?

root@deployment-salt:~ # salt -v 'deployment-tin*' cmd.run 'cat /etc/security/access.conf'
Executing job with jid 20160509160005881034
-------------------------------------------
deployment-tin.deployment-prep.eqiad.wmflabs:
    +:ALL:LOCAL
    + : deploy-service mwdeploy : 10.68.17.240
    -:ALL EXCEPT (project-deployment-prep) root:ALL

Changing that doesn't appear to make a difference.
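For reference, a minimal sketch of how the rules above break down into fields. It assumes, as the examples in access.conf(5) suggest, that whitespace around the colons is not significant, which would be consistent with the whitespace change making no difference. `parse_access_rule` is a hypothetical helper, not part of pam_access, and it does not handle origins that themselves contain colons (for example IPv6 addresses).

```shell
# Hypothetical helper: split one access.conf rule into its three
# colon-separated fields (permission : users/groups : origins) and
# trim the whitespace around each field.
parse_access_rule() {
    # $1: one rule line as it appears in /etc/security/access.conf
    printf '%s\n' "$1" | awk -F':' '{
        for (i = 1; i <= 3; i++) {
            gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        printf "permission=%s users=%s origins=%s\n", $1, $2, $3
    }'
}

# The two non-trivial rules from the output above.
parse_access_rule '+ : deploy-service mwdeploy : 10.68.17.240'
parse_access_rule '-:ALL EXCEPT (project-deployment-prep) root:ALL'
```

Read this way, the last rule denies everyone from every origin except root and members of the project-deployment-prep group, so whatever the host resolves that group to decides who can log in.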

Might or might not be related, I have noticed puppet being weird:

Notice: /Stage[main]/Mediawiki::Scap/File[/srv/mediawiki]/group: group changed 'mwdeploy' to 'mwdeploy'

That is the infamous case of LDAP not being reachable, so puppet "kindly" creates a local user / group with a different uid/gid.

Local group has GID 997:

root@deployment-tin:~# grep mwdeploy /etc/group
mwdeploy:x:997:

LDAP is different:

root@deployment-tin:~# ldaplist -l group mwdeploy

dn: cn=mwdeploy,ou=groups,dc=wikimedia,dc=org
        objectClass: groupOfNames
        objectClass: posixGroup
        objectClass: top
        member: uid=mwdeploy,ou=people,dc=wikimedia,dc=org
        gidNumber: 603
        cn: mwdeploy
root@deployment-tin:~#

That one is T73480: Prevent puppet from creating local user when they are defined in LDAP.

I have manually deleted the local mwdeploy group from /etc/group.
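A rough sketch of how the local-versus-LDAP comparison above could be scripted, in case it helps spot other shadowed groups. `gid_of_entry` and `check_gid_mismatch` are hypothetical helpers, and the `ldap` service name passed to `getent -s` is an assumption about how nsswitch.conf is set up on the instance.

```shell
# Hypothetical helper: print the GID (third field) of a group(5)-style
# entry such as "mwdeploy:x:997:".
gid_of_entry() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Hypothetical helper: compare the GID from local files with the GID from
# another NSS source. "ldap" as the service name is an assumption.
check_gid_mismatch() {
    group=$1
    local_gid=$(gid_of_entry "$(getent -s files group "$group")")
    ldap_gid=$(gid_of_entry "$(getent -s ldap group "$group")")
    # Report only when both sources know the group and disagree.
    if [ -n "$local_gid" ] && [ -n "$ldap_gid" ] && [ "$local_gid" != "$ldap_gid" ]; then
        echo "MISMATCH $group: local=$local_gid ldap=$ldap_gid"
    fi
}

# Usage on the affected host:
#   check_gid_mismatch mwdeploy
```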

Just ran into this again. I realized I already had a shell open to deployment-tin in a different window. Tailed /var/log/auth.log while trying to ssh from the other window:

May 23 19:18:18 deployment-tin sshd[24705]: Connection from 10.68.17.232 port 47530 on 10.68.17.240 port 22
May 23 19:18:18 deployment-tin sshd[24705]: Failed publickey for thcipriani from 10.68.17.232 port 47530 ssh2: RSA 59:bb:a6:a4:eb:f6:21:50:78:ec:b9:93:f1:b4:05:3b
May 23 19:18:18 deployment-tin sshd[24705]: Postponed publickey for thcipriani from 10.68.17.232 port 47530 ssh2 [preauth]
May 23 19:18:20 deployment-tin sshd[24705]: pam_access(sshd:account): access denied for user `thcipriani' from `bastion-01.bastion.eqiad.wmflabs'
May 23 19:18:20 deployment-tin sshd[24705]: Failed publickey for thcipriani from 10.68.17.232 port 47530 ssh2: RSA aa:12:c4:c6:33:be:93:00:6c:94:a3:7f:d7:2c:1d:1a
May 23 19:18:20 deployment-tin sshd[24705]: fatal: Access denied for user thcipriani by PAM account configuration [preauth]

fatal: Access denied for user thcipriani by PAM account configuration [preauth]

That seems to point at the /etc/security/access.conf rules, I would guess.

I think the likely culprit would be LDAP lookup failures.
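To see who is being rejected and from where, the pam_access lines in the log excerpt above can be pulled out with a small filter. This is only a sketch: `pam_access_denials` is a hypothetical helper, and the pattern assumes the exact message format shown in that excerpt.

```shell
# Hypothetical helper: read auth.log lines on standard input and print
# "user host" for every pam_access denial logged for sshd.
pam_access_denials() {
    sed -n "s/.*pam_access(sshd:account): access denied for user \`\([^']*\)' from \`\([^']*\)'.*/\1 \2/p"
}

# Usage on the affected host (log path as in the comment above):
#   pam_access_denials < /var/log/auth.log | sort | uniq -c
```

Counting denials per user this way might show whether the failures hit everyone or only some accounts.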

thcipriani closed this task as Resolved. May 24 2016, 4:37 PM
thcipriani claimed this task.

I think I found the problem this morning. There was a local project-deployment-prep group that was shadowing the LDAP group. This caused the pam_access.so module (the one logging the denials in auth.log) to deny access to anyone who wasn't in the local group on deployment-tin:

thcipriani@deployment-salt:~$ sudo salt 'deployment-tin*' cmd.run 'getent group project-deployment-prep'
deployment-tin.deployment-prep.eqiad.wmflabs:
    project-deployment-prep:x:52947:twentyafterfour
thcipriani@deployment-salt:~$ sudo salt 'deployment-tin*' cmd.run 'groupdel project-deployment-prep'
deployment-tin.deployment-prep.eqiad.wmflabs:
thcipriani@deployment-salt:~$ sudo salt 'deployment-tin*' cmd.run 'getent group project-deployment-prep'
deployment-tin.deployment-prep.eqiad.wmflabs:
    project-deployment-prep:*:50120:[lots-of-names]

I'm unclear on why this problem went away occasionally and then came back. I'm closing this task for now because all the beta-* jobs seem happy now, but it may need to be reopened if this recurs.

Well that explains why I never experienced the issue. :-O
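The effect of the shadowing group can be illustrated with a small membership check against the `getent` output quoted above. `is_member` is a hypothetical helper; it only looks at the explicit member list of a group entry and ignores primary-group membership, so treat it as a sketch rather than a faithful model of what pam_access does.

```shell
# Hypothetical helper: succeed if user $2 appears in the member list
# (fourth field) of the group(5)-style entry $1.
is_member() {
    members=$(printf '%s\n' "$1" | cut -d: -f4)
    case ",$members," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The local entry that shadowed the LDAP group, copied from the output above.
entry='project-deployment-prep:x:52947:twentyafterfour'
is_member "$entry" twentyafterfour && echo "twentyafterfour: in local group"
is_member "$entry" thcipriani || echo "thcipriani: not in local group"
```

With only one member in the local entry, everyone else falls through to the final deny rule in access.conf.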