While migrating the tools-worker-* nodes to sssd, I found that the sssd version in Jessie doesn't support the config we use for Stretch. The config diff seems to be minimal, but it is enough to create problems.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | yuvipanda | T130446 Unable to SSH onto tools-login.wmflabs.org
Resolved | akosiaris | T130593 investigate slapd memory leak
Resolved | aborrero | T217280 LDAP server running out of memory frequently and disrupting Cloud VPS clients
Resolved | aborrero | T224558 sssd: support for Debian Jessie
Event Timeline
Change 513091 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] ldap client: sssd: introduce jessie-specific sssd.conf
Change 513091 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] ldap client: sssd: introduce jessie-specific bits in sssd.conf
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T09:58:16Z] <arturo> T224558 reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:01:56Z] <arturo> T224558 add tools-worker-1029 to the nodes pool of k8s
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:09:43Z] <arturo> T224558 disable puppet in all tools-worker- nodes
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:27:19Z] <arturo> T224558 switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:28:27Z] <arturo> T224558 use hiera config in prefix tools-worker for sssd/sudo
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:33:24Z] <arturo> T224558 switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:48:21Z] <arturo> T224558 drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T11:23:19Z] <arturo> T224558 depool tools-worker-1003
Issues continue despite the patch https://gerrit.wikimedia.org/r/513091.
I now have 2 more clues:
- /etc/ldap.yaml, which is used by the ssh key lookup tool, is a directory instead of a regular file in a freshly created jessie VM:
root@tools-worker-1002:~# /usr/sbin/ssh-key-ldap-lookup aborrero
Traceback (most recent call last):
  File "/usr/sbin/ssh-key-ldap-lookup", line 138, in <module>
    main()
  File "/usr/sbin/ssh-key-ldap-lookup", line 114, in main
    with open('/etc/ldap.yaml') as f:
IOError: [Errno 21] Is a directory: '/etc/ldap.yaml'
root@tools-worker-1002:~# file /etc/ldap.yaml
/etc/ldap.yaml: directory
That prevents normal SSH logins for every user.
- there seems to be a missing PAM package. Package dependencies are probably set up differently between jessie and stretch.
May 30 11:29:46 tools-worker-1003 sshd[6772]: PAM unable to dlopen(pam_ldap.so): /lib/security/pam_ldap.so: cannot open shared object file: No such file or directory
This is why nothing works after changing to sssd/sudo, including our virsh console trick.
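The first clue (an `open()` call failing because /etc/ldap.yaml is a directory) can be caught with a small pre-flight check. This is a minimal Python 3 sketch, not the code of ssh-key-ldap-lookup: the helper names `describe_path` and `load_config_text` are hypothetical, and the only thing taken from the ticket is the /etc/ldap.yaml path.

```python
import os
import stat


def describe_path(path):
    """Return 'file', 'directory', 'missing' or 'other' for the given path."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def load_config_text(path):
    """Read a config file, failing with a clear message if it is not a regular file."""
    kind = describe_path(path)
    if kind != "file":
        raise RuntimeError("%s must be a regular file, found: %s" % (path, kind))
    with open(path) as f:
        return f.read()


# Reports the state of the path from the ticket without opening it.
print(describe_path("/etc/ldap.yaml"))
```

On a host in the broken state shown above, `describe_path` would report the directory case instead of raising the less obvious `IOError: [Errno 21]`.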
Change 513272 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] sssd: include the /etc/ldap.yaml file
Change 513272 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] sssd: include the /etc/ldap.yaml file
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T11:59:35Z] <arturo> T224558 repool tools-worker-1003 (using sssd/sudo now!)
The PAM issue may require running pam-auth-update --force --package on the server, because there are stale entries in the PAM config pointing to pam_ldap.so, which we no longer use after switching to sssd.
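To confirm which PAM files still carry those stale entries before forcing a regeneration, something like the following could be run on an affected node. It is a rough Python 3 sketch under assumptions: the function name `find_stale_pam_entries` is hypothetical, /etc/pam.d is assumed to be the PAM config directory (the Debian default), and the scan is a plain text match rather than a real PAM parser.

```python
import os


def find_stale_pam_entries(pam_dir="/etc/pam.d", module="pam_ldap.so"):
    """Return (path, line number, line) for each active line that references module."""
    hits = []
    for name in sorted(os.listdir(pam_dir)):
        path = os.path.join(pam_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                stripped = line.strip()
                # Skip blank lines and comments; only active entries matter.
                if not stripped or stripped.startswith("#"):
                    continue
                # Match the module given as a bare name or as a full path.
                tokens = stripped.split()
                if any(os.path.basename(tok) == module for tok in tokens):
                    hits.append((path, lineno, stripped))
    return hits


if os.path.isdir("/etc/pam.d"):
    for path, lineno, text in find_stale_pam_entries():
        print("%s:%d: %s" % (path, lineno, text))
```

An empty result after the pam-auth-update run would suggest the stale pam_ldap.so references are gone.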
Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:29:29Z] <arturo> switch hiera setting back to classic/sudoldap for tools-worker because T224651 (T224558)
Current status as of this comment: the only Toolforge Debian Jessie server running sssd is tools-worker-1029, which was created explicitly for testing it.
The rest of the Jessie systems aren't running sssd, but the classic stack (nslcd/nscd/sudoldap):
tools-elastic-01.tools.eqiad.wmflabs
tools-elastic-02.tools.eqiad.wmflabs
tools-elastic-03.tools.eqiad.wmflabs
tools-flannel-etcd-01.tools.eqiad.wmflabs
tools-flannel-etcd-02.tools.eqiad.wmflabs
tools-flannel-etcd-03.tools.eqiad.wmflabs
tools-k8s-etcd-01.tools.eqiad.wmflabs
tools-k8s-etcd-02.tools.eqiad.wmflabs
tools-k8s-etcd-03.tools.eqiad.wmflabs
tools-k8s-master-01.tools.eqiad.wmflabs
tools-prometheus-01.tools.eqiad.wmflabs
tools-prometheus-02.tools.eqiad.wmflabs
tools-proxy-03.tools.eqiad.wmflabs
tools-proxy-04.tools.eqiad.wmflabs
tools-puppetmaster-01.tools.eqiad.wmflabs
tools-redis-1001.tools.eqiad.wmflabs
tools-redis-1002.tools.eqiad.wmflabs
tools-worker-1001.tools.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs
tools-worker-1003.tools.eqiad.wmflabs
tools-worker-1004.tools.eqiad.wmflabs
tools-worker-1005.tools.eqiad.wmflabs
tools-worker-1006.tools.eqiad.wmflabs
tools-worker-1007.tools.eqiad.wmflabs
tools-worker-1008.tools.eqiad.wmflabs
tools-worker-1009.tools.eqiad.wmflabs
tools-worker-1010.tools.eqiad.wmflabs
tools-worker-1011.tools.eqiad.wmflabs
tools-worker-1012.tools.eqiad.wmflabs
tools-worker-1013.tools.eqiad.wmflabs
tools-worker-1014.tools.eqiad.wmflabs
tools-worker-1015.tools.eqiad.wmflabs
tools-worker-1016.tools.eqiad.wmflabs
tools-worker-1017.tools.eqiad.wmflabs
tools-worker-1018.tools.eqiad.wmflabs
tools-worker-1019.tools.eqiad.wmflabs
tools-worker-1020.tools.eqiad.wmflabs
tools-worker-1021.tools.eqiad.wmflabs
tools-worker-1022.tools.eqiad.wmflabs
tools-worker-1023.tools.eqiad.wmflabs
tools-worker-1025.tools.eqiad.wmflabs
tools-worker-1026.tools.eqiad.wmflabs
tools-worker-1027.tools.eqiad.wmflabs
tools-worker-1028.tools.eqiad.wmflabs
Lowering the priority of this ticket, since it involves more work than originally thought.
Change 528178 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/docker-images/toollabs-images@master] docker: add support for "testing" tags
Change 528178 merged by jenkins-bot:
[operations/docker-images/toollabs-images@master] docker: add support for "testing" tags