
Toolforge Instances (maybe only Jessie?) are having issues with NFS/LDAP
Closed, Resolved, Public

Description

-bash: /home/rush/.bash_login: Operation not permitted
rush@tools-worker-1004:~$

T189001 and T188998 seem to be examples

Event Timeline

Current situation: we drained and rebooted 1001-1012, but the problems reported in other tasks affected nodes outside of that range (1020). guc, which was restarted via 'webservice restart --backend=kubernetes', seems to be running fine now on 1019.

root@tools-worker-1004:~# su - madhuvishy
-su: /home/madhuvishy/.bash_profile: Operation not permitted
madhuvishy@tools-worker-1004:~$

The replag tool was also reporting that it could not read replica.my.cnf; a restart of the webservice seems to have brought it back online on 1007.

bd808> !log tools.orphantalk Restarting webservice (T188998)
chasemp> bd808: did that start working post restart and if so on what worker?
bd808> chasemp: yes, it is working now and its on ... tools-worker-1003.tools.eqiad.wmflabs
bd808> it was on tools-worker-1013.tools.eqiad.wmflabs before restart
chasemp> bd808: ok ack and that new one is one of the newly rebooted

Today I was doing package upgrades on jessie nodes in toolforge, all recorded in SAL and in T188994.

Upgraded packages are:

aborrero@tools-worker-1020:~$ tail /var/log/apt/history.log | grep Upgrade | sed s/"Upgrade: "//g | sed s/"), "/")"'\n'/g | sort 
base-files:amd64 (8+deb8u5, 8+deb8u10)
bash:amd64 (4.3-11+b1, 4.3-11+deb8u1)
binutils:amd64 (2.25-5, 2.25-5+deb8u1)
ca-certificates:amd64 (20141019+deb8u1, 20141019+deb8u3)
dbus:amd64 (1.8.20-0+deb8u1, 1.8.22-0+deb8u1)
debconf:amd64 (1.5.56, 1.5.56+deb8u1)
debconf-i18n:amd64 (1.5.56, 1.5.56+deb8u1)
debian-archive-keyring:amd64 (2014.3, 2017.5~deb8u1)
e2fslibs:amd64 (1.42.12-1.1, 1.42.12-2+b1)
e2fsprogs:amd64 (1.42.12-1.1, 1.42.12-2+b1)
file:amd64 (5.22+15-2+deb8u1, 5.22+15-2+deb8u3)
gnupg2:amd64 (2.0.26-6, 2.0.26-6+deb8u1)
gnupg-agent:amd64 (2.0.26-6, 2.0.26-6+deb8u1)
initramfs-tools:amd64 (0.120+deb8u2, 0.120+deb8u3)
jq:amd64 (1.4-2.1, 1.4-2.1+deb8u1)
krb5-locales:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libcairo2:amd64 (1.14.0-2.1+deb8u1, 1.14.0-2.1+deb8u2)
libcairo-gobject2:amd64 (1.14.0-2.1+deb8u1, 1.14.0-2.1+deb8u2)
libc-ares2:amd64 (1.10.0-2+deb8u1, 1.10.0-2+deb8u2)
libcomerr2:amd64 (1.42.12-1.1, 1.42.12-2+b1)
libcups2:amd64 (1.7.5-11+deb8u1, 1.7.5-11+deb8u2)
libdb5.3:amd64 (5.3.28-9, 5.3.28-9+deb8u1)
libdbus-1-3:amd64 (1.8.20-0+deb8u1, 1.8.22-0+deb8u1)
libgnutls-deb0-28:amd64 (3.3.8-6+deb8u6, 3.3.8-6+deb8u7)
libgnutls-openssl27:amd64 (3.3.8-6+deb8u6, 3.3.8-6+deb8u7)
libgssapi-krb5-2:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libgtk2.0-0:amd64 (2.24.25-3+deb8u1, 2.24.25-3+deb8u2)
libgtk2.0-common:amd64 (2.24.25-3+deb8u1, 2.24.25-3+deb8u2)
libhogweed2:amd64 (2.7.1-5+deb8u1, 2.7.1-5+deb8u2)
libicu52:amd64 (52.1-8+deb8u5, 52.1-8+deb8u6)
libk5crypto3:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libkrb5-3:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libkrb5support0:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libltdl7:amd64 (2.4.2-1.11, 2.4.2-1.11+b1)
libmagic1:amd64 (5.22+15-2+deb8u1, 5.22+15-2+deb8u3)
libncurses5:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
libncursesw5:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
libnettle4:amd64 (2.7.1-5+deb8u1, 2.7.1-5+deb8u2)
libnss-ldapd:amd64 (0.9.4-3+deb8u1, 0.9.4-3+deb8u2)
libpng12-0:amd64 (1.2.50-2+deb8u2, 1.2.50-2+deb8u3)
libpython2.7:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
libpython2.7-minimal:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
libpython2.7-stdlib:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
libruby2.1:amd64 (2.1.5-2+deb8u2, 2.1.5-2+deb8u3)
libsqlite3-0:amd64 (3.8.7.1-1+deb8u1, 3.8.7.1-1+deb8u2)
libss2:amd64 (1.42.12-1.1, 1.42.12-2+b1)
libsystemd0:amd64 (215-17+deb8u4, 215-17+deb8u7)
libtinfo5:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
libudev1:amd64 (215-17+deb8u4, 215-17+deb8u7)
libx11-6:amd64 (1.6.2-3, 1.6.2-3+deb8u1)
libx11-data:amd64 (1.6.2-3, 1.6.2-3+deb8u1)
libxfixes3:amd64 (5.0.1-2+b2, 5.0.1-2+deb8u1)
libxi6:amd64 (1.7.4-1+b2, 1.7.4-1+deb8u1)
libxrandr2:amd64 (1.4.2-1+b1, 1.4.2-1+deb8u1)
libxslt1.1:amd64 (1.1.28-2+deb8u2, 1.1.28-2+deb8u3)
ncurses-base:amd64 (5.9+20140913-1, 5.9+20140913-1+deb8u2)
ncurses-bin:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
ncurses-term:amd64 (5.9+20140913-1, 5.9+20140913-1+deb8u2)
openssh-client:amd64 (6.7p1-5+deb8u3, 6.7p1-5+deb8u4)
openssh-server:amd64 (6.7p1-5+deb8u3, 6.7p1-5+deb8u4)
openssh-sftp-server:amd64 (6.7p1-5+deb8u3, 6.7p1-5+deb8u4)
python2.7:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
python2.7-minimal:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
python-crypto:amd64 (2.6.1-5+b2, 2.6.1-5+deb8u1)
ruby2.1:amd64 (2.1.5-2+deb8u2, 2.1.5-2+deb8u3)
sed:amd64 (4.2.2-4+b1, 4.2.2-4+deb8u1)
sudo-ldap:amd64 (1.8.10p3-1+deb8u4, 1.8.10p3-1+deb8u5)
systemd:amd64 (215-17+deb8u4, 215-17+deb8u7)
systemd-sysv:amd64 (215-17+deb8u4, 215-17+deb8u7)
udev:amd64 (215-17+deb8u4, 215-17+deb8u7)
vim:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
vim-common:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
vim-runtime:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
vim-tiny:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
w3m:amd64 (0.5.3-19, 0.5.3-19+deb8u2)

During the upgrade, an issue occurred: on some servers (not all) there was a race for the dpkg lock between apt-upgrade and puppet.

Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. I had to manually kill dpkg and resume config using something like this (clush @ jessie):

sudo pkill dpkg ; sudo DEBIAN_FRONTEND=noninteractive dpkg --configure -a

The libnss-ldapd package has a debconf prompt, which was involved.

After configuration, I finished the upgrades in the nodes that were left behind due to the previous error, until all jessie nodes were upgraded.
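
For future runs, the two problems described above (the dpkg lock race with puppet and the stalled debconf prompts) could be avoided with something along these lines. This is only a sketch: the function names `run_cmd` and `safe_upgrade` and the `DRY_RUN` switch are made up for illustration, and it defaults to printing the commands instead of running them. Note that the `--force-confdef`/`--force-confold` options only cover dpkg conffile prompts; they would not by themselves stop a package's maintainer script from rewriting a file such as /etc/nsswitch.conf.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default) commands are printed, not executed.
run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: upgrade a jessie node without prompts or lock races.
safe_upgrade() {
    # Pause the Puppet agent so it cannot compete for the dpkg lock.
    run_cmd sudo puppet agent --disable "package upgrades in progress"
    # Finish any half-configured packages first, without debconf prompts.
    run_cmd sudo env DEBIAN_FRONTEND=noninteractive dpkg --configure -a
    # Upgrade, keeping existing conffiles instead of asking.
    run_cmd sudo env DEBIAN_FRONTEND=noninteractive apt-get -y \
        -o Dpkg::Options::=--force-confdef \
        -o Dpkg::Options::=--force-confold \
        upgrade
    # Let Puppet run again once dpkg is done.
    run_cmd sudo puppet agent --enable
}
```

Running `safe_upgrade` as-is only lists the four commands; `DRY_RUN=0 safe_upgrade` would execute them.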

dpkg.log:2018-03-06 13:10:08 upgrade libnss-ldapd:amd64 0.9.4-3+deb8u1 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:08 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:08 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:17 upgrade sudo-ldap:amd64 1.8.10p3-1+deb8u4 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:17 status half-configured sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status half-installed sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status half-installed sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:17 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:23 configure libnss-ldapd:amd64 0.9.4-3+deb8u2 <none>
dpkg.log:2018-03-06 13:10:23 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:23 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:24 status installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:34 configure sudo-ldap:amd64 1.8.10p3-1+deb8u5 <none>
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status half-configured sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status installed sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 15:21:14 upgrade libnss-ldapd:amd64 0.9.4-3+deb8u2 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 configure libnss-ldapd:amd64 0.9.4-3+deb8u2 <none>
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:15 status installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 upgrade libnss-ldapd:amd64 0.9.4-3+deb8u2 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:40 configure libnss-ldapd:amd64 0.9.4-3+deb8u2 <none>
dpkg.log:2018-03-06 15:21:40 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:40 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:40 status installed libnss-ldapd:amd64 0.9.4-3+deb8u2

The updates clobbered LDAP configs, and Puppet attempted to reset at least some of them, with some service restarts:

syslog:Mar  6 11:09:30 tools-worker-1011 puppet-agent[8829]: (/Stage[main]/Toollabs::Apt_pinning/Apt::Pin[toolforge-libpam-ldapd-pinning]/File[/etc/apt/preferences.d/toolforge_libpam_ldapd_pinning.pref]/ensure) defined content as '{md5}3a070faf67463002c3e503117405666b'
syslog:Mar  6 11:09:30 tools-worker-1011 puppet-agent[8829]: (/Stage[main]/Toollabs::Apt_pinning/Apt::Pin[toolforge-libpam-ldapd-pinning]/File[/etc/apt/preferences.d/toolforge_libpam_ldapd_pinning.pref]) Scheduling refresh of Exec[apt-get update]
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) --- /etc/nsswitch.conf#0112018-03-06 13:10:34.845544200 +0000
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) +++ /tmp/puppet-file20180306-2559-kbx4bz#0112018-03-06 13:23:36.880551596 +0000
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) @@ -17,5 +17,5 @@
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)  rpc:            db files
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)  netgroup:       ldap
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) +sudoers:        files ldap
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)  automount:      files ldap
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) -sudoers:#011files ldap
syslog:Mar  6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]) Filebucketed /etc/nsswitch.conf to puppet with sum 3cf257a629934a708bd2002b0f9f025b
syslog:Mar  6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) content changed '{md5}3cf257a629934a708bd2002b0f9f025b' to '{md5}5a11925d61bd1cec72b9b15f37e13f00'
syslog:Mar  6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]) Scheduling refresh of Service[nscd]
syslog:Mar  6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]) Scheduling refresh of Service[nslcd]
syslog:Mar  6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/Service[nscd]) Triggered 'refresh' from 1 events
syslog:Mar  6 13:23:38 tools-worker-1011 systemd[1]: Stopping LSB: LDAP connection daemon...
syslog:Mar  6 13:23:38 tools-worker-1011 nslcd[4055]: Stopping LDAP connection daemon: nslcd.
syslog:Mar  6 13:23:38 tools-worker-1011 systemd[1]: Stopped LSB: LDAP connection daemon.
syslog:Mar  6 13:23:38 tools-worker-1011 systemd[1]: Starting LSB: LDAP connection daemon...
syslog:Mar  6 13:23:43 tools-worker-1011 nslcd[4085]: Starting LDAP connection daemon: nslcd.
syslog:Mar  6 13:23:43 tools-worker-1011 systemd[1]: Started LSB: LDAP connection daemon.
syslog:Mar  6 13:23:43 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/Service[nslcd]) Triggered 'refresh' from 1 events
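
A quick way to compare what a node's nsswitch.conf says for a given database, independent of the tab-versus-spaces difference visible in the diff above, might look like the following. The helper name `nss_sources` is hypothetical, and this is just a sketch for spot checks, not what Puppet does.

```shell
#!/bin/sh
# Hypothetical helper: print the sources configured for one NSS database.
# $1: database name (e.g. sudoers, group); $2: file, default /etc/nsswitch.conf
nss_sources() {
    awk -v db="$1:" '
        # Match the line whose first field is "<database>:", drop that field,
        # and print the remaining sources separated by single spaces.
        $1 == db { $1 = ""; sub(/^[ \t]+/, ""); print; exit }
    ' "${2:-/etc/nsswitch.conf}"
}
```

For example, `nss_sources sudoers` on a healthy node should presumably print `files ldap`, matching the line Puppet put back.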
$ for h in $(seq 1001 1012); do ssh tools-worker-$h.tools.eqiad.wmflabs -- hostname -f; done
tools-worker-1001.tools.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs
tools-worker-1003.tools.eqiad.wmflabs
tools-worker-1004.tools.eqiad.wmflabs
tools-worker-1005.tools.eqiad.wmflabs
tools-worker-1006.tools.eqiad.wmflabs
tools-worker-1007.tools.eqiad.wmflabs
tools-worker-1008.tools.eqiad.wmflabs
tools-worker-1009.tools.eqiad.wmflabs
tools-worker-1010.tools.eqiad.wmflabs
bd808@tools-worker-1011.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
tools-worker-1012.tools.eqiad.wmflabs

tools-worker-1011 was having issues allowing non-root logins. I rebooted it:

# Post 1st reboot
root@tools-worker-1011:~# su - madhuvishy
su: Authentication failure
(Ignored)
groups: cannot find name for group ID 500

madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500 groups=500
  • Running puppet
madhuvishy@tools-worker-1011:~$ sudo puppet agent -tv
Info: Using configured environment 'future'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-worker-1011.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1520351636'
Notice: Applied catalog in 6.42 seconds

# Trying again. Same error on groups

madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500 groups=500

madhuvishy@tools-worker-1011:~$ exit
logout

root@tools-worker-1011:~# su - madhuvishy
su: Authentication failure
(Ignored)
groups: cannot find name for group ID 500
  • Restarting nslcd
madhuvishy@tools-worker-1011:~$ sudo service nslcd restart

# Didn't help either
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500 groups=500

madhuvishy@tools-worker-1011:~$ exit
logout

root@tools-worker-1011:~# su - madhuvishy
groups: cannot find name for group ID 500
  • Restarting nscd
madhuvishy@tools-worker-1011:~$ sudo service nscd restart

# Seems to have fixed
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500(wikidev) groups=1003(wmf),700(ops),50062(project-bastion),50068(project-editor-engagement),50090(project-analytics),50120(project-deployment-prep),50196(project-maps),50254(project-project-proxy),50270(project-reportcard),50302(project-testlabs),50340(project-wikistream),50380(project-tools),50408(project-contributors),50610(project-toolsbeta),50679(project-social-tools),50794(project-language),51697(project-design),51949(project-graphite),52308(project-phabricator),52622(project-ores),52651(project-wikilabels),52708(project-admin),52714(project-art-recs),52777(project-twl),52809(project-dashiki),52815(project-orch),52823(project-wikimetrics),52863(project-paws),52938(project-ores-staging),53013(project-git),53216(project-wikiapiary),53280(project-admin-monitoring),53454(project-pluggableauth),53516(project-webperf),53524(project-download),53572(project-aborrero-test),51051(tools.admin),52490(tools.ifttt),52500(tools.ifttt-dev),52896(tools.wiki-talk),53052(tools.readmore),53053(tools.detox),53324(tools.data-design-demo),53349(tools.openstack-browser),53599(tools.security),500(wikidev)
  • Rebooting again to verify
# Post second reboot - still looks fine
root@tools-worker-1011:~# su - madhuvishy

madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500(wikidev) groups=1003(wmf),700(ops),50062(project-bastion),50068(project-editor-engagement),50090(project-analytics),50120(project-deployment-prep),50196(project-maps),50254(project-project-proxy),50270(project-reportcard),50302(project-testlabs),50340(project-wikistream),50380(project-tools),50408(project-contributors),50610(project-toolsbeta),50679(project-social-tools),50794(project-language),51697(project-design),51949(project-graphite),52308(project-phabricator),52622(project-ores),52651(project-wikilabels),52708(project-admin),52714(project-art-recs),52777(project-twl),52809(project-dashiki),52815(project-orch),52823(project-wikimetrics),52863(project-paws),52938(project-ores-staging),53013(project-git),53216(project-wikiapiary),53280(project-admin-monitoring),53454(project-pluggableauth),53516(project-webperf),53524(project-download),53572(project-aborrero-test),51051(tools.admin),52490(tools.ifttt),52500(tools.ifttt-dev),52896(tools.wiki-talk),53052(tools.readmore),53053(tools.detox),53324(tools.data-design-demo),53349(tools.openstack-browser),53599(tools.security),500(wikidev)

Testing to which jessie machines I can SSH and see my home:

arturo@endurance:~ 37s 130 $ for i in $(cat file.txt) ; do ssh $i -- cat .bashrc >/dev/null && echo $i good || echo $i failed ; done
tools-worker-1001.tools.eqiad.wmflabs good
tools-worker-1002.tools.eqiad.wmflabs good
tools-worker-1003.tools.eqiad.wmflabs good
tools-worker-1004.tools.eqiad.wmflabs good
tools-worker-1005.tools.eqiad.wmflabs good
tools-worker-1008.tools.eqiad.wmflabs good
tools-worker-1006.tools.eqiad.wmflabs good
tools-worker-1007.tools.eqiad.wmflabs good
tools-worker-1009.tools.eqiad.wmflabs good
tools-worker-1012.tools.eqiad.wmflabs good
tools-worker-1011.tools.eqiad.wmflabs good
tools-worker-1010.tools.eqiad.wmflabs good
tools-worker-1014.tools.eqiad.wmflabs good
tools-worker-1013.tools.eqiad.wmflabs good
tools-worker-1017.tools.eqiad.wmflabs good
tools-worker-1016.tools.eqiad.wmflabs good
tools-worker-1015.tools.eqiad.wmflabs good
tools-worker-1018.tools.eqiad.wmflabs good
tools-worker-1019.tools.eqiad.wmflabs good
tools-worker-1020.tools.eqiad.wmflabs good
tools-worker-1021.tools.eqiad.wmflabs good
tools-worker-1022.tools.eqiad.wmflabs good
tools-worker-1023.tools.eqiad.wmflabs good
tools-worker-1025.tools.eqiad.wmflabs good
tools-worker-1026.tools.eqiad.wmflabs good
tools-worker-1027.tools.eqiad.wmflabs good
tools-clushmaster-01.tools.eqiad.wmflabs good
tools-docker-builder-05.tools.eqiad.wmflabs good
tools-docker-registry-01.tools.eqiad.wmflabs good
tools-prometheus-01.tools.eqiad.wmflabs good

bash: /home/aborrero/.bashrc: Operation not permitted
cat: .bashrc: Operation not permitted
tools-docker-registry-02.tools.eqiad.wmflabs failed

tools-elastic-01.tools.eqiad.wmflabs good
tools-prometheus-02.tools.eqiad.wmflabs good
tools-proxy-01.tools.eqiad.wmflabs good
tools-proxy-02.tools.eqiad.wmflabs good
tools-elastic-03.tools.eqiad.wmflabs good
tools-elastic-02.tools.eqiad.wmflabs good
tools-redis-1001.tools.eqiad.wmflabs good
tools-redis-1002.tools.eqiad.wmflabs good
tools-flannel-etcd-01.tools.eqiad.wmflabs good
tools-flannel-etcd-02.tools.eqiad.wmflabs good
tools-flannel-etcd-03.tools.eqiad.wmflabs good
tools-logs-02.tools.eqiad.wmflabs good
tools-k8s-etcd-01.tools.eqiad.wmflabs good
tools-k8s-etcd-02.tools.eqiad.wmflabs good
tools-k8s-etcd-03.tools.eqiad.wmflabs good
tools-package-builder-01.tools.eqiad.wmflabs good
tools-k8s-master-01.tools.eqiad.wmflabs good
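
The loop above could be made a little more defensive (no interactive prompts, a connect timeout, no stdin stealing by ssh). A sketch, where `check_hosts` and `ssh_home_probe` are names invented here and the 10 second timeout is an arbitrary choice:

```shell
#!/bin/sh
# Hypothetical probe: can we read a file in our NFS home on the given host?
ssh_home_probe() {
    # -n keeps ssh from swallowing the host list on stdin;
    # BatchMode fails fast instead of prompting for a password.
    ssh -n -o BatchMode=yes -o ConnectTimeout=10 "$1" -- cat .bashrc
}

# check_hosts FILE PROBE...: run PROBE against every host listed in FILE.
check_hosts() {
    hosts_file=$1
    shift
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue          # skip blank lines
        if "$@" "$host" >/dev/null 2>&1; then
            echo "$host good"
        else
            echo "$host failed"
        fi
    done < "$hosts_file"
}
```

Usage would be `check_hosts file.txt ssh_home_probe`, which prints one `good`/`failed` line per host in the same format as the output above.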

The list of jessie nodes:

arturo@endurance:~ 1m2s $ cat file.txt 
tools-worker-1001.tools.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs
tools-worker-1003.tools.eqiad.wmflabs
tools-worker-1004.tools.eqiad.wmflabs
tools-worker-1005.tools.eqiad.wmflabs
tools-worker-1008.tools.eqiad.wmflabs
tools-worker-1006.tools.eqiad.wmflabs
tools-worker-1007.tools.eqiad.wmflabs
tools-worker-1009.tools.eqiad.wmflabs
tools-worker-1012.tools.eqiad.wmflabs
tools-worker-1011.tools.eqiad.wmflabs
tools-worker-1010.tools.eqiad.wmflabs
tools-worker-1014.tools.eqiad.wmflabs
tools-worker-1013.tools.eqiad.wmflabs
tools-worker-1017.tools.eqiad.wmflabs
tools-worker-1016.tools.eqiad.wmflabs
tools-worker-1015.tools.eqiad.wmflabs
tools-worker-1018.tools.eqiad.wmflabs
tools-worker-1019.tools.eqiad.wmflabs
tools-worker-1020.tools.eqiad.wmflabs
tools-worker-1021.tools.eqiad.wmflabs
tools-worker-1022.tools.eqiad.wmflabs
tools-worker-1023.tools.eqiad.wmflabs
tools-worker-1025.tools.eqiad.wmflabs
tools-worker-1026.tools.eqiad.wmflabs
tools-worker-1027.tools.eqiad.wmflabs
tools-clushmaster-01.tools.eqiad.wmflabs
tools-docker-builder-05.tools.eqiad.wmflabs
tools-docker-registry-01.tools.eqiad.wmflabs
tools-prometheus-01.tools.eqiad.wmflabs
tools-docker-registry-02.tools.eqiad.wmflabs
tools-elastic-01.tools.eqiad.wmflabs
tools-prometheus-02.tools.eqiad.wmflabs
tools-proxy-01.tools.eqiad.wmflabs
tools-proxy-02.tools.eqiad.wmflabs
tools-elastic-03.tools.eqiad.wmflabs
tools-elastic-02.tools.eqiad.wmflabs
tools-redis-1001.tools.eqiad.wmflabs
tools-redis-1002.tools.eqiad.wmflabs
tools-flannel-etcd-01.tools.eqiad.wmflabs
tools-flannel-etcd-02.tools.eqiad.wmflabs
tools-flannel-etcd-03.tools.eqiad.wmflabs
tools-logs-02.tools.eqiad.wmflabs
tools-k8s-etcd-01.tools.eqiad.wmflabs
tools-k8s-etcd-02.tools.eqiad.wmflabs
tools-k8s-etcd-03.tools.eqiad.wmflabs
tools-package-builder-01.tools.eqiad.wmflabs
tools-k8s-master-01.tools.eqiad.wmflabs

I will flush the nscd cache in tools-docker-registry-02


The multiple nscd restarts may be a red herring here. Our /etc/nscd.conf config specifies a 60 second negative TTL for cached group lookups. It could have all been wall clock for nscd to decide to talk to LDAP again.
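
If this is negative caching, one way to tell would be to look for group IDs that `id` prints without a name, then invalidate only the group cache and check again. A rough sketch; the helper name `unresolved_gids` is made up, and it only parses one line of `id` output:

```shell
#!/bin/sh
# Hypothetical helper: list numeric GIDs that `id` could not map to a name.
# $1: one line of `id` output, e.g. "uid=11511(madhuvishy) gid=500 groups=500"
unresolved_gids() {
    printf '%s\n' "$1" \
        | sed -n 's/.* groups=\(.*\)$/\1/p' \
        | tr ',' '\n' \
        | grep -E '^[0-9]+$' || true   # resolved entries look like 500(wikidev)
}

# Possible use on an affected node (not run here):
#   unresolved_gids "$(id madhuvishy)"
#   sudo nscd -i group      # drop the cached group lookups, negative ones included
#   unresolved_gids "$(id madhuvishy)"
```

An empty result after `nscd -i group` would point at the cache; a non-empty one would suggest that nslcd or LDAP itself is still unhappy.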

Mentioned in SAL (#wikimedia-cloud) [2018-03-06T16:15:01Z] <madhuvishy> Reboot tools-docker-registry-02 T189018

Tested again, after madhu's operations:

⏚ arturo@endurance:~ 3m7s $ for i in $(cat file.txt) ; do ssh $i -- cat .bashrc >/dev/null && echo $i good || echo $i failed ; done
tools-worker-1001.tools.eqiad.wmflabs good
tools-worker-1002.tools.eqiad.wmflabs good
tools-worker-1003.tools.eqiad.wmflabs good
tools-worker-1004.tools.eqiad.wmflabs good
tools-worker-1005.tools.eqiad.wmflabs good
tools-worker-1008.tools.eqiad.wmflabs good
tools-worker-1006.tools.eqiad.wmflabs good
tools-worker-1007.tools.eqiad.wmflabs good
tools-worker-1009.tools.eqiad.wmflabs good
tools-worker-1012.tools.eqiad.wmflabs good
tools-worker-1011.tools.eqiad.wmflabs good
tools-worker-1010.tools.eqiad.wmflabs good
tools-worker-1014.tools.eqiad.wmflabs good
tools-worker-1013.tools.eqiad.wmflabs good
tools-worker-1017.tools.eqiad.wmflabs good
tools-worker-1016.tools.eqiad.wmflabs good
tools-worker-1015.tools.eqiad.wmflabs good
tools-worker-1018.tools.eqiad.wmflabs good
tools-worker-1019.tools.eqiad.wmflabs good
tools-worker-1020.tools.eqiad.wmflabs good
tools-worker-1021.tools.eqiad.wmflabs good
tools-worker-1022.tools.eqiad.wmflabs good
tools-worker-1023.tools.eqiad.wmflabs good
tools-worker-1025.tools.eqiad.wmflabs good
tools-worker-1026.tools.eqiad.wmflabs good
tools-worker-1027.tools.eqiad.wmflabs good
tools-clushmaster-01.tools.eqiad.wmflabs good
tools-docker-builder-05.tools.eqiad.wmflabs good
tools-docker-registry-01.tools.eqiad.wmflabs good
tools-prometheus-01.tools.eqiad.wmflabs good
tools-docker-registry-02.tools.eqiad.wmflabs good
tools-elastic-01.tools.eqiad.wmflabs good
tools-prometheus-02.tools.eqiad.wmflabs good
tools-proxy-01.tools.eqiad.wmflabs good
tools-proxy-02.tools.eqiad.wmflabs good
tools-elastic-03.tools.eqiad.wmflabs good
tools-elastic-02.tools.eqiad.wmflabs good
tools-redis-1001.tools.eqiad.wmflabs good
tools-redis-1002.tools.eqiad.wmflabs good
tools-flannel-etcd-01.tools.eqiad.wmflabs good
tools-flannel-etcd-02.tools.eqiad.wmflabs good
tools-flannel-etcd-03.tools.eqiad.wmflabs good
tools-logs-02.tools.eqiad.wmflabs good
tools-k8s-etcd-01.tools.eqiad.wmflabs good
tools-k8s-etcd-02.tools.eqiad.wmflabs good
tools-k8s-etcd-03.tools.eqiad.wmflabs good
tools-package-builder-01.tools.eqiad.wmflabs good
tools-k8s-master-01.tools.eqiad.wmflabs good

TLDR: all seems fine

We think things are settled now. The issue appears to have been LDAP configs changing when packages replaced files (like /etc/nsswitch.conf), which broke LDAP. Puppet then restored the files, as far as we can tell, and restarted services. But this did not fix things, probably due to some negative caching on nscd's part and possibly something else we have not identified.

We are waiting to see if more symptoms emerge. @aborrero is going to look at pinning LDAP/NFS/PAM packages, as these are dangerous for us.
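
For reference, an apt pin is a three-line stanza in a file under /etc/apt/preferences.d/. The sketch below only generates such a stanza; the helper name `apt_pin_stanza`, the default priority of 1001, and the package/version in the usage note are illustrative assumptions, not what the actual Puppet patch contains.

```shell
#!/bin/sh
# Hypothetical helper: print an apt_preferences(5) stanza pinning one package.
# $1: package name; $2: version to hold; $3: priority (default 1001, which
# makes apt prefer this version even over newer candidates).
apt_pin_stanza() {
    printf 'Package: %s\nPin: version %s\nPin-Priority: %s\n' \
        "$1" "$2" "${3:-1001}"
}
```

For example, `apt_pin_stanza libnss-ldapd 0.9.4-3+deb8u2` prints a stanza that could be dropped into a .pref file; `apt-cache policy libnss-ldapd` would then show whether the pin is in effect.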

chasemp renamed this task from Instances (maybe only Jessie?) are having issues with NFS/LDAP to Toolforge Instances (maybe only Jessie?) are having issues with NFS/LDAP. Mar 6 2018, 8:24 PM

Change 416934 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: apt_pinning: pin more nss/ldap/pam packages

https://gerrit.wikimedia.org/r/416934

Change 416934 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: apt_pinning: pin more nss/ldap/pam packages

https://gerrit.wikimedia.org/r/416934

Change 416943 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: apt_pinning: add NFS package pinning

https://gerrit.wikimedia.org/r/416943

Change 416943 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: apt_pinning: add NFS package pinning

https://gerrit.wikimedia.org/r/416943

Change 416988 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrades: avoid debconf prompts

https://gerrit.wikimedia.org/r/416988

Change 416988 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrades: avoid debconf prompts

https://gerrit.wikimedia.org/r/416988

aborrero claimed this task.

I think we can close this.

tools-worker-1008 is having this issue on scratch mount right now:

root@tools-worker-1008:~# head -c 1 /data/project/video2commons/video2commons/frontend/static/uploads/0041ea2a-0b78-11e8-bde7-0242c0a8c902 
head: cannot open ‘/data/project/video2commons/video2commons/frontend/static/uploads/0041ea2a-0b78-11e8-bde7-0242c0a8c902’ for reading: Operation not permitted
root@tools-worker-1008:~# ls /data/project/video2commons/video2commons/frontend/static/uploads -l
lrwxrwxrwx 1 tools.video2commons tools.video2commons 36 Dec 26  2016 /data/project/video2commons/video2commons/frontend/static/uploads -> /data/scratch/video2commons/uploads/

(a user of v2c bugged me about it)

Looks like all kinds of instances, whether trusty, jessie, or stretch, can be affected for the scratch mount: P6836 P6837

Same message in dmesg as the time when I first observed this error on the video project.

root@tools-bastion-05:~# dmesg | tail
[...]
[4751186.193719] NFS: Server labstore1003.eqiad.wmnet reports our clientid is in use
[4751186.193845] NFS: state manager: lease expired failed on NFSv4 server labstore1003.eqiad.wmnet with error 1

Checked, all /home mounts are okay.
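
To find which hosts show the same kernel message, the server name can be pulled out of dmesg with a small filter. A sketch, with `nfs_clientid_conflicts` being a name invented here:

```shell
#!/bin/sh
# Hypothetical filter: read kernel log lines on stdin and print each NFS
# server that reported "our clientid is in use", once per server.
nfs_clientid_conflicts() {
    sed -n 's/.*NFS: Server \([^ ]*\) reports our clientid is in use.*/\1/p' \
        | sort -u
}

# Possible use (not run here):
#   dmesg | nfs_clientid_conflicts
```

An empty result just means the message is not in the current kernel ring buffer; it does not prove the mount is healthy.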

Previously this was LDAP libraries being changed, I believe. The current behavior is seemingly different:

root@tools-bastion-05:~# cat /data/scratch/testfile
cat: /data/scratch/testfile: Operation not permitted
root@tools-bastion-05:~#
root@tools-bastion-05:~# df -Th
Filesystem                                               Type      Size  Used Avail Use% Mounted on
udev                                                     devtmpfs  3.9G   12K  3.9G   1% /dev
tmpfs                                                    tmpfs     799M  428K  799M   1% /run
/dev/vda1                                                ext4       18G   12G  5.8G  66% /
none                                                     tmpfs     4.0K     0  4.0K   0% /sys/fs/cgroup
none                                                     tmpfs     5.0M     0  5.0M   0% /run/lock
none                                                     tmpfs     3.9G     0  3.9G   0% /run/shm
none                                                     tmpfs     100M     0  100M   0% /run/user
labstore1003.eqiad.wmnet:/scratch                        nfs4      3.0T  684G  2.2T  24% /mnt/nfs/labstore1003-scratch
labstore1003.eqiad.wmnet:/dumps                          nfs4       28T   21T  7.8T  73% /public/dumps
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home    nfs4      8.0T  5.6T  2.1T  74% /mnt/nfs/labstore-secondary-tools-home
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project nfs4      8.0T  5.6T  2.1T  74% /mnt/nfs/labstore-secondary-tools-project
root@tools-bastion-05:~# umount -f /data/scratch
root@tools-bastion-05:~# df -Th
Filesystem                                               Type      Size  Used Avail Use% Mounted on
udev                                                     devtmpfs  3.9G   12K  3.9G   1% /dev
tmpfs                                                    tmpfs     799M  428K  799M   1% /run
/dev/vda1                                                ext4       18G   12G  5.8G  66% /
none                                                     tmpfs     4.0K     0  4.0K   0% /sys/fs/cgroup
none                                                     tmpfs     5.0M     0  5.0M   0% /run/lock
none                                                     tmpfs     3.9G     0  3.9G   0% /run/shm
none                                                     tmpfs     100M     0  100M   0% /run/user
labstore1003.eqiad.wmnet:/dumps                          nfs4       28T   21T  7.8T  73% /public/dumps
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home    nfs4      8.0T  5.6T  2.1T  74% /mnt/nfs/labstore-secondary-tools-home
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project nfs4      8.0T  5.6T  2.1T  74% /mnt/nfs/labstore-secondary-tools-project
root@tools-bastion-05:~# mount -a
mount.nfs: Operation not permitted
root@tools-bastion-05:~#
root@tools-bastion-05:~#
root@tools-bastion-05:~#
root@tools-bastion-05:~# cat /etc/fstab
# HEADER: This file was autogenerated at 2017-03-17 05:49:59 +0000
# HEADER: by puppet.  While it can still be managed manually, it
# HEADER: is definitely not recommended.
# /etc/fstab: static file system information.
# <file system>                                 <mount point>   <type>  <options>       <dump>  <pass>
proc	/proc	proc	defaults	0	0
/dev/vda1	/	ext4	defaults	0	0
/dev/vda2	swap	swap	defaults	0	0
labstore1003.eqiad.wmnet:/dumps	/public/dumps	nfs	vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,ro,soft,timeo=300,retrans=3	0	0
labstore1003.eqiad.wmnet:/scratch	/mnt/nfs/labstore1003-scratch	nfs	vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,soft,timeo=300,retrans=3,nosuid,noexec,nodev	0	0
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project	/mnt/nfs/labstore-secondary-tools-project	nfs	vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard	0	0
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home	/mnt/nfs/labstore-secondary-tools-home	nfs	vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard	0	0
root@tools-bastion-05:~# df -Th
Filesystem                                               Type      Size  Used Avail Use% Mounted on
udev                                                     devtmpfs  3.9G   12K  3.9G   1% /dev
tmpfs                                                    tmpfs     799M  428K  799M   1% /run
/dev/vda1                                                ext4       18G   12G  5.8G  66% /
none                                                     tmpfs     4.0K     0  4.0K   0% /sys/fs/cgroup
none                                                     tmpfs     5.0M     0  5.0M   0% /run/lock
none                                                     tmpfs     3.9G     0  3.9G   0% /run/shm
none                                                     tmpfs     100M     0  100M   0% /run/user
labstore1003.eqiad.wmnet:/dumps                          nfs4       28T   21T  7.8T  73% /public/dumps
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home    nfs4      8.0T  5.6T  2.1T  74% /mnt/nfs/labstore-secondary-tools-home
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project nfs4      8.0T  5.6T  2.1T  74% /mnt/nfs/labstore-secondary-tools-project
root@tools-bastion-05:~# mount -a
mount.nfs: Operation not permitted
root@tools-bastion-05:~# show
showconsolefont  showkey          showmount        showvirtualenv
root@tools-bastion-05:~# showmount -e labstore1003.eqiad.wmnet
Export list for labstore1003.eqiad.wmnet:
/srv/scratch    *
/srv/statistics *
/srv            *
/srv/maps       10.68.20.112,10.68.16.70,10.68.16.103,10.68.17.110,10.68.16.6
/srv/dumps      (everyone)
root@tools-bastion-05:~#

This seems to make no sense. Note that I can cat the same file from bastion-03:

root@tools-bastion-03:/data/scratch# cat testfile
testfile march 12

We restarted the NFS server on labstore1003, and that, together with umount and mount -a for scratch on the hosts, resolved this. This was definitely an issue related to the NFS server, in combination with various client behaviors in dealing with said issue.
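
Since a client may have more than one mount from the restarted server, it may help to list every fstab entry that points at it before remounting. A sketch; `nfs_mounts_for_server` is a hypothetical name and the remount loop in the comment is untested:

```shell
#!/bin/sh
# Hypothetical helper: print the mount points of all NFS entries in fstab
# that are served by the given host.
# $1: NFS server hostname; $2: fstab path (default /etc/fstab)
nfs_mounts_for_server() {
    awk -v prefix="$1:" '
        /^[ \t]*#/ { next }                                  # skip comments
        $3 ~ /^nfs/ && index($1, prefix) == 1 { print $2 }   # device starts with "server:"
    ' "${2:-/etc/fstab}"
}

# Possible use as root on an affected client (not run here):
#   for m in $(nfs_mounts_for_server labstore1003.eqiad.wmnet); do
#       umount -f "$m"; mount "$m"
#   done
```

With the fstab shown earlier, this would list both /public/dumps and /mnt/nfs/labstore1003-scratch for labstore1003.eqiad.wmnet.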

I was looking at dumps just now and...

tools.yifeibot@tools-bastion-02:~$ zcat /public/dumps/public/commonswiki/20180301/commonswiki-20180301-langlinks.sql.gz | head -c 1000
gzip: /public/dumps/public/commonswiki/20180301/commonswiki-20180301-langlinks.sql.gz: Operation not permitted

scratch is okay since the last remount. dumps is the other mount from labstore1003.eqiad.wmnet, and apparently I forgot about it. Remount dumps?

Mentioned in SAL (#wikimedia-cloud) [2018-03-20T08:28:26Z] <zhuyifei1999_> unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126