Related Objects
- Mentioned In
  - T190126: Unable to read dumps
  - T187193: toolforge: pin key packages by version
  - T188998: Connection Error at OrphanTalk Tools
- Mentioned Here
  - T190126: Unable to read dumps
  - P6836 (An Untitled Masterwork)
  - P6837 P6836 filtered
  - T188994: toolforge: package upgrades as part of the new workflow
  - T188998: Connection Error at OrphanTalk Tools
  - T189001: Global User Contributions complains about replica conf file
Event Timeline
Current situation: we drained and rebooted 1001-1012, but the problems reported in other tasks affected nodes outside of that range (1020). guc, which was restarted via 'webservice restart --backend=kubernetes', seems to be running fine now on 1019.
root@tools-worker-1004:~# su - madhuvishy
-su: /home/madhuvishy/.bash_profile: Operation not permitted
madhuvishy@tools-worker-1004:~$
The replag tool was also reporting that it could not read replica.my.cnf; a restart of the webservice seems to have brought it back online on 1007.
bd808> !log tools.orphantalk Restarting webservice (T188998)
chasemp> bd808: did that start working post restart and if so on what worker?
bd808> chasemp: yes, it is working now and its on ... tools-worker-1003.tools.eqiad.wmflabs
bd808> it was on tools-worker-1013.tools.eqiad.wmflabs before restart
chasemp> bd808: ok ack and that new one is one of the newly rebooted
Today I was doing package upgrades on jessie nodes in toolforge, all recorded in SAL and in T188994.
Upgraded packages are:
aborrero@tools-worker-1020:~$ tail /var/log/apt/history.log | grep Upgrade | sed s/"Upgrade: "//g | sed s/"), "/")"'\n'/g | sort
base-files:amd64 (8+deb8u5, 8+deb8u10)
bash:amd64 (4.3-11+b1, 4.3-11+deb8u1)
binutils:amd64 (2.25-5, 2.25-5+deb8u1)
ca-certificates:amd64 (20141019+deb8u1, 20141019+deb8u3)
dbus:amd64 (1.8.20-0+deb8u1, 1.8.22-0+deb8u1)
debconf:amd64 (1.5.56, 1.5.56+deb8u1)
debconf-i18n:amd64 (1.5.56, 1.5.56+deb8u1)
debian-archive-keyring:amd64 (2014.3, 2017.5~deb8u1)
e2fslibs:amd64 (1.42.12-1.1, 1.42.12-2+b1)
e2fsprogs:amd64 (1.42.12-1.1, 1.42.12-2+b1)
file:amd64 (5.22+15-2+deb8u1, 5.22+15-2+deb8u3)
gnupg2:amd64 (2.0.26-6, 2.0.26-6+deb8u1)
gnupg-agent:amd64 (2.0.26-6, 2.0.26-6+deb8u1)
initramfs-tools:amd64 (0.120+deb8u2, 0.120+deb8u3)
jq:amd64 (1.4-2.1, 1.4-2.1+deb8u1)
krb5-locales:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libcairo2:amd64 (1.14.0-2.1+deb8u1, 1.14.0-2.1+deb8u2)
libcairo-gobject2:amd64 (1.14.0-2.1+deb8u1, 1.14.0-2.1+deb8u2)
libc-ares2:amd64 (1.10.0-2+deb8u1, 1.10.0-2+deb8u2)
libcomerr2:amd64 (1.42.12-1.1, 1.42.12-2+b1)
libcups2:amd64 (1.7.5-11+deb8u1, 1.7.5-11+deb8u2)
libdb5.3:amd64 (5.3.28-9, 5.3.28-9+deb8u1)
libdbus-1-3:amd64 (1.8.20-0+deb8u1, 1.8.22-0+deb8u1)
libgnutls-deb0-28:amd64 (3.3.8-6+deb8u6, 3.3.8-6+deb8u7)
libgnutls-openssl27:amd64 (3.3.8-6+deb8u6, 3.3.8-6+deb8u7)
libgssapi-krb5-2:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libgtk2.0-0:amd64 (2.24.25-3+deb8u1, 2.24.25-3+deb8u2)
libgtk2.0-common:amd64 (2.24.25-3+deb8u1, 2.24.25-3+deb8u2)
libhogweed2:amd64 (2.7.1-5+deb8u1, 2.7.1-5+deb8u2)
libicu52:amd64 (52.1-8+deb8u5, 52.1-8+deb8u6)
libk5crypto3:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libkrb5-3:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libkrb5support0:amd64 (1.12.1+dfsg-19+deb8u2, 1.12.1+dfsg-19+deb8u4)
libltdl7:amd64 (2.4.2-1.11, 2.4.2-1.11+b1)
libmagic1:amd64 (5.22+15-2+deb8u1, 5.22+15-2+deb8u3)
libncurses5:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
libncursesw5:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
libnettle4:amd64 (2.7.1-5+deb8u1, 2.7.1-5+deb8u2)
libnss-ldapd:amd64 (0.9.4-3+deb8u1, 0.9.4-3+deb8u2)
libpng12-0:amd64 (1.2.50-2+deb8u2, 1.2.50-2+deb8u3)
libpython2.7:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
libpython2.7-minimal:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
libpython2.7-stdlib:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
libruby2.1:amd64 (2.1.5-2+deb8u2, 2.1.5-2+deb8u3)
libsqlite3-0:amd64 (3.8.7.1-1+deb8u1, 3.8.7.1-1+deb8u2)
libss2:amd64 (1.42.12-1.1, 1.42.12-2+b1)
libsystemd0:amd64 (215-17+deb8u4, 215-17+deb8u7)
libtinfo5:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
libudev1:amd64 (215-17+deb8u4, 215-17+deb8u7)
libx11-6:amd64 (1.6.2-3, 1.6.2-3+deb8u1)
libx11-data:amd64 (1.6.2-3, 1.6.2-3+deb8u1)
libxfixes3:amd64 (5.0.1-2+b2, 5.0.1-2+deb8u1)
libxi6:amd64 (1.7.4-1+b2, 1.7.4-1+deb8u1)
libxrandr2:amd64 (1.4.2-1+b1, 1.4.2-1+deb8u1)
libxslt1.1:amd64 (1.1.28-2+deb8u2, 1.1.28-2+deb8u3)
ncurses-base:amd64 (5.9+20140913-1, 5.9+20140913-1+deb8u2)
ncurses-bin:amd64 (5.9+20140913-1+b1, 5.9+20140913-1+deb8u2)
ncurses-term:amd64 (5.9+20140913-1, 5.9+20140913-1+deb8u2)
openssh-client:amd64 (6.7p1-5+deb8u3, 6.7p1-5+deb8u4)
openssh-server:amd64 (6.7p1-5+deb8u3, 6.7p1-5+deb8u4)
openssh-sftp-server:amd64 (6.7p1-5+deb8u3, 6.7p1-5+deb8u4)
python2.7:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
python2.7-minimal:amd64 (2.7.9-2, 2.7.9-2+deb8u1)
python-crypto:amd64 (2.6.1-5+b2, 2.6.1-5+deb8u1)
ruby2.1:amd64 (2.1.5-2+deb8u2, 2.1.5-2+deb8u3)
sed:amd64 (4.2.2-4+b1, 4.2.2-4+deb8u1)
sudo-ldap:amd64 (1.8.10p3-1+deb8u4, 1.8.10p3-1+deb8u5)
systemd:amd64 (215-17+deb8u4, 215-17+deb8u7)
systemd-sysv:amd64 (215-17+deb8u4, 215-17+deb8u7)
udev:amd64 (215-17+deb8u4, 215-17+deb8u7)
vim:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
vim-common:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
vim-runtime:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
vim-tiny:amd64 (7.4.488-7+deb8u2, 7.4.488-7+deb8u3)
w3m:amd64 (0.5.3-19, 0.5.3-19+deb8u2)
During the upgrade an issue occurred: on some servers (not all) there was a race for the dpkg lock between apt-upgrade and puppet.
Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts appeared and stalled the dpkg operations. I had to manually kill dpkg and resume the configuration using something like this (clush @ jessie):
sudo pkill dpkg ; sudo DEBIAN_FRONTEND=noninteractive dpkg --configure -a
The libnss-ldapd package has a debconf prompt, which was involved.
After configuration, I finished the upgrades in the nodes that were left behind due to the previous error, until all jessie nodes were upgraded.
dpkg.log:2018-03-06 13:10:08 upgrade libnss-ldapd:amd64 0.9.4-3+deb8u1 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:08 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u1
dpkg.log:2018-03-06 13:10:08 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:08 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:17 upgrade sudo-ldap:amd64 1.8.10p3-1+deb8u4 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:17 status half-configured sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status half-installed sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status half-installed sudo-ldap:amd64 1.8.10p3-1+deb8u4
dpkg.log:2018-03-06 13:10:17 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:17 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:23 configure libnss-ldapd:amd64 0.9.4-3+deb8u2 <none>
dpkg.log:2018-03-06 13:10:23 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:23 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:24 status installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 13:10:34 configure sudo-ldap:amd64 1.8.10p3-1+deb8u5 <none>
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status unpacked sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status half-configured sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 13:10:34 status installed sudo-ldap:amd64 1.8.10p3-1+deb8u5
dpkg.log:2018-03-06 15:21:14 upgrade libnss-ldapd:amd64 0.9.4-3+deb8u2 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 configure libnss-ldapd:amd64 0.9.4-3+deb8u2 <none>
dpkg.log:2018-03-06 15:21:14 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:14 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:15 status installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 upgrade libnss-ldapd:amd64 0.9.4-3+deb8u2 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status half-installed libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:39 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:40 configure libnss-ldapd:amd64 0.9.4-3+deb8u2 <none>
dpkg.log:2018-03-06 15:21:40 status unpacked libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:40 status half-configured libnss-ldapd:amd64 0.9.4-3+deb8u2
dpkg.log:2018-03-06 15:21:40 status installed libnss-ldapd:amd64 0.9.4-3+deb8u2
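For future upgrade rounds, the two problems described above (Puppet racing for the dpkg lock, and debconf prompts stalling dpkg) could be avoided together. The following is only a rough sketch, not the workflow that was actually used here: the function name safe_upgrade and the PUPPET/APT_GET override variables are made up for illustration so that the flow can be dry-run. Note that `puppet agent --disable` only prevents new agent runs; it does not interrupt a run that is already in progress.

```shell
#!/bin/sh
# Sketch: unattended upgrade with the Puppet agent disabled meanwhile.
# PUPPET and APT_GET are hypothetical override hooks for dry runs.
PUPPET="${PUPPET:-puppet}"
APT_GET="${APT_GET:-apt-get}"

safe_upgrade() {
    # Keep Puppet from starting new runs that would compete for the dpkg lock.
    $PUPPET agent --disable "package upgrades in progress" || return 1

    # noninteractive frontend: debconf does not prompt.
    # --force-confdef / --force-confold: keep existing conffiles without asking.
    DEBIAN_FRONTEND=noninteractive $APT_GET -y \
        -o Dpkg::Options::=--force-confdef \
        -o Dpkg::Options::=--force-confold \
        upgrade
    rc=$?

    # Re-enable Puppet even when the upgrade failed.
    $PUPPET agent --enable
    return $rc
}
```

Setting PUPPET="echo puppet" and APT_GET="echo apt-get" before calling safe_upgrade prints the commands instead of running them, which may be handy before sending this through clush.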
The updates clobbered LDAP configs, and Puppet has attempted to reset at least some of them, with some service restarts:
syslog:Mar 6 11:09:30 tools-worker-1011 puppet-agent[8829]: (/Stage[main]/Toollabs::Apt_pinning/Apt::Pin[toolforge-libpam-ldapd-pinning]/File[/etc/apt/preferences.d/toolforge_libpam_ldapd_pinning.pref]/ensure) defined content as '{md5}3a070faf67463002c3e503117405666b'
syslog:Mar 6 11:09:30 tools-worker-1011 puppet-agent[8829]: (/Stage[main]/Toollabs::Apt_pinning/Apt::Pin[toolforge-libpam-ldapd-pinning]/File[/etc/apt/preferences.d/toolforge_libpam_ldapd_pinning.pref]) Scheduling refresh of Exec[apt-get update]
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) --- /etc/nsswitch.conf#0112018-03-06 13:10:34.845544200 +0000
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) +++ /tmp/puppet-file20180306-2559-kbx4bz#0112018-03-06 13:23:36.880551596 +0000
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) @@ -17,5 +17,5 @@
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) rpc: db files
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content)
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) netgroup: ldap
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) +sudoers: files ldap
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) automount: files ldap
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) -sudoers:#011files ldap
syslog:Mar 6 13:23:36 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]) Filebucketed /etc/nsswitch.conf to puppet with sum 3cf257a629934a708bd2002b0f9f025b
syslog:Mar 6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]/content) content changed '{md5}3cf257a629934a708bd2002b0f9f025b' to '{md5}5a11925d61bd1cec72b9b15f37e13f00'
syslog:Mar 6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]) Scheduling refresh of Service[nscd]
syslog:Mar 6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/File[/etc/nsswitch.conf]) Scheduling refresh of Service[nslcd]
syslog:Mar 6 13:23:37 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/Service[nscd]) Triggered 'refresh' from 1 events
syslog:Mar 6 13:23:38 tools-worker-1011 systemd[1]: Stopping LSB: LDAP connection daemon...
syslog:Mar 6 13:23:38 tools-worker-1011 nslcd[4055]: Stopping LDAP connection daemon: nslcd.
syslog:Mar 6 13:23:38 tools-worker-1011 systemd[1]: Stopped LSB: LDAP connection daemon.
syslog:Mar 6 13:23:38 tools-worker-1011 systemd[1]: Starting LSB: LDAP connection daemon...
syslog:Mar 6 13:23:43 tools-worker-1011 nslcd[4085]: Starting LDAP connection daemon: nslcd.
syslog:Mar 6 13:23:43 tools-worker-1011 systemd[1]: Started LSB: LDAP connection daemon.
syslog:Mar 6 13:23:43 tools-worker-1011 puppet-agent[2559]: (/Stage[main]/Ldap::Client::Nss/Service[nslcd]) Triggered 'refresh' from 1 events
$ for h in $(seq 1001 1012); do ssh tools-worker-$h.tools.eqiad.wmflabs -- hostname -f; done
tools-worker-1001.tools.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs
tools-worker-1003.tools.eqiad.wmflabs
tools-worker-1004.tools.eqiad.wmflabs
tools-worker-1005.tools.eqiad.wmflabs
tools-worker-1006.tools.eqiad.wmflabs
tools-worker-1007.tools.eqiad.wmflabs
tools-worker-1008.tools.eqiad.wmflabs
tools-worker-1009.tools.eqiad.wmflabs
tools-worker-1010.tools.eqiad.wmflabs
bd808@tools-worker-1011.tools.eqiad.wmflabs: Permission denied (publickey,hostbased).
tools-worker-1012.tools.eqiad.wmflabs
tools-worker-1011 was having issues allowing non-root logins. I rebooted it:
# Post 1st reboot
root@tools-worker-1011:~# su - madhuvishy
su: Authentication failure (Ignored)
groups: cannot find name for group ID 500
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500 groups=500
- Running puppet
madhuvishy@tools-worker-1011:~$ sudo puppet agent -tv
Info: Using configured environment 'future'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-worker-1011.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1520351636'
Notice: Applied catalog in 6.42 seconds
# Trying again. Same error on groups
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500 groups=500
madhuvishy@tools-worker-1011:~$ exit
logout
root@tools-worker-1011:~# su - madhuvishy
su: Authentication failure (Ignored)
groups: cannot find name for group ID 500
- Restarting nslcd
madhuvishy@tools-worker-1011:~$ sudo service nslcd restart
# Didn't help either
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500 groups=500
madhuvishy@tools-worker-1011:~$ exit
logout
root@tools-worker-1011:~# su - madhuvishy
groups: cannot find name for group ID 500
- Restarting nscd
madhuvishy@tools-worker-1011:~$ sudo service nscd restart
# Seems to have fixed
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500(wikidev) groups=1003(wmf),700(ops),50062(project-bastion),50068(project-editor-engagement),50090(project-analytics),50120(project-deployment-prep),50196(project-maps),50254(project-project-proxy),50270(project-reportcard),50302(project-testlabs),50340(project-wikistream),50380(project-tools),50408(project-contributors),50610(project-toolsbeta),50679(project-social-tools),50794(project-language),51697(project-design),51949(project-graphite),52308(project-phabricator),52622(project-ores),52651(project-wikilabels),52708(project-admin),52714(project-art-recs),52777(project-twl),52809(project-dashiki),52815(project-orch),52823(project-wikimetrics),52863(project-paws),52938(project-ores-staging),53013(project-git),53216(project-wikiapiary),53280(project-admin-monitoring),53454(project-pluggableauth),53516(project-webperf),53524(project-download),53572(project-aborrero-test),51051(tools.admin),52490(tools.ifttt),52500(tools.ifttt-dev),52896(tools.wiki-talk),53052(tools.readmore),53053(tools.detox),53324(tools.data-design-demo),53349(tools.openstack-browser),53599(tools.security),500(wikidev)
- Rebooting again to verify
# Post second reboot - still looks fine
root@tools-worker-1011:~# su - madhuvishy
madhuvishy@tools-worker-1011:~$ id madhuvishy
uid=11511(madhuvishy) gid=500(wikidev) groups=1003(wmf),700(ops),50062(project-bastion),50068(project-editor-engagement),50090(project-analytics),50120(project-deployment-prep),50196(project-maps),50254(project-project-proxy),50270(project-reportcard),50302(project-testlabs),50340(project-wikistream),50380(project-tools),50408(project-contributors),50610(project-toolsbeta),50679(project-social-tools),50794(project-language),51697(project-design),51949(project-graphite),52308(project-phabricator),52622(project-ores),52651(project-wikilabels),52708(project-admin),52714(project-art-recs),52777(project-twl),52809(project-dashiki),52815(project-orch),52823(project-wikimetrics),52863(project-paws),52938(project-ores-staging),53013(project-git),53216(project-wikiapiary),53280(project-admin-monitoring),53454(project-pluggableauth),53516(project-webperf),53524(project-download),53572(project-aborrero-test),51051(tools.admin),52490(tools.ifttt),52500(tools.ifttt-dev),52896(tools.wiki-talk),53052(tools.readmore),53053(tools.detox),53324(tools.data-design-demo),53349(tools.openstack-browser),53599(tools.security),500(wikidev)
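The "gid=500" without a name seen in the sessions above means the group lookup through NSS failed. A quick way to check this without switching users is getent, which goes through the same NSS path (including nscd, when it is running). The helper names below are hypothetical; this is a sketch, not a command that was run during the incident.

```shell
#!/bin/sh
# Sketch: succeed when the numeric GID resolves to a group name through NSS.
gid_resolves() {
    getent group "$1" >/dev/null
}

# Print each GID of USER that has no resolvable name.
unresolved_gids() {
    for gid in $(id -G "$1"); do
        gid_resolves "$gid" || echo "$gid"
    done
}
```

On a host in the broken state, `unresolved_gids madhuvishy` would be expected to print 500; on a healthy host it should print nothing.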
Testing which jessie machines I can SSH to and read my home on:
arturo@endurance:~ 37s 130 $ for i in $(cat file.txt) ; do ssh $i -- cat .bashrc >/dev/null && echo $i good || echo $i failed ; done
tools-worker-1001.tools.eqiad.wmflabs good
tools-worker-1002.tools.eqiad.wmflabs good
tools-worker-1003.tools.eqiad.wmflabs good
tools-worker-1004.tools.eqiad.wmflabs good
tools-worker-1005.tools.eqiad.wmflabs good
tools-worker-1008.tools.eqiad.wmflabs good
tools-worker-1006.tools.eqiad.wmflabs good
tools-worker-1007.tools.eqiad.wmflabs good
tools-worker-1009.tools.eqiad.wmflabs good
tools-worker-1012.tools.eqiad.wmflabs good
tools-worker-1011.tools.eqiad.wmflabs good
tools-worker-1010.tools.eqiad.wmflabs good
tools-worker-1014.tools.eqiad.wmflabs good
tools-worker-1013.tools.eqiad.wmflabs good
tools-worker-1017.tools.eqiad.wmflabs good
tools-worker-1016.tools.eqiad.wmflabs good
tools-worker-1015.tools.eqiad.wmflabs good
tools-worker-1018.tools.eqiad.wmflabs good
tools-worker-1019.tools.eqiad.wmflabs good
tools-worker-1020.tools.eqiad.wmflabs good
tools-worker-1021.tools.eqiad.wmflabs good
tools-worker-1022.tools.eqiad.wmflabs good
tools-worker-1023.tools.eqiad.wmflabs good
tools-worker-1025.tools.eqiad.wmflabs good
tools-worker-1026.tools.eqiad.wmflabs good
tools-worker-1027.tools.eqiad.wmflabs good
tools-clushmaster-01.tools.eqiad.wmflabs good
tools-docker-builder-05.tools.eqiad.wmflabs good
tools-docker-registry-01.tools.eqiad.wmflabs good
tools-prometheus-01.tools.eqiad.wmflabs good
bash: /home/aborrero/.bashrc: Operation not permitted
cat: .bashrc: Operation not permitted
tools-docker-registry-02.tools.eqiad.wmflabs failed
tools-elastic-01.tools.eqiad.wmflabs good
tools-prometheus-02.tools.eqiad.wmflabs good
tools-proxy-01.tools.eqiad.wmflabs good
tools-proxy-02.tools.eqiad.wmflabs good
tools-elastic-03.tools.eqiad.wmflabs good
tools-elastic-02.tools.eqiad.wmflabs good
tools-redis-1001.tools.eqiad.wmflabs good
tools-redis-1002.tools.eqiad.wmflabs good
tools-flannel-etcd-01.tools.eqiad.wmflabs good
tools-flannel-etcd-02.tools.eqiad.wmflabs good
tools-flannel-etcd-03.tools.eqiad.wmflabs good
tools-logs-02.tools.eqiad.wmflabs good
tools-k8s-etcd-01.tools.eqiad.wmflabs good
tools-k8s-etcd-02.tools.eqiad.wmflabs good
tools-k8s-etcd-03.tools.eqiad.wmflabs good
tools-package-builder-01.tools.eqiad.wmflabs good
tools-k8s-master-01.tools.eqiad.wmflabs good
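The same probe could be wrapped so that it never blocks on a password prompt or a hung host. This is a sketch under assumptions: probe_hosts and the SSH override variable are made up for illustration, and the 10 second ConnectTimeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: check that the home directory is readable over SSH on each host.
# SSH is a hypothetical override hook so the loop can be exercised offline.
SSH="${SSH:-ssh}"

probe_hosts() {
    # $1: file with one hostname per line.
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        # -n stops ssh from swallowing the rest of the host list on stdin;
        # BatchMode=yes makes ssh fail instead of prompting for credentials.
        if $SSH -n -o BatchMode=yes -o ConnectTimeout=10 "$host" -- cat .bashrc >/dev/null 2>&1; then
            echo "$host good"
        else
            echo "$host failed"
        fi
    done < "$1"
}
```

Running `probe_hosts file.txt | grep failed` would then list only the hosts that need attention.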
The list of jessie nodes:
arturo@endurance:~ 1m2s $ cat file.txt
tools-worker-1001.tools.eqiad.wmflabs
tools-worker-1002.tools.eqiad.wmflabs
tools-worker-1003.tools.eqiad.wmflabs
tools-worker-1004.tools.eqiad.wmflabs
tools-worker-1005.tools.eqiad.wmflabs
tools-worker-1008.tools.eqiad.wmflabs
tools-worker-1006.tools.eqiad.wmflabs
tools-worker-1007.tools.eqiad.wmflabs
tools-worker-1009.tools.eqiad.wmflabs
tools-worker-1012.tools.eqiad.wmflabs
tools-worker-1011.tools.eqiad.wmflabs
tools-worker-1010.tools.eqiad.wmflabs
tools-worker-1014.tools.eqiad.wmflabs
tools-worker-1013.tools.eqiad.wmflabs
tools-worker-1017.tools.eqiad.wmflabs
tools-worker-1016.tools.eqiad.wmflabs
tools-worker-1015.tools.eqiad.wmflabs
tools-worker-1018.tools.eqiad.wmflabs
tools-worker-1019.tools.eqiad.wmflabs
tools-worker-1020.tools.eqiad.wmflabs
tools-worker-1021.tools.eqiad.wmflabs
tools-worker-1022.tools.eqiad.wmflabs
tools-worker-1023.tools.eqiad.wmflabs
tools-worker-1025.tools.eqiad.wmflabs
tools-worker-1026.tools.eqiad.wmflabs
tools-worker-1027.tools.eqiad.wmflabs
tools-clushmaster-01.tools.eqiad.wmflabs
tools-docker-builder-05.tools.eqiad.wmflabs
tools-docker-registry-01.tools.eqiad.wmflabs
tools-prometheus-01.tools.eqiad.wmflabs
tools-docker-registry-02.tools.eqiad.wmflabs
tools-elastic-01.tools.eqiad.wmflabs
tools-prometheus-02.tools.eqiad.wmflabs
tools-proxy-01.tools.eqiad.wmflabs
tools-proxy-02.tools.eqiad.wmflabs
tools-elastic-03.tools.eqiad.wmflabs
tools-elastic-02.tools.eqiad.wmflabs
tools-redis-1001.tools.eqiad.wmflabs
tools-redis-1002.tools.eqiad.wmflabs
tools-flannel-etcd-01.tools.eqiad.wmflabs
tools-flannel-etcd-02.tools.eqiad.wmflabs
tools-flannel-etcd-03.tools.eqiad.wmflabs
tools-logs-02.tools.eqiad.wmflabs
tools-k8s-etcd-01.tools.eqiad.wmflabs
tools-k8s-etcd-02.tools.eqiad.wmflabs
tools-k8s-etcd-03.tools.eqiad.wmflabs
tools-package-builder-01.tools.eqiad.wmflabs
tools-k8s-master-01.tools.eqiad.wmflabs
I will flush the nscd cache in tools-docker-registry-02
The multiple nscd restarts may be a red herring here. Our /etc/nscd.conf config specifies a 60 second negative TTL for cached group lookups. It could all have been a matter of wall-clock time passing until nscd decided to talk to LDAP again.
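To tell these two explanations apart next time, one could read the configured negative TTL and then invalidate the cache tables explicitly, rather than restarting the daemon and waiting. A minimal sketch follows; the helper names are hypothetical, while `negative-time-to-live` is the nscd.conf directive and `nscd -i TABLE` (long form `--invalidate`) is the standard way to drop a cache table.

```shell
#!/bin/sh
# Sketch: print the negative TTL (in seconds) configured for a cache table.
# $1: table name (passwd, group, hosts, ...); $2: config path (optional).
nscd_negative_ttl() {
    awk -v t="$1" '$1 == "negative-time-to-live" && $2 == t { print $3 }' \
        "${2:-/etc/nscd.conf}"
}

# Drop all cached entries for the given tables (needs root).
nscd_flush() {
    for table in "$@"; do
        nscd -i "$table"
    done
}
```

With the config described above, `nscd_negative_ttl group` should print 60, and `nscd_flush passwd group` would clear both positive and negative entries immediately.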
Mentioned in SAL (#wikimedia-cloud) [2018-03-06T16:15:01Z] <madhuvishy> Reboot tools-docker-registry-02 T189018
Tested again, after madhu's operations:
⏚ arturo@endurance:~ 3m7s $ for i in $(cat file.txt) ; do ssh $i -- cat .bashrc >/dev/null && echo $i good || echo $i failed ; done
tools-worker-1001.tools.eqiad.wmflabs good
tools-worker-1002.tools.eqiad.wmflabs good
tools-worker-1003.tools.eqiad.wmflabs good
tools-worker-1004.tools.eqiad.wmflabs good
tools-worker-1005.tools.eqiad.wmflabs good
tools-worker-1008.tools.eqiad.wmflabs good
tools-worker-1006.tools.eqiad.wmflabs good
tools-worker-1007.tools.eqiad.wmflabs good
tools-worker-1009.tools.eqiad.wmflabs good
tools-worker-1012.tools.eqiad.wmflabs good
tools-worker-1011.tools.eqiad.wmflabs good
tools-worker-1010.tools.eqiad.wmflabs good
tools-worker-1014.tools.eqiad.wmflabs good
tools-worker-1013.tools.eqiad.wmflabs good
tools-worker-1017.tools.eqiad.wmflabs good
tools-worker-1016.tools.eqiad.wmflabs good
tools-worker-1015.tools.eqiad.wmflabs good
tools-worker-1018.tools.eqiad.wmflabs good
tools-worker-1019.tools.eqiad.wmflabs good
tools-worker-1020.tools.eqiad.wmflabs good
tools-worker-1021.tools.eqiad.wmflabs good
tools-worker-1022.tools.eqiad.wmflabs good
tools-worker-1023.tools.eqiad.wmflabs good
tools-worker-1025.tools.eqiad.wmflabs good
tools-worker-1026.tools.eqiad.wmflabs good
tools-worker-1027.tools.eqiad.wmflabs good
tools-clushmaster-01.tools.eqiad.wmflabs good
tools-docker-builder-05.tools.eqiad.wmflabs good
tools-docker-registry-01.tools.eqiad.wmflabs good
tools-prometheus-01.tools.eqiad.wmflabs good
tools-docker-registry-02.tools.eqiad.wmflabs good
tools-elastic-01.tools.eqiad.wmflabs good
tools-prometheus-02.tools.eqiad.wmflabs good
tools-proxy-01.tools.eqiad.wmflabs good
tools-proxy-02.tools.eqiad.wmflabs good
tools-elastic-03.tools.eqiad.wmflabs good
tools-elastic-02.tools.eqiad.wmflabs good
tools-redis-1001.tools.eqiad.wmflabs good
tools-redis-1002.tools.eqiad.wmflabs good
tools-flannel-etcd-01.tools.eqiad.wmflabs good
tools-flannel-etcd-02.tools.eqiad.wmflabs good
tools-flannel-etcd-03.tools.eqiad.wmflabs good
tools-logs-02.tools.eqiad.wmflabs good
tools-k8s-etcd-01.tools.eqiad.wmflabs good
tools-k8s-etcd-02.tools.eqiad.wmflabs good
tools-k8s-etcd-03.tools.eqiad.wmflabs good
tools-package-builder-01.tools.eqiad.wmflabs good
tools-k8s-master-01.tools.eqiad.wmflabs good
TLDR: all seems fine
We think things are settled now. The issue appears to have been package upgrades replacing LDAP config files (like /etc/nsswitch.conf), which broke LDAP. As far as we can tell, Puppet then restored the files and restarted services, but this did not fix things, probably due to some negative caching on nscd's part and also ?.
We are waiting to see if more symptoms emerge. @aborrero is going to look at pinning LDAP/NFS/PAM packages, as these are dangerous for us.
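For readers unfamiliar with apt pinning, a pin is a three-line stanza under /etc/apt/preferences.d/. The sketch below only illustrates the stanza format: render_pin is a hypothetical helper, the priority 1001 is an assumption (a priority of 1000 or more keeps the pinned version even if that means a downgrade), and the real pins are whatever the Puppet patches for this task define.

```shell
#!/bin/sh
# Sketch: print an apt preferences stanza holding PKG at VERSION.
# Priority 1001 is an illustrative choice, not taken from the Puppet patches.
render_pin() {
    pkg=$1
    version=$2
    printf 'Package: %s\nPin: version %s\nPin-Priority: 1001\n' "$pkg" "$version"
}
```

For example, `render_pin libnss-ldapd 0.9.4-3+deb8u1` prints a stanza for the pre-upgrade version seen in the apt history earlier in this task.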
Change 416934 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: apt_pinning: pin more nss/ldap/pam packages
Change 416934 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: apt_pinning: pin more nss/ldap/pam packages
Change 416943 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: apt_pinning: add NFS package pinning
Change 416943 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: apt_pinning: add NFS package pinning
Change 416988 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] apt: apt-upgrades: avoid debconf prompts
Change 416988 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] apt: apt-upgrades: avoid debconf prompts
tools-worker-1008 is having this issue on scratch mount right now:
root@tools-worker-1008:~# head -c 1 /data/project/video2commons/video2commons/frontend/static/uploads/0041ea2a-0b78-11e8-bde7-0242c0a8c902
head: cannot open ‘/data/project/video2commons/video2commons/frontend/static/uploads/0041ea2a-0b78-11e8-bde7-0242c0a8c902’ for reading: Operation not permitted
root@tools-worker-1008:~# ls /data/project/video2commons/video2commons/frontend/static/uploads -l
lrwxrwxrwx 1 tools.video2commons tools.video2commons 36 Dec 26 2016 /data/project/video2commons/video2commons/frontend/static/uploads -> /data/scratch/video2commons/uploads/
(a user of v2c bugged me about it)
Same message in dmesg as the time when I first observed this error on the video project.
root@tools-bastion-05:~# dmesg | tail
[...]
[4751186.193719] NFS: Server labstore1003.eqiad.wmnet reports our clientid is in use
[4751186.193845] NFS: state manager: lease expired failed on NFSv4 server labstore1003.eqiad.wmnet with error 1
Previously this was LDAP libraries being changed, I believe. The current behavior is seemingly different:
root@tools-bastion-05:~# cat /data/scratch/testfile
cat: /data/scratch/testfile: Operation not permitted
root@tools-bastion-05:~#
root@tools-bastion-05:~# df -Th
Filesystem Type Size Used Avail Use% Mounted on
udev devtmpfs 3.9G 12K 3.9G 1% /dev
tmpfs tmpfs 799M 428K 799M 1% /run
/dev/vda1 ext4 18G 12G 5.8G 66% /
none tmpfs 4.0K 0 4.0K 0% /sys/fs/cgroup
none tmpfs 5.0M 0 5.0M 0% /run/lock
none tmpfs 3.9G 0 3.9G 0% /run/shm
none tmpfs 100M 0 100M 0% /run/user
labstore1003.eqiad.wmnet:/scratch nfs4 3.0T 684G 2.2T 24% /mnt/nfs/labstore1003-scratch
labstore1003.eqiad.wmnet:/dumps nfs4 28T 21T 7.8T 73% /public/dumps
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home nfs4 8.0T 5.6T 2.1T 74% /mnt/nfs/labstore-secondary-tools-home
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project nfs4 8.0T 5.6T 2.1T 74% /mnt/nfs/labstore-secondary-tools-project
root@tools-bastion-05:~# umount -f /data/scratch
root@tools-bastion-05:~# df -Th
Filesystem Type Size Used Avail Use% Mounted on
udev devtmpfs 3.9G 12K 3.9G 1% /dev
tmpfs tmpfs 799M 428K 799M 1% /run
/dev/vda1 ext4 18G 12G 5.8G 66% /
none tmpfs 4.0K 0 4.0K 0% /sys/fs/cgroup
none tmpfs 5.0M 0 5.0M 0% /run/lock
none tmpfs 3.9G 0 3.9G 0% /run/shm
none tmpfs 100M 0 100M 0% /run/user
labstore1003.eqiad.wmnet:/dumps nfs4 28T 21T 7.8T 73% /public/dumps
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home nfs4 8.0T 5.6T 2.1T 74% /mnt/nfs/labstore-secondary-tools-home
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project nfs4 8.0T 5.6T 2.1T 74% /mnt/nfs/labstore-secondary-tools-project
root@tools-bastion-05:~# mount -a
mount.nfs: Operation not permitted
root@tools-bastion-05:~#
root@tools-bastion-05:~#
root@tools-bastion-05:~#
root@tools-bastion-05:~# cat /etc/fstab
# HEADER: This file was autogenerated at 2017-03-17 05:49:59 +0000
# HEADER: by puppet. While it can still be managed manually, it
# HEADER: is definitely not recommended.
# /etc/fstab: static file system information.
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/vda1 / ext4 defaults 0 0
/dev/vda2 swap swap defaults 0 0
labstore1003.eqiad.wmnet:/dumps /public/dumps nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,ro,soft,timeo=300,retrans=3 0 0
labstore1003.eqiad.wmnet:/scratch /mnt/nfs/labstore1003-scratch nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,soft,timeo=300,retrans=3,nosuid,noexec,nodev 0 0
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project /mnt/nfs/labstore-secondary-tools-project nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard 0 0
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home /mnt/nfs/labstore-secondary-tools-home nfs vers=4,bg,intr,sec=sys,proto=tcp,port=0,noatime,lookupcache=all,nofsc,rw,hard 0 0
root@tools-bastion-05:~# df -Th
Filesystem Type Size Used Avail Use% Mounted on
udev devtmpfs 3.9G 12K 3.9G 1% /dev
tmpfs tmpfs 799M 428K 799M 1% /run
/dev/vda1 ext4 18G 12G 5.8G 66% /
none tmpfs 4.0K 0 4.0K 0% /sys/fs/cgroup
none tmpfs 5.0M 0 5.0M 0% /run/lock
none tmpfs 3.9G 0 3.9G 0% /run/shm
none tmpfs 100M 0 100M 0% /run/user
labstore1003.eqiad.wmnet:/dumps nfs4 28T 21T 7.8T 73% /public/dumps
nfs-tools-project.svc.eqiad.wmnet:/project/tools/home nfs4 8.0T 5.6T 2.1T 74% /mnt/nfs/labstore-secondary-tools-home
nfs-tools-project.svc.eqiad.wmnet:/project/tools/project nfs4 8.0T 5.6T 2.1T 74% /mnt/nfs/labstore-secondary-tools-project
root@tools-bastion-05:~# mount -a
mount.nfs: Operation not permitted
root@tools-bastion-05:~# show
showconsolefont showkey showmount showvirtualenv
root@tools-bastion-05:~# showmount -e labstore1003.eqiad.wmnet
Export list for labstore1003.eqiad.wmnet:
/srv/scratch *
/srv/statistics *
/srv *
/srv/maps 10.68.20.112,10.68.16.70,10.68.16.103,10.68.17.110,10.68.16.6
/srv/dumps (everyone)
root@tools-bastion-05:~#
This seems to make no sense. Note that I can cat the same file from bastion-03:
root@tools-bastion-03:/data/scratch# cat testfile
testfile march 12
We restarted the NFS server on labstore1003, and that, together with umount and mount -a for scratch on the hosts, resolved this. It was definitely an issue related to the NFS server, in combination with various client behaviors in dealing with said issue.
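The client-side half of that fix could be scripted roughly as below, for use with clush once the server side is healthy. This is a sketch under assumptions: the function names and the MOUNT/UMOUNT override variables are hypothetical, and it uses `mount <mountpoint>` (which reads the options from /etc/fstab) instead of `mount -a`. As seen earlier in this task, the remount fails with "Operation not permitted" while the server is still in the bad state, so this alone would not have been enough.

```shell
#!/bin/sh
# Sketch: remount an NFS mount when a probe file on it cannot be read.
# MOUNT and UMOUNT are hypothetical override hooks for dry runs.
MOUNT="${MOUNT:-mount}"
UMOUNT="${UMOUNT:-umount}"

# Succeed when the first byte of FILE can be read.
can_read() {
    head -c 1 "$1" >/dev/null 2>&1
}

# $1: probe file on the mount; $2: mount point as listed in /etc/fstab.
remount_if_unreadable() {
    if can_read "$1"; then
        echo "$2 ok"
        return 0
    fi
    # Force the unmount, then mount again using the fstab entry.
    $UMOUNT -f "$2" && $MOUNT "$2"
}
```

A call such as `remount_if_unreadable /data/scratch/testfile /mnt/nfs/labstore1003-scratch` matches the paths from the session above; the probe file has to be non-empty and readable by the calling user on a healthy mount.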
I was looking at dumps just now and...
tools.yifeibot@tools-bastion-02:~$ zcat /public/dumps/public/commonswiki/20180301/commonswiki-20180301-langlinks.sql.gz | head -c 1000
gzip: /public/dumps/public/commonswiki/20180301/commonswiki-20180301-langlinks.sql.gz: Operation not permitted
Scratch has been okay since the last remount. Dumps is the other mount from labstore1003.eqiad.wmnet, and apparently I forgot about it. Remount dumps?
Mentioned in SAL (#wikimedia-cloud) [2018-03-20T08:28:26Z] <zhuyifei1999_> unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126