Page MenuHomePhabricator

Puppet not running on tools-webgrid-lighttpd-1207
Closed, ResolvedPublic

Description

PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]

The last run was last friday, according to the process accounting file:

puppet          |v3|    38.00|     8.00|    51.00|     0|     0| 95808.00|     0.00|     450      447|Fri Aug 12 20:05:57 2016
puppet          |v3|     3.00|     4.00|     8.00|     0|     0|  4396.00|     0.00|    2127      485|Fri Aug 12 20:07:43 2016
puppet          |v3|  1546.00|   401.00| 14259.00|     0|     0|192000.00|     0.00|     485      484|Fri Aug 12 20:06:03 2016
puppet-run      |v3|     0.00|     0.00| 14806.00|     0|     0| 36272.00|     0.00|     447      411|Fri Aug 12 20:05:57 2016
puppet-enabled  |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3544     3543|Wed Aug 17 08:14:11 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3543     3539|Wed Aug 17 08:14:11 2016
97-last-puppet- |v3|     0.00|     0.00|     1.00|     0|     0|  4396.00|     0.00|    3539     3532|Wed Aug 17 08:14:11 2016
puppet-enabled  |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3580     3579|Wed Aug 17 08:14:43 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3579     3575|Wed Aug 17 08:14:43 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3575     3568|Wed Aug 17 08:14:43 2016
puppet-enabled  |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3764     3763|Wed Aug 17 08:23:49 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3763     3759|Wed Aug 17 08:23:49 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3759     3752|Wed Aug 17 08:23:49 2016
puppet-enabled  |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3961     3960|Wed Aug 17 08:25:38 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3960     3956|Wed Aug 17 08:25:38 2016
97-last-puppet- |v3|     0.00|     0.00|     0.00|     0|     0|  4396.00|     0.00|    3956     3949|Wed Aug 17 08:25:38 2016

but puppet seems to run very irregularly:

puppet          |v3|  1275.00|   274.00| 11706.00|     0|     0|192640.00|     0.00|    9453     9452|Tue Aug  9 17:24:08 2016
puppet          |v3|    40.00|     6.00|    49.00|     0|     0| 95808.00|     0.00|   12053    12050|Tue Aug  9 17:54:02 2016
puppet          |v3|    14.00|     7.00|    22.00|     0|     0|  4396.00|     0.00|   13600    12090|Tue Aug  9 17:55:26 2016
puppet          |v3|  1222.00|   273.00| 10058.00|     0|     0|192640.00|     0.00|   12090    12089|Tue Aug  9 17:54:09 2016
puppet          |v3|    38.00|     9.00|    50.00|     0|     0| 95808.00|     0.00|   14733    14730|Tue Aug  9 18:24:01 2016
puppet          |v3|     4.00|     2.00|     7.00|     0|     0|  4396.00|     0.00|   16273    14760|Tue Aug  9 18:25:48 2016
puppet          |v3|  1222.00|   284.00| 12268.00|     0|     0|192192.00|     0.00|   14760    14759|Tue Aug  9 18:24:07 2016
puppet          |v3|    15.00|     8.00|    25.00|     0|     0|  4396.00|     0.00|   30221    28724|Thu Aug 11 18:07:43 2016
puppet          |v3|  1409.00|   374.00|  8695.00|     0|     0|189568.00|     0.00|   28724    28723|Thu Aug 11 18:06:40 2016
puppet          |v3|     2.00|     6.00|     9.00|     0|     0|  4396.00|     0.00|   27439    25940|Fri Aug 12 17:42:28 2016
puppet          |v3|  1407.00|   402.00|  8517.00|     0|     0|190208.00|     0.00|   25940    25938|Fri Aug 12 17:41:25 2016
puppet          |v3|   202.00|    13.00|   280.00|     0|     0|118528.00|     0.00|   28793    28792|Fri Aug 12 18:07:16 2016
puppet          |v3|    38.00|     8.00|    51.00|     0|     0| 95808.00|     0.00|     450      447|Fri Aug 12 20:05:57 2016
puppet          |v3|     3.00|     4.00|     8.00|     0|     0|  4396.00|     0.00|    2127      485|Fri Aug 12 20:07:43 2016
puppet          |v3|  1546.00|   401.00| 14259.00|     0|     0|192000.00|     0.00|     485      484|Fri Aug 12 20:06:03 2016

The last cron-triggered event seems to have been Tue Aug 9 18:24:07 2016, and the other runs are manual invocations:

Aug 11 18:06:39 tools-webgrid-lighttpd-1207 sudo: yuvipanda : TTY=unknown ; PWD=/home/yuvipanda ; USER=root ; COMMAND=/usr/bin/puppet agent -tv
Aug 12 17:41:25 tools-webgrid-lighttpd-1207 sudo: yuvipanda : TTY=unknown ; PWD=/home/yuvipanda ; USER=root ; COMMAND=/usr/bin/puppet agent -tv
Aug 12 18:07:15 tools-webgrid-lighttpd-1207 sudo: yuvipanda : TTY=unknown ; PWD=/home/yuvipanda ; USER=root ; COMMAND=/usr/bin/puppet agent -tv

Overall cron doesn't seem to be running -- the last

Aug  9 18:25:01 tools-webgrid-lighttpd-1207 CRON[14809]: pam_unix(cron:session): session closed for user root
Aug  9 18:26:09 tools-webgrid-lighttpd-1207 CRON[14728]: pam_unix(cron:session): session closed for user root

entries are again on the 9th.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Aha. Seems to be an ldap/networking issue, as around that time:

Aug  9 18:35:01 tools-webgrid-lighttpd-1207 CRON[17040]: Authentication failure
Aug  9 18:39:01 tools-webgrid-lighttpd-1207 CRON[17098]: Authentication failure
Aug  9 18:45:01 tools-webgrid-lighttpd-1207 CRON[17260]: Authentication failure
Aug  9 18:54:01 tools-webgrid-lighttpd-1207 CRON[17388]: Authentication failure

and a few days later

Aug 15 12:51:47 tools-webgrid-lighttpd-1207 nslcd[1077]: [8d2a0b] <group/member="puppet"> ldap_result() failed: Can't contact LDAP server
Aug 15 12:53:48 tools-webgrid-lighttpd-1207 nslcd[1077]: [8c9d33] <group/member="root"> ldap_result() failed: Can't contact LDAP server

these errors persisted until today. I've restarted cron:

service cron restart

but the issues persist:

Aug 17 08:39:01 tools-webgrid-lighttpd-1207 CRON[4613]: Authentication failure
Aug 17 08:41:50 tools-webgrid-lighttpd-1207 nslcd[1077]: [8fc78e] <group=550> error writing to client: Broken pipe
Aug 17 08:41:50 tools-webgrid-lighttpd-1207 nslcd[1077]: [0767e0] <group=550> error writing to client: Broken pipe

so instead I'm going to reschedule jobs on that host and reboot it.

But still:

Aug 17 08:45:33 tools-webgrid-lighttpd-1207 kernel: [    0.000000] Initializing cgroup subsys cpuset
...
Aug 17 08:45:33 tools-webgrid-lighttpd-1207 cron[930]: (CRON) INFO (pidfile fd = 3)
Aug 17 08:45:33 tools-webgrid-lighttpd-1207 cron[948]: (CRON) STARTUP (fork ok)
Aug 17 08:45:33 tools-webgrid-lighttpd-1207 cron[948]: (CRON) INFO (Running @reboot jobs)
Aug 17 08:45:33 tools-webgrid-lighttpd-1207 CRON[955]: Authentication failure
...
Aug 17 08:45:49 tools-webgrid-lighttpd-1207 nslcd[1088]: [16231b] <group=50380> error writing to client: Broken pipe
Aug 17 08:45:49 tools-webgrid-lighttpd-1207 nslcd[1088]: [16e9e8] <group=50380> error writing to client: Broken pipe

Time for a rebuild?

I have disabled all queues on this host for now.

Note: I ran puppet manually for T142932, so although the warning is gone (for now), the issue is not resolved!

The nslcd one is a red herring I think, has been happening on all hosts since the move to openldap (but I think @MoritzMuehlenhoff told us it doesn't matter?)

You mean the "error writing to client: Broken pipe" message? Yes, that's irritatiing, but benign. See here for more:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=685504
https://access.redhat.com/solutions/58684

We also have T78616 about this.

This seems to be related to the root password that was introduced a week or so ago. On hosts that are OK, /etc/shadow looks like this:

valhallasw@tools-webgrid-lighttpd-1206:~$ sudo head /etc/shadow
root:!:17024:0:99999:7:::

i.e. no password. On this host (and tools-exec-1213, which has the same issue), the first entry looks like this:

valhallasw@tools-webgrid-lighttpd-1207:~$ sudo head /etc/shadow
root:$6$[data]
:17022:0:99999:7:::
daemon:*:16469:0:99999:7:::

i.e. there is a newline there that breaks the /etc/passwd format. This means e.g. sudo su errors out, which likely also breaks cron.

This seems to be a remainder of a half-applied puppet change. The solution is simple -- remove the newline, and puppet will clean up the shadow file on the first run.

Mentioned in SAL [2016-08-30T18:54:00Z] <valhallasw`cloud> edited /etc/shadow on a range of hosts to fix https://phabricator.wikimedia.org/T143191

valhallasw claimed this task.

did this on:

  • tools-webgrid-lighttpd-1207
  • tools-webgrid-lighttpd-1208
  • tools-exec-1204
  • tools-exec-1211
  • tools-exec-1213

Those were all hosts with a high (>30 min) last-puppet-runtime.