
Investigate broken puppet on Beta Cluster
Closed, Resolved, Public

Description

Puppet on Beta Cluster broke sometime in the last week-ish; let's figure out where and fix it :)


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=67333

Details

Reference
bz67349

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:40 AM
bzimport set Reference to bz67349.

Report on beta puppet run status:

  • ssh deployment-salt.eqiad.wmflabs
  • sudo salt '*' cmd.run '(hostname; date -d @$(grep last_run /var/lib/puppet/state/last_run_summary.yaml | awk "{print \$2}") +%Y-%m-%dT%H:%M:%S; grep status: /var/lib/puppet/state/last_run_report.yaml | head -1 | awk "{print \$2}")|tr -s "[:space:]" "\t"' | grep deployment- | sort
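
For readability, here is a rough per-host sketch of what the one-liner above computes, written as a shell function. It is not what was actually run. The function name `puppet_run_status` and its optional state-directory argument are made up for illustration, and it assumes GNU `date` (for `-d @epoch`), as the one-liner does. The awk matches are slightly stricter than the original `grep` calls.

```shell
#!/bin/sh
# Print "<hostname>\t<last run time>\t<status>" for the local puppet agent.
# The optional first argument (state directory) is a hypothetical addition
# that makes the function easy to try against copies of the YAML files.
puppet_run_status() {
    state_dir="${1:-/var/lib/puppet/state}"

    # Epoch timestamp of the last run, from the "last_run:" key.
    epoch=$(awk '$1 == "last_run:" {print $2; exit}' \
        "$state_dir/last_run_summary.yaml")

    # First "status:" line in the report, like grep ... | head -1.
    status=$(awk '$1 == "status:" {print $2; exit}' \
        "$state_dir/last_run_report.yaml")

    # Convert the epoch to an ISO-style timestamp (GNU date assumed).
    when=$(date -d "@$epoch" +%Y-%m-%dT%H:%M:%S)

    printf '%s\t%s\t%s\n' "$(hostname)" "$when" "$status"
}
```

Calling `puppet_run_status` with no argument on a host reads the same two files as the salt command; passing a directory points it at copies of those files instead.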

Hosts currently failing:

  • deployment-bastion
  • deployment-graphite
  • deployment-jobrunner01
  • deployment-logstash1
  • deployment-videoscaler01

Several of the failures reported above were caused by the local hacks that had been put in place to get puppet to run on deployment-apache0[12]. I have stashed those changes and am forcing a puppet run across the cluster to get a more accurate report.

With the local hacks removed and a bad lock file deleted on deployment-jobrunner01, all hosts in beta are now reporting their latest puppet run as successful:

deployment-analytics01      2014-07-01T23:07:07     success
deployment-apache01         2014-07-01T23:09:50     success
deployment-apache02         2014-07-01T23:10:51     success
deployment-bastion          2014-07-01T23:17:13     success
deployment-cache-bits01     2014-07-01T23:07:58     success
deployment-cache-mobile03   2014-07-01T23:16:54     success
deployment-cache-text02     2014-07-01T22:59:29     success
deployment-cache-upload02   2014-07-01T23:11:50     success
deployment-db1              2014-07-01T23:13:30     success
deployment-elastic01        2014-07-01T23:16:40     success
deployment-elastic02        2014-07-01T23:12:00     success
deployment-elastic03        2014-07-01T23:03:58     success
deployment-elastic04        2014-07-01T23:17:53     success
deployment-eventlogging02   2014-07-01T23:03:34     success
deployment-fluoride         2014-07-01T23:04:05     success
deployment-graphite         2014-07-01T23:18:06     success
deployment-jobrunner01      2014-07-01T23:15:19     success
deployment-logstash1        2014-07-01T23:13:50     success
deployment-memc02           2014-07-01T23:05:08     success
deployment-memc04           2014-07-01T23:08:29     success
deployment-memc05           2014-07-01T23:08:14     success
deployment-parsoid04        2014-07-01T22:59:41     success
deployment-parsoidcache01   2014-06-21T16:18:31     success
deployment-pdf01            2014-07-01T23:02:53     success
deployment-pdf01            2014-07-01T23:02:53     success
deployment-redis01          2014-07-01T23:02:08     success
deployment-rsync01          2014-07-01T23:10:48     success
deployment-salt             2014-07-01T23:10:50     changed
deployment-stream           2014-07-01T23:10:54     success
deployment-stream           2014-07-01T23:10:54     success
deployment-upload           2014-07-01T23:03:05     success
deployment-videoscaler01    2014-07-01T23:06:39     success
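
A report like the one above can be scanned mechanically rather than by eye. The following is only a sketch: the function name `flag_stale_or_failed` and the cutoff argument are hypothetical, and it treats any status other than the two values seen in this report ("success" and "changed") as suspect, which is an assumption. It relies on these timestamps sorting correctly as plain strings.

```shell
#!/bin/sh
# Read "host  timestamp  status" rows on stdin and print the rows that look
# unhealthy: a last run older than the cutoff, or a status other than
# "success" or "changed" (assumed here to be the only healthy values).
flag_stale_or_failed() {
    cutoff="$1"   # e.g. 2014-07-01T00:00:00
    awk -v cutoff="$cutoff" '
        # Appending "" forces string comparison of the timestamps.
        NF >= 3 && (($2 "") < (cutoff "") || ($3 != "success" && $3 != "changed"))
    '
}
```

Fed the rows above with a cutoff of 2014-07-01T00:00:00, it prints only the deployment-parsoidcache01 row, whose last recorded run is 2014-06-21T16:18:31.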

Many thanks to Ori and anyone else who helped get this fixed.

Well done!

Getting such failures reported is covered by bug 67333 - Yell loudly of failed puppet runs on Beta Cluster instances

That bug depends on Bug 63296 - puppet labsstatus not reported when using role::puppet::self