Investigate the cause of puppet failures on Tools
Closed, Resolved, Public

Description

There have been a large number of puppet failure alerts in tools labs, to the point where the labs admins have learned to essentially ignore them as noise. While there appear to be a number of causes, at least some of the intermittent failures currently have no explanation and recur.

This needs slightly deeper investigation.

coren created this task. Aug 3 2015, 5:56 PM
coren updated the task description.
coren raised the priority of this task from to Needs Triage.
coren claimed this task.
coren added projects: Labs, Tool-Labs, Labs-Sprint-108.
coren added a subscriber: coren.
Restricted Application added a subscriber: Aklapper. Aug 3 2015, 5:56 PM

See labs-l thread about OOM. @scfc was looking into that as well.

yuvipanda renamed this task from "Investigate the cause of the (apparently) spurious puppet failures on Tools" to "Investigate the cause of puppet failures on Tools". Aug 3 2015, 5:57 PM
yuvipanda set Security to None.

Also note that these are new puppet failures - the older ones were just noise from 'puppetmaster restarted to rotate logs' and always happened at the same time.

coren closed this task as Resolved. Aug 10 2015, 4:30 PM

I've examined the logs for the puppet failures and it does seem that the current causes are mostly genuine issues that should be looked into. (Most were the OOM issue pointed out above, some were manifest errors, and at least one was a package conflict.)

The remainder is caused by the race condition between the puppet fileserver and the manifests - a problem which has no clear solution in sight.

I think it is safe to close this issue and revisit if we get a new rash of puppet errors that cannot be reproduced by making a puppet run on the command line.
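
For anyone checking whether an alert reproduces by hand, here is a minimal sketch (not what was actually used for this investigation). It assumes the standard `puppet agent --test` command, which enables detailed exit codes; the helper names `describe`, `is_failure` and `run_puppet_once` are hypothetical and exist only for this illustration.

```python
import subprocess

# Documented meanings of the puppet agent exit status when detailed
# exit codes are enabled (which `--test` does).
EXIT_MEANINGS = {
    0: "run succeeded, no changes",
    1: "run failed, or was not attempted because another run was in progress",
    2: "run succeeded, some resources changed",
    4: "run succeeded, some resources failed",
    6: "run succeeded, with both changes and failures",
}


def describe(code):
    """Return a human-readable meaning for a puppet agent exit code."""
    return EXIT_MEANINGS.get(code, "unexpected exit code %d" % code)


def is_failure(code):
    """Treat anything other than 'no changes' or 'changes applied' as a failure."""
    return code not in (0, 2)


def run_puppet_once(cmd=("puppet", "agent", "--test")):
    """Run puppet once in the foreground; return (exit code, combined output).

    Hypothetical helper: needs root and a configured agent on the instance.
    """
    result = subprocess.run(list(cmd), capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    code, output = run_puppet_once()
    print(output)
    print("%s (failure: %s)" % (describe(code), is_failure(code)))
```

If a manual run like this comes back clean while the alert keeps firing, that points toward the intermittent class of failure rather than a manifest or package problem.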

coren moved this task from To Do to Done on the Labs-Sprint-108 board. Aug 10 2015, 4:31 PM