This is a Tracking-Neverending task for any Puppet errors we might encounter on the beta cluster infrastructure. Puppet errors typically occur when Puppet changes are made that do not take beta into account (typically, a class being renamed but not updated in wikitech, hiera parameters missing ...).
Event Timeline
Puppet failures related to Ganglia would be due to T134808, which has fixes cherry-picked on the beta puppet master, but they do not cover every case.
Is this really best as a tracking task or should we add it to the deployment-prep workboard column? The task by its nature is always gonna be open (or reopened).
It's fine with me if you want to move them all to a particular workboard column instead of a tracking task
-snapshot01 is T184270 (package it wants is missing from stretch, moritz to fix when higher priority things are done)
As of today there are 16 shinken alerts (most puppet but at least one disk warning) on this project, and three VMs that are shut down but not deleted. All of this is viewable here: http://shinken.wmflabs.org/problems?search=deployment
deployment-videoscaler01 seems to no longer exist?
$ ssh -a deployment-videoscaler01
channel 0: open failed: connect failed: No route to host
stdio forwarding failed
ssh_exchange_identification: Connection closed by remote host
@Andrew et al. Some docs on Wikitech about common Puppet errors and how to fix them would IMHO help. I feel some of us who have access to deployment-prep could help if we had some guidance. Also, IRC assistance would be great, if at all possible. Thanks.
I've now created a "Puppet errors" column on the Beta-Cluster-Infrastructure workboard and moved all open subtasks to that column.