
CloudVPS: VMs with broken puppet 2019-07-14
Closed, Resolved (Public)

Description

Today (2019-07-14) I checked the puppet state of all VMs in CloudVPS.

aborrero@labpuppetmaster1001:~ $ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Catalog fetch fail"
[..]
    "apps-talk-pages.mobile.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "deployment-cache-upload05.deployment-prep.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "deployment-server.analytics.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "filippo-log-jessie01.logging.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "gerrit-test4.gerrit.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "jeh-puppetmaster.testlabs.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "keith-emostash2.logging.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "orig-01.wikibase-registry.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "pk8s.planet.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "puppet-jmm-pmaster-client.puppet.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "quarry-dev-01.quarry.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "rec-wiki.recommendation-api.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "toolsbeta-paws-master-01.toolsbeta.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "toolsbeta-paws-worker-1001.toolsbeta.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "toolsbeta-paws-worker-1002.toolsbeta.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "upgrader-04.library-upgrader.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "wikidata-autodesc.wikidata-autodesc.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",

We should probably contact the owners to clarify the status of these VMs. On the other hand, owners should already be getting daily warning emails about them, so it would be interesting to know how long the VMs have been in this state.
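To make contacting owners easier, the failing hosts can be grouped by project. The following is a minimal sketch, not part of the original check: it assumes the grep-filtered output has one `"host": "message",` pair per line, as in the listing above, and that hostnames follow the pattern `<instance>.<project>.eqiad.wmflabs`. The function names `failing_hosts` and `group_by_project` are hypothetical.

```python
import re
from collections import defaultdict

# Matches one line of the grep-filtered output: "host": "message",
LINE_RE = re.compile(r'^\s*"(?P<host>[^"]+)":\s*"(?P<msg>[^"]*)",?\s*$')


def failing_hosts(text):
    """Return the hostnames whose message reports a catalog fetch failure."""
    hosts = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and "Catalog fetch fail" in match.group("msg"):
            hosts.append(match.group("host"))
    return hosts


def group_by_project(hosts):
    """Group FQDNs by project (assumed to be the second DNS label)."""
    projects = defaultdict(list)
    for fqdn in hosts:
        labels = fqdn.split(".")
        # Assumption: <instance>.<project>.eqiad.wmflabs
        project = labels[1] if len(labels) >= 4 else "unknown"
        projects[project].append(labels[0])
    return dict(projects)


# Three lines copied from the 2019-07-14 output above.
sample = '''
    "quarry-dev-01.quarry.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "toolsbeta-paws-master-01.toolsbeta.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "toolsbeta-paws-worker-1001.toolsbeta.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
'''

by_project = group_by_project(failing_hosts(sample))
for project, instances in sorted(by_project.items()):
    print(project, ", ".join(instances))
```

Grouping by project gives one line per set of owners to contact, rather than one per VM.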

Event Timeline

On 2019-10-10:

bd808@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Catalog fetch fail"
[...]
    "af-nb-db-1.automation-framework.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "af-netbox01.automation-framework.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "af-puppetdb01.automation-framework.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "af-puppetdb02.automation-framework.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "deployment-server.analytics.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "filippo-log-jessie01.logging.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "puppet-jmm-pmaster-client.puppet.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",

On 2019-11-27:

phamhi@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Catalog fetch fail"
...
    "af-netbox01.automation-framework.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "deployment-server.analytics.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "puppet-jmm-pmaster-client.puppet.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",

On 2019-12-09:

phamhi@cloud-cumin-01:~$ sudo cumin --force --timeout 500 -o json  "A:all" "/usr/local/lib/nagios/plugins/check_puppetrun -w 3600 -c 86400" | grep "Catalog fetch fail"
...
    "deployment-server.analytics.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",
    "puppet-jmm-pmaster-client.puppet.eqiad.wmflabs": "CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues",