
[Cloud VPS alert] Puppet failure on fullstackd-20210630081316.admin-monitoring.eqiad1.wikimedia.cloud
Closed, Resolved · Public

Description


Puppet is failing to run on the "fullstackd-20210630081316.admin-monitoring.eqiad1.wikimedia.cloud" instance in Wikimedia Cloud VPS.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as a member of the
project that contains this instance. Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

Event Timeline

dcaro triaged this task as High priority. Jun 30 2021, 8:51 AM
dcaro created this task.

Can't ssh, checking the VM console logs...

The VM has already been deleted; going to cloudcontrol1003 and looking at the nova-fullstack service logs, I found the
following:

Jun 30 08:13:38 cloudcontrol1003 nova-fullstack[17597]: 2021-06-30 08:13:38,344 INFO Resolving fullstackd-20210630081316.admin-monitoring.eqiad1.wikimedia.cloud from ['208.80.154.143', '208.80.154.24']
...
Jun 30 08:15:52 cloudcontrol1003 nova-fullstack[17597]: 2021-06-30 08:15:52,088 INFO Verify Puppet run on 172.16.6.94
Jun 30 08:15:52 cloudcontrol1003 nova-fullstack[17597]: 2021-06-30 08:15:52,088 DEBUG /usr/bin/ssh -o ConnectTimeout=5 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o NumberOfPasswordPrompts=0 -o LogLevel=ERROR -o ProxyCommand="ssh -o StrictHostKeyChecking=no -i /var/lib/osstackcanary/osstackcanary_id -W %h:%p osstackcanary@185.15.56.13" -i /var/lib/osstackcanary/osstackcanary_id osstackcanary@172.16.6.94 sudo cat /var/lib/puppet/state/last_run_summary.yaml
Jun 30 08:15:53 cloudcontrol1003 nova-fullstack[17597]: 2021-06-30 08:15:53,854 DEBUG b'---\nversion:\n  config: "(4d56b5dad5) Muehlenhoff - Convert sretest-logout.py to wmflib.idm"\n  puppet: 5.5.22\nresources:\n  changed: 1\n  corrective_change: 0\n  failed: 0\n  failed_to_restart: 0\n  out_of_sync: 1\n  restarted: 0\n  scheduled: 0\n  skipped: 0\n  total: 572\ntime:\n  augeas: 0.013420909\n  catalog_application: 5.284840900000006\n  config_retrieval: 3.056884027999999\n  convert_catalog: 0.3879573540000081\n  exec: 0.115499124\n  fact_generation: 0.4265277190000063\n  file: 2.5517806939999956\n  file_line: 0.009484509999999998\n  filebucket: 5.3519e-05\n  group: 0.000522323\n  host: 0.000427164\n  node_retrieval: 0.34602264899999113\n  notify: 0.004818269\n  package: 1.0284287440000004\n  plugin_sync: 0.5483560069999953\n  schedule: 0.000309881\n  service: 0.6656411299999999\n  tidy: 0.000139307\n  total: 10.134681187\n  transaction_evaluation: 5.223228191999993\n  user: 0.001029618\n  last_run: 1625040943\nchanges:\n  total: 1\nevents:\n  failure: 0\n  success: 1\n  total: 1\n'

That seems to indicate that the Puppet run did not fail (at least not that one): resources.failed, failed_to_restart and events.failure are all 0.
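
For reference, a summary like the one above can be checked mechanically. This is only a minimal sketch (assuming PyYAML; the helper name is hypothetical, not the actual nova-fullstack code) of the kind of test the fullstack verification performs on last_run_summary.yaml:

import yaml

def puppet_run_failed(summary_text: str) -> bool:
    # Parse the YAML fetched over ssh and report whether the run had
    # failures; for the run above this returns False.
    summary = yaml.safe_load(summary_text)
    resources = summary.get("resources", {})
    events = summary.get("events", {})
    return (
        resources.get("failed", 0) > 0
        or resources.get("failed_to_restart", 0) > 0
        or events.get("failure", 0) > 0
    )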

Looking to see if I can find anything else, but we might want to extend the alert to include more useful logs; see the sketch below.
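
For illustration only (hypothetical script, function names and thresholds; not the contents of the patch below), the alert could differentiate cases and include more context roughly like this:

import subprocess
import time

import yaml

SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"
MAX_AGE = 60 * 60 * 24  # treat runs older than a day as stale

def classify_puppet_state(host: str) -> str:
    # Fetch the last run summary over ssh and classify the instance,
    # so the alert email can say which case applies.
    try:
        out = subprocess.check_output(
            ["ssh", host, "sudo", "cat", SUMMARY], timeout=30)
    except (subprocess.SubprocessError, OSError):
        return "unreachable: could not fetch " + SUMMARY
    summary = yaml.safe_load(out)
    if summary.get("events", {}).get("failure", 0) > 0:
        return "failing: last Puppet run had failures"
    if time.time() - summary.get("time", {}).get("last_run", 0) > MAX_AGE:
        return "stale: Puppet has not run recently"
    return "ok: last Puppet run succeeded"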

Change 702331 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.puppet_alert: Add more info and differentiate cases

https://gerrit.wikimedia.org/r/702331

Change 702331 merged by David Caro:

[operations/puppet@production] wmcs.puppet_alert: Add more info and differentiate cases

https://gerrit.wikimedia.org/r/702331

This seems to be fixed by the same patch as T287747; will reopen if it happens again.