
[Cloud VPS alert][admin-monitoring] Puppet failure on fullstackd-20210730081312.admin-monitoring.eqiad1.wikimedia.cloud (172.16.2.124)
Closed, ResolvedPublic

Description


Date: Fri, 30 Jul 2021 08:15:03 +0000
From: root <root@fullstackd-20210730081312.admin-monitoring.eqiad1.wikimedia.cloud>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][admin-monitoring] Puppet failure on fullstackd-20210730081312.admin-monitoring.eqiad1.wikimedia.cloud (172.16.2.124)


Puppet is having issues on the "fullstackd-20210730081312.admin-monitoring.eqiad1.wikimedia.cloud (172.16.2.124)" instance in project
admin-monitoring in Wikimedia Cloud VPS.

Puppet is running with failures.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as a member of the
project that contains this instance.  Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

You might find some help here:
    https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on

For further support, visit #wikimedia-cloud on libera.chat or
<https://wikitech.wikimedia.org>

Some extra info follows:
---- Last run summary:
changes: {total: 86}
events: {failure: 3, success: 86, total: 89}
resources: {changed: 85, corrective_change: 5, failed: 3, failed_to_restart: 0, out_of_sync: 88,
  restarted: 21, scheduled: 0, skipped: 1, total: 577}
time: {augeas: 0.024202556, catalog_application: 29.87210796, config_retrieval: 3.543330069999996,
  convert_catalog: 0.9437277170000016, exec: 0.2525210429999999, fact_generation: 1.9912280269999982,
  file: 9.129032023, file_line: 0.019586761, filebucket: 8.0898e-05, group: 0.000382166,
  host: 0.021642362, last_run: 1627632898, node_retrieval: 0.3417032560000024, notify: 0.001551692,
  package: 3.771781063, plugin_sync: 2.457820566999999, schedule: 0.000470519, service: 3.5470040249999997,
  tidy: 7.4171e-05, total: 39.450557831, transaction_evaluation: 29.717819590999994,
  user: 0.0005251}
version: {config: '(306b44d5ba) root - contacts: don''t fail if _role is not defined
    on labs realm', puppet: 5.5.22}
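The "Last run summary" above is the YAML that Puppet writes to `/var/lib/puppet/state/last_run_summary.yaml`. A minimal sketch of how a failure check can be derived from it (here with the summary shown as an already-parsed dict; the field names match the output above, but the helper function is hypothetical, not the actual alert script):

```python
# Hypothetical sketch: decide whether a Puppet run failed from the contents
# of last_run_summary.yaml (represented here as an already-parsed dict).
summary = {
    "events": {"failure": 3, "success": 86, "total": 89},
    "resources": {"changed": 85, "failed": 3, "failed_to_restart": 0, "total": 577},
}

def run_failed(summary):
    """Treat the run as failed if any event or resource reports a failure."""
    return (summary.get("events", {}).get("failure", 0) > 0
            or summary.get("resources", {}).get("failed", 0) > 0)

print(run_failed(summary))  # True for the summary in this alert
```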

---- Exceptions that happened if any:

Event Timeline

dcaro triaged this task as High priority. Jul 30 2021, 9:41 AM
dcaro created this task.

The VM does not exist anymore, so looking at the nova-fullstack service logs on cloudcontrol1003, it seems that there were no errors:

root@cloudcontrol1003:~# journalctl -u nova-fullstack.service -S "08:13:13" -U "08:16:53" | grep -A1 /var/lib/puppet/state/last_run_summary.yaml | tail -n1
Jul 30 08:16:13 cloudcontrol1003 nova-fullstack[40200]: 2021-07-30 08:16:13,031 DEBUG b'---\nversion:\n  config: "(0f6f31ec92) Giuseppe Lavagetto - docker_registry_ha: require authentication\n    from deployment servers"\n  puppet: 5.5.22\nresources:\n  changed: 1\n  corrective_change: 0\n  failed: 0\n  failed_to_restart: 0\n  out_of_sync: 1\n  restarted: 0\n  scheduled: 0\n  skipped: 0\n  total: 575\ntime:\n  augeas: 0.006951873\n  catalog_application: 6.0923924120000095\n  config_retrieval: 3.2497446220000086\n  convert_catalog: 0.5587014620000161\n  exec: 0.14554087799999993\n  fact_generation: 0.6029334930000232\n  file: 2.8663473280000007\n  file_line: 0.009938845000000002\n  filebucket: 8.272e-05\n  group: 0.000586177\n  host: 0.000421271\n  node_retrieval: 0.5321570460000089\n  notify: 0.008468699\n  package: 1.1177100919999998\n  plugin_sync: 0.8729322519999982\n  schedule: 0.0005349420000000001\n  service: 0.873748772\n  tidy: 0.000123285\n  total: 12.000931867\n  transaction_evaluation: 6.017569092000002\n  user: 0.000474708\n  last_run: 1627632957\nchanges:\n  total: 1\nevents:\n  failure: 0\n  success: 1\n  total: 1\n'

So, checking timestamps, it seems that Puppet ran a first time and failed, then ran again and succeeded, which let the fullstack service ssh in; when the service checked the results it only saw the working run (as it was the last one).
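The timestamp reasoning can be checked directly from the two `last_run` epochs: `1627632898` in the alert email's summary (the failed run) and `1627632957` in the fullstack log's summary (the working run). Converting them shows the second run finished about a minute after the first:

```python
# Convert the two last_run epochs from the summaries above to UTC times.
from datetime import datetime, timezone

failed_run = datetime.fromtimestamp(1627632898, tz=timezone.utc)
working_run = datetime.fromtimestamp(1627632957, tz=timezone.utc)

print(failed_run.isoformat())    # 2021-07-30T08:14:58+00:00
print(working_run.isoformat())   # 2021-07-30T08:15:57+00:00
print(working_run - failed_run)  # 0:00:59
```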

My guess here is that Puppet fails on its first run and passes on subsequent ones. That would also explain why we don't always get the alert email: the puppet error check has to happen to run right after that first failed run.

Looking into it.

Is that why there was no mission for an hour? T287752 has been created for an hour.

@IN hi, the two issues seem unrelated to me, do you have more context?

In T287747#7248468, @IN wrote:

Is that why there was no mission for an hour? T287752 has been created for an hour.

Please avoid leaving random comments on completely unrelated tasks with no idea what they're about. This task refers to issues with the way configuration and systems are managed in a Cloud VPS project; it certainly has nothing to do with Section Translation. Someone will get around to your task when they are available.

My work for T287309 may have introduced an early puppet failure -- now the puppet-agent tries to complete a puppet run before the firstboot script has had a chance to set everything up.

If that's the cause of the issue, then we should just figure out a way to not send the email in that case. I haven't investigated enough to confirm that's really what's happening though.
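The eventual fix ("Don't fail if the host is not ready") can be sketched as a guard in the alert check: skip alerting while the instance is still bootstrapping, so the expected first-run failure never triggers an email. This is a hypothetical illustration, not the actual patch; the marker path is an assumption:

```python
# Hypothetical sketch of a "host not ready" guard for the puppet alert.
# READY_MARKER is an assumed path (cloud-init's boot-finished marker), not
# necessarily what the real wmcs.puppet_alert patch checks.
import os

READY_MARKER = "/var/lib/cloud/instance/boot-finished"  # assumed path

def should_alert(summary_failed: bool, marker: str = READY_MARKER) -> bool:
    if not os.path.exists(marker):
        # Host is still running firstboot/cloud-init: early Puppet failures
        # are expected here, so suppress the alert.
        return False
    return summary_failed
```

With a guard like this, the race described above (alert check landing between the first failed run and the second successful one on a freshly created VM) no longer produces an email.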

Change 709477 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.puppet_alert: Add failed resources to the email

https://gerrit.wikimedia.org/r/709477

Change 709483 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.puppet_alert: Don't fail if the host is not ready

https://gerrit.wikimedia.org/r/709483

Change 709477 merged by David Caro:

[operations/puppet@production] wmcs.puppet_alert: Add failed resources to the email

https://gerrit.wikimedia.org/r/709477

Change 709483 merged by David Caro:

[operations/puppet@production] wmcs.puppet_alert: Don't fail if the host is not ready

https://gerrit.wikimedia.org/r/709483

This seems fixed by the last patch; I'll close this and reopen if it happens again.