[Cloud VPS alert][cloudinfra-codfw1dev] Puppet failure on ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (172.16.128.179)
Closed, Resolved (Public)

Description

From email:

Date: Thu, 25 Nov 2021 08:15:05 +0000
From: root <root@ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][cloudinfra-codfw1dev] Puppet failure on ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (172.16.128.179)


Puppet is having issues on the "ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud (172.16.128.179)" instance in project
cloudinfra-codfw1dev in Wikimedia Cloud VPS.

Puppet is running with failures.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as member for the
project that contains this instance.  Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

If your host is expected to fail puppet runs and you want to disable this
alert, you can create a file under /.no-puppet-checks, that will skip the checks.

You might find some help here:
    https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on

For further support, visit #wikimedia-cloud on libera.chat or
<https://wikitech.wikimedia.org>


Some extra info follows:
---- Last run summary:
changes:
  total: 1
events:
  failure: 1
  success: 1
  total: 2
resources:
  changed: 1
  corrective_change: 1
  failed: 1
  failed_to_restart: 0
  out_of_sync: 2
  restarted: 0
  scheduled: 0
  skipped: 0
  total: 443
time:
  augeas: 0.008286048
  catalog_application: 5.071433247998357
  config_retrieval: 2.4902771930210292
  convert_catalog: 0.3685375089989975
  exec: 0.12376264499999999
  fact_generation: 0.32436568301636726
  file: 2.551211253
  file_line: 0.002842567
  filebucket: 4.852e-05
  group: 0.000927392
  last_run: 1637826820
  node_retrieval: 0.2779838900314644
  notify: 0.004758526
  package: 0.559228629
  plugin_sync: 0.6879859289620072
  schedule: 0.000465333
  service: 0.9200921579999999
  tidy: 0.000304051
  total: 9.284562957
  transaction_evaluation: 4.9448587680235505
  user: 0.00102175
version:
  config: '(1b14071c2e) Manuel Arostegui - db1128: Move it to test-s1'
  puppet: 5.5.22


---- Failed resources if any:

  * Service[systemd-timesyncd]

---- Exceptions that happened when running the script if any:
  No exceptions happened.

This has been failing for a while.
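
For quick triage, the "Last run summary" from the email above can be checked mechanically. The following is a minimal sketch, assuming the summary keeps the two-level `section:` / `key: value` layout shown there. `parse_summary` and `failed_run` are hypothetical helper names, and the parser is deliberately tiny: it only handles this layout and is not a general YAML parser.

```python
def parse_summary(text):
    """Parse the two-level 'section:' / '  key: value' layout of the summary.

    Assumption: every top-level line is a bare section header and every
    indented line is a 'key: value' pair. Values are kept as strings.
    """
    summary = {}
    section = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        # Split on the first colon only, so values containing ':' survive.
        key, _, value = raw.strip().partition(":")
        if not raw.startswith(" "):
            # Top-level section header such as "resources:".
            section = summary.setdefault(key, {})
        elif section is not None:
            section[key] = value.strip()
    return summary


def failed_run(summary):
    """Return True if the summary reports failed resources or failure events."""
    resources = summary.get("resources", {})
    events = summary.get("events", {})
    return int(resources.get("failed", 0)) > 0 or int(events.get("failure", 0)) > 0


# Excerpt of the summary quoted in the alert email.
SAMPLE = """\
changes:
  total: 1
events:
  failure: 1
  success: 1
  total: 2
resources:
  changed: 1
  failed: 1
  total: 443
"""

if __name__ == "__main__":
    print("puppet run failed:", failed_run(parse_summary(SAMPLE)))
```

The summary only gives counts; the resource that actually failed (here `Service[systemd-timesyncd]`) still has to come from the "Failed resources" part of the email or from the agent logs.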

Event Timeline

dcaro triaged this task as High priority. (Thu, Nov 25, 9:22 AM)
dcaro created this task.
dcaro moved this task from To refine to Today on the User-dcaro board.
dcaro changed the task status from Open to In Progress. (Thu, Nov 25, 9:40 AM)
dcaro moved this task from Today to Doing on the User-dcaro board.

Change 741849 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] timesyncd: add package requirement

https://gerrit.wikimedia.org/r/741849

These machines are not supposed to have timesyncd present (they run ntpd to act as NTP servers). I see the hiera key

profile::systemd::timesyncd::ensure: absent

in Horizon. Is that not working for some reason?

So, that host uses the project internal puppetmaster:

root@ntp-1:~# puppet config --section agent print server
2021-11-25 10:32:03.546464 WARN  puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
cloudinfra-internal-puppetmaster-01.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud

That server's ENC does not seem to report that value:

root@cloudinfra-internal-puppetmaster-01:~# puppet config --section master print external_nodes
/usr/local/bin/puppet-enc

root@cloudinfra-internal-puppetmaster-01:~# /usr/local/bin/puppet-enc ntp-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud
classes: ['role::wmcs::services::ntp']
parameters: {}
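
The same check can be scripted when comparing several hosts. This is only a sketch under assumptions: it parses output shaped like the two lines above (one `key: value` per line, with values that happen to be valid Python literals) and assumes the hiera key would show up under `parameters` if the ENC reported it. `parse_enc_output` and `enc_has_parameter` are hypothetical helper names, not part of `puppet-enc`.

```python
import ast


def parse_enc_output(text):
    """Parse ENC output shaped like the two lines pasted above.

    Assumption: one 'key: value' pair per line, where the value is a
    flow-style list or mapping that is also a valid Python literal.
    This is not a general YAML parser.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only; class names contain '::'.
        key, _, value = line.partition(":")
        result[key.strip()] = ast.literal_eval(value.strip())
    return result


def enc_has_parameter(enc, name):
    """Return True if the parsed ENC output carries the given parameter."""
    return name in enc.get("parameters", {})


# Output pasted from cloudinfra-internal-puppetmaster-01 above.
ENC_OUTPUT = """\
classes: ['role::wmcs::services::ntp']
parameters: {}
"""

if __name__ == "__main__":
    enc = parse_enc_output(ENC_OUTPUT)
    key = "profile::systemd::timesyncd::ensure"
    print(key, "reported by ENC:", enc_has_parameter(enc, key))
```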

Just noticed this host is in codfw1dev, i.e. labtest, xd.
Looking.

Change 741849 merged by David Caro:

[operations/puppet@production] timesyncd: handle bullseye ntp hosts

https://gerrit.wikimedia.org/r/741849

Change 742107 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] timsyncd: Flip the handling service condition

https://gerrit.wikimedia.org/r/742107

Change 742107 merged by David Caro:

[operations/puppet@production] timsyncd: Flip the handling service condition

https://gerrit.wikimedia.org/r/742107

dcaro moved this task from Doing to Done on the User-dcaro board.