Page MenuHomePhabricator

[Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191)
Closed, ResolvedPublic

Description

Write the description below

From email:

Date: Wed, 25 Aug 2021 08:15:08 +0000
From: root <root@paws-k8s-haproxy-1.paws.eqiad.wmflabs>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][paws] Puppet failure on paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191)


Puppet is having issues on the "paws-k8s-haproxy-1.paws.eqiad.wmflabs (172.16.0.191)" instance in project
paws in Wikimedia Cloud VPS.

Puppet is running with failures.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as member for the
project that contains this instance.  Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

If your host is expected to fail puppet runs and you want to disable this
alert, you can create a file under /.no-puppet-checks, that will skip the checks.

You might find some help here:
    https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on

For further support, visit #wikimedia-cloud on libera.chat or
<https://wikitech.wikimedia.org>

Some extra info follows:
---- Last run summary:
changes: {total: 1}
events: {failure: 1, success: 1, total: 2}
resources: {changed: 1, corrective_change: 1, failed: 1, failed_to_restart: 0, out_of_sync: 2,
  restarted: 0, scheduled: 0, skipped: 0, total: 772}
time: {augeas: 0.009310183, catalog_application: 188.33509273268282, config_retrieval: 6.249314604327083,
  convert_catalog: 0.5936704073101282, exec: 0.14733992200000004, fact_generation: 0.8722927533090115,
  file: 3.735077121999996, file_line: 0.011995403000000002, filebucket: 0.000210518,
  group: 0.000750726, host: 0.000989663, last_run: 1629878454, node_retrieval: 0.4398117121309042,
  notify: 0.007440145, package: 1.3270439319999996, plugin_sync: 0.8295837696641684,
  schedule: 0.0007430700000000001, service: 181.18330245200002, tidy: 0.000216749,
  total: 197.427444355, transaction_evaluation: 188.20645548216999, user: 0.001130999}
version: {config: '(e9aa6938a9) Manuel Arostegui - install_server: Reimage db1160
    to Buster', puppet: 5.5.22}


---- Failed resources if any:

  * Service[systemd-timesyncd]

---- Exceptions that happened when running the script if any:
  No exceptions happened.

Event Timeline

dcaro triaged this task as High priority.Aug 25 2021, 8:18 AM
dcaro created this task.

It's timing out trying to sync the time, it's happening on several instances, looking:

root@paws-k8s-haproxy-1:~# run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for paws-k8s-haproxy-1.paws.eqiad.wmflabs
Info: Applying configuration version '(e9aa6938a9) Manuel Arostegui - install_server: Reimage db1160 to Buster'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'


Error: Systemd start for systemd-timesyncd failed!
journalctl log for systemd-timesyncd:
-- Logs begin at Mon 2021-08-23 19:47:05 UTC, end at Wed 2021-08-25 08:22:28 UTC. --
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 36.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 37.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 38.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...

Error: /Stage[main]/Standard::Ntp::Timesyncd/Service[systemd-timesyncd]/ensure: change from 'stopped' to 'running' failed: Systemd start for systemd-timesyncd failed!
journalctl log for systemd-timesyncd:
-- Logs begin at Mon 2021-08-23 19:47:05 UTC, end at Wed 2021-08-25 08:22:28 UTC. --
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 36.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 08:18:56 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 37.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 08:20:58 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 38.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 08:22:28 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
 (corrective)
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 188.77 seconds

It started failing this morning at ~7:30:

Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: Stopping Network Time Synchronization...
### It worked here
Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Succeeded.
Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.
Aug 25 07:24:05 paws-k8s-haproxy-1 systemd[1]: Starting Network Time Synchronization...
### First failure
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: Failed to start Network Time Synchronization.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 1.
Aug 25 07:25:35 paws-k8s-haproxy-1 systemd[1]: Stopped Network Time Synchronization.

When running manually with debug logs it was passing:

root@paws-k8s-haproxy-1:~# SYSTEMD_LOG_LEVEL=debug /lib/systemd/systemd-timesyncd
...
Connecting to time server 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org).
Sent NTP request to 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org).
NTP response:
  leap         : 0
  version      : 4
  mode         : 4
  stratum      : 2
  precision    : 0.000000 sec (-24)
  root distance: 0.033295 sec
  reference    : n/a
  origin       : 1629881115.524
  receive      : 1629881115.523
  transmit     : 1629881115.523
  dest         : 1629881115.525
  offset       : -0.002 sec
  delay        : +0.001 sec
  packet count : 1
  jitter       : 0.000
  poll interval: 64
  adjust (slew): -0.002 sec
  status       : 8193 sync
  time now     : 1629881115.525
  constant     : 2
  offset       : -0.002 sec
  freq offset  : +2108154 (+32 ppm)
interval/delta/delay/jitter/drift 64s/-0.002s/0.001s/0.000s/+32ppm
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/timesync1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=3 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Synchronized to time server for the first time 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org)
...

Then I added the extra dubgging on the service file, and restarted the service, and the service did not fail either:

root@paws-k8s-haproxy-1:~# cat /lib/systemd/system/systemd-timesyncd.service;
...
[Service]
Environment="SYSTEMD_LOG_LEVEL=debug"
...

Will disable debugging see if that matters, and if not, enable debugging and leave it running for a bit see if it fails
and we get some logs.

There were many errors (from way before the first ntp error) about the nfsmount of cloudstore1009.wikimedia.org timing out,
I manually ran umount -l /mnt/nfs/secondary-cloudstore1009.wikimedia.org-scratch, an then it started working
continuously, so the issue seems related to that (though why timesyncd is bound to some nfs mount still escapes me).

Looking on the nfs side now.

Remounted the failing nfs mount, and now everything looks ok, there's no more nfs timeout messagen on dmesg, and the
timesyncd service starts correctly, closing.