Alert on https://vpsalertmanager.toolforge.org:
SUMMARY: Puppet agent failure detected on instance paws-k8s-control-3 in project paws
5 minutes ago
instance: paws-k8s-control-3
Mentioned in SAL (#wikimedia-cloud) [2021-07-21T08:01:25Z] <dcaro> Manually run puppet, and it started systemd-timsyncd but did not fail (T287068)
systemd-timesyncd kept timing out without reaching any time server until I ran it manually:
Jul 21 07:53:31 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Failed to start Network Time Synchronization.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 59.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Stopped Network Time Synchronization.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: Failed to start Network Time Synchronization.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 60.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: Stopped Network Time Synchronization.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:58:45 paws-k8s-control-3 systemd-timesyncd[24379]: Synchronized to time server for the first time 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org).
Jul 21 07:59:25 paws-k8s-control-3 systemd[1]: Started Network Time Synchronization.
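To get a feel for how long the unit had been failing, the journal excerpt above can be summarized with a small script. This is only a sketch: the `summarize` and `parse_ts` helpers are hypothetical names, the year 2021 is taken from the SAL entry because the journal lines carry no year, and the final estimate assumes every earlier restart also hit the same start timeout, which the excerpt does not show.

```python
import re
from datetime import datetime

# A few lines copied from the journal excerpt above (not the full log).
LOG = """\
Jul 21 07:53:31 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 59.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 60.
"""


def parse_ts(line, year=2021):
    """Parse the 'Mon DD HH:MM:SS' prefix; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def summarize(log):
    """Return (seconds each start attempt ran before timing out, last restart counter)."""
    start = None
    timeouts = []
    last_counter = None
    for line in log.splitlines():
        if "Starting Network Time Synchronization" in line:
            start = parse_ts(line)
        elif "Start operation timed out" in line and start is not None:
            timeouts.append((parse_ts(line) - start).total_seconds())
            start = None
        match = re.search(r"restart counter is at (\d+)", line)
        if match:
            last_counter = int(match.group(1))
    return timeouts, last_counter


timeouts, last_counter = summarize(LOG)
# Rough estimate only: assumes all earlier attempts lasted as long as the last one.
estimated_minutes = last_counter * timeouts[-1] / 60
print(timeouts, last_counter, estimated_minutes)
```

Under that assumption, both attempts in the excerpt ran for 90 seconds before being terminated, and a restart counter of 60 would put the start of the failures roughly an hour and a half before the first successful sync. The full journal on the instance would be needed to confirm this.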
On the server it eventually synchronized with (ntp-01), the logs show no clear sign of anything suspicious
happening during that window (I checked that the clocks were not too far apart when comparing timestamps):
root@ntp-01:~# journalctl -S "07:54:00" -U "07:59:00"
-- Logs begin at Fri 2021-07-16 20:24:45 UTC, end at Wed 2021-07-21 08:13:09 UTC. --
Jul 21 07:54:05 ntp-01 systemd[1]: Started Update Debian version stat exported by node_exporter.
Jul 21 07:54:05 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:55:01 ntp-01 CRON[13697]: pam_unix(cron:session): session opened for user root by (uid=0)
Jul 21 07:55:01 ntp-01 systemd[1]: Started Regular job to collect active shell session information.
Jul 21 07:55:01 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:55:01 ntp-01 CRON[13701]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 21 07:55:01 ntp-01 systemd[1]: prometheus_ssh_open_sessions.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 07:55:01 ntp-01 systemd[1]: prometheus_ssh_open_sessions.service: Unit entered failed state.
Jul 21 07:55:01 ntp-01 systemd[1]: prometheus_ssh_open_sessions.service: Failed with result 'exit-code'.
Jul 21 07:55:02 ntp-01 CRON[13697]: pam_unix(cron:session): session closed for user root
Jul 21 07:56:09 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:57:09 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:58:09 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.