Puppet agent failure detected on instance paws-k8s-control-3 in project paws
Closed, Resolved, Public

Description

Alert on https://vpsalertmanager.toolforge.org:

SUMMARY: Puppet agent failure detected on instance paws-k8s-control-3 in project paws
5 minutes ago
instance: paws-k8s-control-3

Event Timeline

dcaro triaged this task as High priority. Jul 21 2021, 7:56 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-07-21T08:01:25Z] <dcaro> Manually run puppet, and it started systemd-timsyncd but did not fail (T287068)

systemd-timesyncd kept timing out while trying to contact any time server, until I ran it manually:

Jul 21 07:53:31 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Failed to start Network Time Synchronization.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 59.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Stopped Network Time Synchronization.
Jul 21 07:55:01 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Start operation timed out. Terminating.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Main process exited, code=killed, status=15/TERM
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Failed with result 'timeout'.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: Failed to start Network Time Synchronization.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 60.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: Stopped Network Time Synchronization.
Jul 21 07:56:31 paws-k8s-control-3 systemd[1]: Starting Network Time Synchronization...
Jul 21 07:58:45 paws-k8s-control-3 systemd-timesyncd[24379]: Synchronized to time server for the first time 172.16.2.81:123 (ntp-01.cloudinfra.wmflabs.org).
Jul 21 07:59:25 paws-k8s-control-3 systemd[1]: Started Network Time Synchronization.
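
For future occurrences, the pattern in the log above (a start timeout every 90 seconds, a growing restart counter, then a first successful sync) can be summarised mechanically. The following is only a rough sketch: summarize_timesyncd is a hypothetical helper name, not an existing tool, and it assumes the journal messages keep the exact wording shown in this task.

```shell
# Hypothetical helper: summarise systemd-timesyncd start failures.
# Reads journal text on stdin and assumes the message wording seen in this task.
summarize_timesyncd() {
    awk '
        # Count every start attempt that systemd killed after the timeout.
        /systemd-timesyncd\.service: Start operation timed out/ { timeouts++ }

        # Remember the most recent restart counter (last field, trailing dot removed).
        /restart counter is at/ {
            counter = $NF
            sub(/\.$/, "", counter)
        }

        # Note whether the daemon ever reached a time server.
        /Synchronized to time server/ { synced = 1 }

        END {
            printf "timeouts=%d last_restart_counter=%s synced=%s\n",
                timeouts, (counter == "" ? "none" : counter), (synced ? "yes" : "no")
        }
    '
}
```

It could be fed from the journal, for example with journalctl -u systemd-timesyncd -S "07:50:00" -U "08:00:00" | summarize_timesyncd. Run against only the excerpt pasted above, it should print timeouts=2 last_restart_counter=60 synced=yes.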

On the server it was eventually able to connect to (ntp-01), there is no clear indication of any suspicious activity or change
during that time (I checked that the clocks were not too far apart when comparing timestamps):

root@ntp-01:~# journalctl -S "07:54:00" -U "07:59:00"
-- Logs begin at Fri 2021-07-16 20:24:45 UTC, end at Wed 2021-07-21 08:13:09 UTC. --
Jul 21 07:54:05 ntp-01 systemd[1]: Started Update Debian version stat exported by node_exporter.
Jul 21 07:54:05 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:55:01 ntp-01 CRON[13697]: pam_unix(cron:session): session opened for user root by (uid=0)
Jul 21 07:55:01 ntp-01 systemd[1]: Started Regular job to collect active shell session information.
Jul 21 07:55:01 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:55:01 ntp-01 CRON[13701]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 21 07:55:01 ntp-01 systemd[1]: prometheus_ssh_open_sessions.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 07:55:01 ntp-01 systemd[1]: prometheus_ssh_open_sessions.service: Unit entered failed state.
Jul 21 07:55:01 ntp-01 systemd[1]: prometheus_ssh_open_sessions.service: Failed with result 'exit-code'.
Jul 21 07:55:02 ntp-01 CRON[13697]: pam_unix(cron:session): session closed for user root
Jul 21 07:56:09 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:57:09 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.
Jul 21 07:58:09 ntp-01 systemd[1]: Started Regular job to collect puppet agent stats.

Everything seems stable now, closing.