
[Cloud VPS alert][cloudvirt-canary] Puppet failure on canary1044-01.cloudvirt-canary.eqiad1.wikimedia.cloud (172.16.3.177)
Closed, Resolved (Public)

Description


Email received with the alert:

Date: Mon, 26 Jul 2021 08:15:03 +0000
From: root <root@canary1044-01.cloudvirt-canary.eqiad1.wikimedia.cloud>
To: dcaro@wikimedia.org
Subject: [Cloud VPS alert][cloudvirt-canary] Puppet failure on canary1044-01.cloudvirt-canary.eqiad1.wikimedia.cloud (172.16.3.177)


Puppet is having issues on the "canary1044-01.cloudvirt-canary.eqiad1.wikimedia.cloud (172.16.3.177)" instance in project
cloudvirt-canary in Wikimedia Cloud VPS.

Puppet is running with failures.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as member for the
project that contains this instance.  Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

You might find some help here:
    https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Cloud_VPS_alert_Puppet_failure_on

For further support, visit #wikimedia-cloud on libera.chat or
<https://wikitech.wikimedia.org>

Some extra info follows:
---- Last run summary:
changes: {total: 1}
events: {failure: 1, success: 1, total: 2}
resources: {changed: 1, corrective_change: 0, failed: 1, failed_to_restart: 0, out_of_sync: 2,
  restarted: 0, scheduled: 0, skipped: 0, total: 573}
time: {augeas: 0.016067946, catalog_application: 6.311372339725494, config_retrieval: 3.837283907458186,
  convert_catalog: 0.4087558565661311, exec: 0.430169381, fact_generation: 0.8493666732683778,
  file: 2.9797071339999985, file_line: 0.011917505, filebucket: 4.9002e-05, group: 0.000592379,
  host: 0.000401658, last_run: 1627285550, node_retrieval: 0.6049147974699736, notify: 0.005156779,
  package: 1.117646943, plugin_sync: 0.9399111736565828, schedule: 0.000252771, service: 0.7661277689999997,
  tidy: 0.000176439, total: 13.027807069, transaction_evaluation: 6.254771231673658,
  user: 0.000751797}
version: {config: '(32bd2ba79c) Bstorm - toolforge harbor: puppetize experimental
    base server for harbor', puppet: 5.5.22}

---- Exceptions that happened if any:

Event Timeline

Manually ran Puppet on the VM; it does not seem to have enough memory to complete a run:

dcaro@canary1044-01:~$ sudo -i
root@canary1044-01:~# puppet agent --test
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for canary1044-01.cloudvirt-canary.eqiad1.wikimedia.cloud
Info: Applying configuration version '(9db8aeac15) Muehlenhoff - Fix permissions for /usr/sbin/policy-rc.d'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Error: /Stage[main]/Ldap::Client::Sssd/Package[sudo-ldap]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Base::Service_auto_restart[systemd-journald]/Systemd::Timer::Job[wmf_auto_restart_systemd-journald]/Systemd::Timer[wmf_auto_restart_systemd-journald]/Systemd::Service[wmf_auto_restart_systemd-journald]/Service[wmf_auto_restart_systemd-journald.timer]: Could not evaluate: Cannot allocate memory - fork(2)
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 7.06 seconds
root@canary1044-01:~# free -m
              total        used        free      shared  buff/cache   available
Mem:            484         234         102           5         147         232
Swap:             0           0           0
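
The free -m output above shows only 232 MiB available and no swap, which fits the fork(2) allocation failures. As a rough sketch of how that check could be scripted, the helpers below read free -m output on stdin. The function names and the threshold are hypothetical, and the field position assumes the procps-ng column layout shown above.

```shell
#!/bin/sh
# Sketch only: function names and thresholds are illustrative assumptions.

# Print the "available" column (MiB) from `free -m` output read on stdin.
# Assumes "available" is the 7th field of the "Mem:" row, as in the
# output shown in this task.
available_mib() {
    awk '$1 == "Mem:" { print $7 }'
}

# Succeed (exit 0) when available memory is below the given threshold in MiB.
# Reads `free -m` output on stdin.
low_memory() {
    threshold_mib=$1
    avail=$(available_mib)
    [ -n "$avail" ] && [ "$avail" -lt "$threshold_mib" ]
}

# Example use on a live host (256 is an arbitrary example threshold):
#   free -m | low_memory 256 && echo "low memory, Puppet may fail to fork"
```

Reading from stdin rather than calling free directly keeps the helpers easy to try against pasted output such as the block above.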

Killing diamond seems to free enough memory; maybe we can avoid starting it in the first place.
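
To confirm which daemons hold the most memory on a small VM like this one, resident memory can be totalled per command name. The helper below is a minimal sketch with a hypothetical name; it expects lines of "RSS-in-KiB command" on stdin and is not something that was run as part of this task.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Sum resident set size per command name and print "MiB command",
# largest first. Input: one "rss_kib command" pair per line on stdin.
# Only the first word of the command name is used, so names containing
# spaces are truncated.
rss_by_command() {
    awk '{ rss[$2] += $1 }
         END { for (c in rss) printf "%d %s\n", rss[c] / 1024, c }' |
        sort -rn
}

# Example use on a live host:
#   ps -e -o rss= -o comm= | rss_by_command | head
```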

Mentioned in SAL (#wikimedia-cloud) [2021-07-26T13:31:35Z] <dcaro> disabled diamond on the machines (T287350)