Checking acmechief1001 after the following Icinga alert was triggered:
PROBLEM - Ensure that passive node gets the certificates from the active node as expected on acmechief2001 is CRITICAL: FILE_AGE CRITICAL: /var/lib/acme-chief/certs/.rsync.status is 7246 seconds old and 0 bytes https://wikitech.wikimedia.org/wiki/Acme-chief
Puppet fails to run on acmechief1001 because fork(2) cannot allocate memory:
vgutierrez@acmechief1001:~$ sudo -i puppet agent -t
2019-09-28 16:27:06.261665 WARN puppetlabs.facter - locale environment variables were bad; continuing with LANG=C LC_ALL=C
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for acmechief1001.eqiad.wmnet
Info: Applying configuration version '1569688031'
Error: /Stage[main]/Base::Standard_packages/Package[libperl5.24]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[ruby2.3]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[libruby2.3]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[libunbound2]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Base::Standard_packages/Package[mcelog]: Could not evaluate: Cannot allocate memory - fork(2)
...
A quick check of the process list shows acme-chief-backend consuming 80% of the memory:
acme-ch+ 19080 2.0 80.0 2598632 1634560 ? Ss Sep23 155:55 /usr/bin/python3 /usr/bin/acme-chief-backend
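To put the numbers in that line in context, here is a minimal sketch that splits it into fields and converts the resident set size to GiB. It assumes the line comes from `ps aux` (columns USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND, with VSZ and RSS in KiB); the helper name `parse_ps_aux_line` is hypothetical. The total-memory figure is only an estimate, since %MEM is rounded to one decimal place.

```python
def parse_ps_aux_line(line):
    """Split one `ps aux` row into named fields.

    Assumes the standard 11-column `ps aux` layout; COMMAND is the last
    column and may itself contain spaces, so split at most 10 times.
    """
    fields = line.split(None, 10)
    return {
        "user": fields[0],
        "pid": int(fields[1]),
        "cpu_percent": float(fields[2]),
        "mem_percent": float(fields[3]),
        "vsz_kib": int(fields[4]),
        "rss_kib": int(fields[5]),
        "command": fields[10],
    }


# The line pasted in this report.
line = ("acme-ch+ 19080 2.0 80.0 2598632 1634560 ? Ss Sep23 155:55 "
        "/usr/bin/python3 /usr/bin/acme-chief-backend")

proc = parse_ps_aux_line(line)
rss_gib = proc["rss_kib"] / 1024 ** 2
# %MEM is RSS as a share of physical memory, so this is a rough estimate.
total_gib = rss_gib / (proc["mem_percent"] / 100)

print(f"pid {proc['pid']}: RSS {rss_gib:.2f} GiB, "
      f"estimated host memory {total_gib:.2f} GiB")
```

Under these assumptions the backend holds roughly 1.56 GiB resident on a host with about 2 GiB of RAM, which is consistent with fork(2) failing for other processes.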
This is the first time this behaviour has been observed, so it looks like the issue was introduced with acme-chief 0.21.
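One way to test the leak hypothesis would be to sample the backend's resident set size over time and see whether it keeps growing. The following is a sketch only, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line; the function names are hypothetical and not part of acme-chief, and steady growth is a crude indicator rather than proof of a leak.

```python
import re


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found in status text")
    return int(match.group(1))


def read_rss_kib(pid):
    """Read the current RSS of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kib(handle.read())


def grows_steadily(samples):
    """True if the samples never decrease and the last exceeds the first."""
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] > samples[0]
```

For example, calling `read_rss_kib(19080)` at a fixed interval and passing the collected values to `grows_steadily` would show whether memory use only ever climbs between restarts.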