
Netbox missing physical device in PuppetDB when Puppet disabled for too long
Open, Low, Public

Description

https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ is currently alerting with:

cloudbackup2001 missing physical device in PuppetDB: state Active in Netbox

This is because:

The last Puppet run was at Wed May 27 23:27:35 UTC 2020

As far as I've been told, once Puppet has been disabled on a host for a certain time (14 days, I think), the host is purged from PuppetDB. It shows as missing in the report even sooner, though I forget the exact reason.

In addition to the problem of having an unpuppetized and unmonitored host in prod for so long, I was wondering if:
1/ The purge timeout should be extended
2/ The host should have its status changed to FAILED or to STAGED depending on the reason Puppet is disabled (which might mean an update to the server lifecycle page, as Active -> Failed is for physical maintenance only).
3/ The Netbox report should check in a different way whether the host is Puppetized

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

As far as I've been told, once Puppet has been disabled on a host for a certain time (14 days, I think), the host is purged from PuppetDB. It shows as missing in the report even sooner, though I forget the exact reason.

That's correct. We don't explicitly configure node-purge-ttl, so we use the default of 14 days, and nodes are purged if they have not been in contact with the puppet master for 14 days. We also set report-ttl to 1 day; however, facts and resources should still exist in the database for the full 14 days.

That said, when I checked last week, cloudbackup2001 still had some data in PuppetDB (i.e. it had not been purged), yet Netbox showed an error, so I think there may still be a piece missing from this puzzle. Unfortunately I didn't have time to dig.
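To put dates on the TTLs mentioned above, here is a minimal sketch of the arithmetic only. It assumes the simplified model described in this comment (data expires a fixed TTL after the last contact) and uses the last Puppet run time from the task description. The helper name `expiry` is made up for illustration, and this does not model how PuppetDB actually schedules its garbage collection.

```python
from datetime import datetime, timedelta, timezone

# TTL values as stated in this task; nothing here is read from PuppetDB.
NODE_PURGE_TTL = timedelta(days=14)  # default, not explicitly configured
REPORT_TTL = timedelta(days=1)       # value we set


def expiry(last_contact: datetime, ttl: timedelta) -> datetime:
    """Return the earliest time at which data last refreshed at
    `last_contact` could expire under the simplified model above."""
    return last_contact + ttl


# Last Puppet run on cloudbackup2001, from the task description.
last_run = datetime(2020, 5, 27, 23, 27, 35, tzinfo=timezone.utc)

print("reports may expire after:", expiry(last_run, REPORT_TTL).isoformat())
print("node may be purged after:", expiry(last_run, NODE_PURGE_TTL).isoformat())
```

Under these assumptions the node itself would not be eligible for purging before 2020-06-10 23:27:35 UTC, which is consistent with it still having data in PuppetDB when checked.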

In addition to the problem of having an unpuppetized and unmonitored host in prod for so long, I was wondering if:
1/ The purge timeout should be extended

I think that to get any meaningful change we would have to extend this by quite a bit (30+ days?). That has a significant impact on the size of the database and on lookup times, meaning each Puppet compile would be slower. It's hard to give exact numbers, but personally this is my least preferred option.

2/ The host should have its status changed to FAILED or to STAGED depending on the reason Puppet is disabled (which might mean an update to the server lifecycle page, as Active -> Failed is for physical maintenance only).

This sounds like a good option to me, though I have no idea how much work is involved.

3/ The Netbox report should check in a different way whether the host is Puppetized

We could check the manifest.pp file?
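As a rough illustration of what a manifest-based check might look like, the sketch below scans manifest text for Puppet `node` definitions and tests whether a hostname matches a quoted name or a `/regex/` pattern. Everything in it is an assumption: the function name `is_declared`, the example hostnames, and the sample manifest are hypothetical, the name list is expected to sit on the same line as the `node` keyword, and Puppet regexes are Ruby regexes, which Python's `re` only approximates. It is not how the Netbox report currently works.

```python
import re

# A line beginning with the `node` keyword; capture the rest of that line.
NODE_LINE = re.compile(r"^\s*node\s+(.*)$", re.MULTILINE)
# A single-quoted name, a double-quoted name, or a /regex/ pattern.
NAME = re.compile(r"'([^']*)'|\"([^\"]*)\"|/((?:\\.|[^/\\])+)/")


def is_declared(fqdn: str, manifest_text: str) -> bool:
    """Return True if `fqdn` matches a literal or regex node definition."""
    for line in NODE_LINE.finditer(manifest_text):
        for single, double, pattern in NAME.findall(line.group(1)):
            literal = single or double
            if literal and literal == fqdn:
                return True
            if pattern:
                try:
                    if re.search(pattern, fqdn):
                        return True
                except re.error:
                    # Ruby-only regex syntax that Python cannot compile: skip it.
                    continue
    return False


# Hypothetical manifest content, for illustration only.
manifest = r"""
node 'cloudbackup2001.example.org' {
    include example
}
node /^db\d{4}\.example\.org$/ {
    include other
}
"""

print(is_declared("cloudbackup2001.example.org", manifest))
print(is_declared("db1001.example.org", manifest))
print(is_declared("unknown.example.org", manifest))
```

A check like this would only tell us that a host is expected to be Puppetized, not that Puppet is actually running on it, so it would probably complement the PuppetDB lookup rather than replace it.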