This issue comes up regularly, especially via the Netbox reports: https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
For example:
* Device with failed status but reachable
* Device is XXX in Netbox but is missing from PuppetDB (should be ('inventory', 'offline', 'planned', 'decommissioning', 'failed'))
* Device is in PuppetDB but with status staged
There are two main reasons why this happens.
The main one is the need for manual status changes, as defined in https://wikitech.wikimedia.org/wiki/Server_Lifecycle (e.g. "The service owner changes the Netbox status to ACTIVE.").
In practice this doesn't work because people forget to make these changes. Most of these oversights get caught by the Netbox reports, but there is an asymmetry between who looks at the reports and who causes the errors (e.g. someone who doesn't think of changing the status won't think of checking the report either). This is exacerbated by the report being virtually always in a failed state, and thus not triggering a new alert on IRC (such alerts being mostly ignored as well).
The ideal fix is to abstract all those status changes through automation, but that's not straightforward because some of the states are subjective and depend on the service owners (e.g. active vs. staged).
As a first step I suggest that we identify on https://wikitech.wikimedia.org/wiki/File:Server_Lifecycle_Statuses.png which transitions are manual vs. automated.
Then, following the [[ https://wikitech.wikimedia.org/wiki/Netbox#Netbox_(and_source_of_truth)_principles | Netbox (and source of truth) principles ]]:
"All data manually entered will go stale" -> "Refrain from adding data that will not drive the infrastructure"
Currently the status is mostly informative; we could make it more compelling by driving production from it. For example, if a server has a FAILED status, use Ferm to block all ports except SSH (just a suggestion, other ideas welcome). Or, if a server is not in the ACTIVE state, don't allow it to be polled by Pybal.
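To illustrate, here is a minimal Python sketch of how the Netbox status could drive such decisions, fetching a device's status via pynetbox. The URL constant, token handling, and the two policy functions are assumptions for illustration, not an actual implementation:

```python
"""Sketch: derive production decisions from the Netbox device status."""
import pynetbox

NETBOX_URL = "https://netbox.wikimedia.org"  # assumed endpoint


def host_status(hostname: str, token: str) -> str:
    """Return the Netbox status slug (e.g. 'active', 'staged', 'failed') for a device."""
    nb = pynetbox.api(NETBOX_URL, token=token)
    device = nb.dcim.devices.get(name=hostname)
    if device is None:
        raise RuntimeError(f"{hostname} not found in Netbox")
    return device.status.value


def should_restrict_firewall(status: str) -> bool:
    # Hypothetical policy: a FAILED host only accepts SSH.
    return status == "failed"


def may_be_pooled(status: str) -> bool:
    # Hypothetical policy: only ACTIVE hosts are eligible for Pybal.
    return status == "active"
```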
Lastly, if the previous two points are not possible (or in addition to them), we should improve alerting and user notifications.
One idea is to use the new export from {T229397} to add a loud and clear MOTD when the server is not in the ACTIVE state. Another is to have a per-server alert in AlertManager instead of the current global report alert.
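As an illustration of the MOTD idea, here is a hedged sketch that reads a locally exported Netbox status and prints a warning when the host is not ACTIVE. The export path and JSON layout are assumptions; the actual format of the export from {T229397} may differ:

```python
"""Sketch of an MOTD snippet warning when the host is not ACTIVE in Netbox."""
import json
import socket
import sys

EXPORT_PATH = "/etc/netbox/host.json"  # hypothetical location of the exported Netbox data


def main() -> int:
    try:
        with open(EXPORT_PATH) as fh:
            status = json.load(fh).get("status", "unknown")
    except (OSError, ValueError):
        return 0  # no export available, stay silent
    if status != "active":
        print("*" * 60)
        print(f"WARNING: {socket.gethostname()} has status '{status.upper()}' in Netbox,")
        print("not ACTIVE. See https://wikitech.wikimedia.org/wiki/Server_Lifecycle")
        print("*" * 60)
    return 0


if __name__ == "__main__":
    sys.exit(main())
```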
The other reason I identified is servers staying "offline" long enough that they get evicted from PuppetDB (and become ghost hosts on the network), for example T306835#8211485.
A possibility here is to have the re-image cookbook automatically set the host status to FAILED if the re-image failed and the status was ACTIVE/STAGED, and then automatically set the status to STAGED once a re-image succeeds from a previous FAILED status.
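A rough Python sketch of those proposed transitions, assuming direct pynetbox access; the run_reimage() placeholder and the token handling are hypothetical, and this is not the actual cookbook code:

```python
"""Sketch of the proposed status transitions around a re-image."""
import pynetbox

NETBOX_URL = "https://netbox.wikimedia.org"  # assumed endpoint


def reimage_with_status_tracking(hostname: str, token: str) -> None:
    nb = pynetbox.api(NETBOX_URL, token=token)
    device = nb.dcim.devices.get(name=hostname)
    previous = device.status.value  # e.g. 'active', 'staged', 'failed'
    try:
        run_reimage(hostname)  # placeholder for the existing re-image steps
    except Exception:
        # Proposed: a failed re-image of an ACTIVE/STAGED host marks it FAILED.
        if previous in ("active", "staged"):
            device.update({"status": "failed"})
        raise
    # Proposed: a successful re-image of a previously FAILED host marks it STAGED.
    if previous == "failed":
        device.update({"status": "staged"})
```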
Thoughts?