Our current IPMI monitoring covers some of the failure scenarios but not all of them. In particular, it doesn't check whether remote IPMI login works, and we recently had some cases of hosts on which remote IPMI was failing for different reasons (wrong configuration, misaligned password).
In the monitoring meeting it was agreed that an infrequent check of remote IPMI login (e.g. once a week) would be acceptable and would not harm the management consoles, which are known to be unstable under continuous access.
It was also agreed that the probability of a host failing and needing remote IPMI while remote IPMI has itself failed within the same one-week period is low enough to be acceptable.
As a precaution, the check should initially be enabled for a small part of the fleet, say 10%, for a few weeks/months to ensure that the management consoles do not fail because of it.
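One possible way to pick that 10% is to hash each hostname into a stable bucket, so the same hosts stay in the trial from one run to the next and the percentage can later be raised without reshuffling. This is only a sketch: the function name `in_rollout` and the example hostnames are hypothetical, and how the selection would actually be wired into the monitoring configuration is not decided here.

```python
import hashlib


def in_rollout(hostname: str, percent: int = 10) -> bool:
    """Return True if hostname falls within the first `percent` of 100 buckets.

    The bucket is derived from a SHA-256 hash of the hostname, so the result
    is deterministic and raising `percent` only ever adds hosts.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Hypothetical hostnames, for illustration only.
fleet = [f"host{i}.example.org" for i in range(1000)]
selected = [h for h in fleet if in_rollout(h)]
print(f"{len(selected)} of {len(fleet)} hosts selected for the trial")
```

Because the selection depends only on the hostname, no state has to be stored to remember which hosts are part of the trial.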
These are the proposed solutions, with their PROs and CONs:
- Add an additional RO user to the IPMI configuration fleet-wide and as part of the provisioning of new hosts, and use it for the check from the monitoring hosts.
  - PRO: secure, direct check from the monitoring hosts
  - CON: doesn't guarantee that the RW user still works even if the check is OK; adds complexity to the provisioning procedure of new hosts and requires an initial sweep to set it on the whole fleet
- Perform the check with the current RW user. For security reasons we don't want this password to be available on the monitoring hosts, and also for security reasons we have NRPE parameters disabled. Talking with @MoritzMuehlenhoff, we agreed that it would be acceptable to have this password available on the management hosts (sarin/neodymium), kept only in memory in a way similar to how keyholder works. Because NRPE parameters are disabled, though, we'll need to expose a service that performs a harmless RO operation on a given hostname's remote IPMI, via HTTP(S) or any other protocol, so that the monitoring hosts can call this service to check the remote IPMI of that host.
  - PRO: checks the same user we'll use for remote IPMI
  - CON: more work, requires writing a small service and puppetizing it; not ideal from the security point of view
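To give an idea of the size of the second option, here is a rough sketch of what such a service could look like, using only the Python standard library and `ipmitool`. Everything specific in it is an assumption made for illustration: the `/check?host=` endpoint, the `root` user name, the port, and the choice of `chassis power status` as the harmless read-only operation. TLS, access control and the keyholder-like handling of the password are left out.

```python
import os
import re
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical values, for illustration only.
IPMI_USER = "root"
HOSTNAME_RE = re.compile(r"[a-z0-9]([a-z0-9.-]{0,251}[a-z0-9])?")


def check_remote_ipmi(hostname, password, runner=subprocess.run):
    """Run a read-only IPMI command against hostname; return (ok, detail)."""
    if not HOSTNAME_RE.fullmatch(hostname):
        return False, "invalid hostname"
    cmd = ["ipmitool", "-I", "lanplus", "-H", hostname, "-U", IPMI_USER,
           "-E", "chassis", "power", "status"]
    # -E makes ipmitool read the password from the IPMI_PASSWORD environment
    # variable, so it never appears on the command line.
    env = dict(os.environ, IPMI_PASSWORD=password)
    try:
        result = runner(cmd, env=env, capture_output=True, text=True, timeout=30)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return False, f"ipmitool failed to run: {exc}"
    if result.returncode != 0:
        return False, result.stderr.strip() or "ipmitool returned an error"
    return True, result.stdout.strip()


def make_handler(password):
    """Build a request handler that keeps the password only in memory."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            url = urlparse(self.path)
            hosts = parse_qs(url.query).get("host", [])
            if url.path != "/check" or len(hosts) != 1:
                self.send_error(400, "expected /check?host=<hostname>")
                return
            ok, detail = check_remote_ipmi(hosts[0], password)
            body = (("OK: " if ok else "CRITICAL: ") + detail + "\n").encode()
            self.send_response(200 if ok else 502)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return Handler


def make_server(password, address="127.0.0.1", port=8080):
    """Build the HTTP server; the caller decides when to serve_forever()."""
    return HTTPServer((address, port), make_handler(password))
```

In this sketch the password would be entered once at startup (for example with `getpass.getpass()`), passed to `make_server()`, and never written to disk; the monitoring hosts would then only need to request `/check?host=<hostname>` and look at the HTTP status.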
Feel free to propose alternative ways of reaching the same goal.