Every time a database host is decommissioned, it should be removed from Orchestrator as documented in https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2FDecommissioning_a_DB_Host#Remove_host_from_orchestrator. This action could potentially be included in the sre.hosts.decommission cookbook. One thing to consider is making sure the removal happens at the right point in the process, so that the host is not re-discovered by Orchestrator afterwards.
Description
Event Timeline
How about we create a mechanism similar to the logout.d scripts, but for decom? Let's say we create a new /etc/wikimedia/decom.d directory where each service can drop a decom script with the steps that ought to be taken when a host running this service gets decommed (in this case the script would be installed on each DB host managed in Orchestrator). These files would get executed locally by the decom cookbook (and would trigger the de-registration on orch1001). This way we can flexibly extend custom decom workflows like this one without tying each of them to a change in the decom cookbook (and also keep the cookbook leaner).
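To make the proposal a bit more concrete, here is a minimal sketch of what the local runner could look like, assuming run-parts-like semantics (executable files only, run in name order). The function names are hypothetical, /etc/wikimedia/decom.d is only the directory proposed above and does not exist today, and none of this is part of the current decom cookbook.

```python
import os
import subprocess
from pathlib import Path

# Directory proposed in this comment; an assumption, it does not exist today.
DECOM_DIR = Path("/etc/wikimedia/decom.d")


def find_decom_scripts(directory=DECOM_DIR):
    """Return the executable regular files in `directory`, sorted by name."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and os.access(path, os.X_OK)
    )


def run_decom_scripts(directory=DECOM_DIR):
    """Run every decom script in order and return {script name: exit code}.

    A failing script does not stop the remaining ones; the caller decides
    what to do with non-zero exit codes.
    """
    results = {}
    for script in find_decom_scripts(directory):
        completed = subprocess.run([str(script)], check=False)
        results[script.name] = completed.returncode
    return results
```

Returning the exit codes rather than raising on the first failure is a design choice for the sketch: it lets the cookbook report which service-specific step failed while still attempting the others.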
@MoritzMuehlenhoff 's proposal is certainly a neat option but I have a couple of worries, namely:
- it might be hard to find the right moment to run those scripts for every service, without adding a lot of "hooks" during the decommissioning process to run the appropriate ones (some might need to run something before the host is removed from puppet, some after, etc...)
- some "decom" actions should be performed from a central host instead of the target hosts because maybe we don't want to open some API to all hosts but only to some central hosts (like the way we remove hosts from debmonitor)
@Marostegui is this request still valid/needed?
If we are going to add these steps I would need to know:
- should we use Orchestrator's API or the CLI on the Orchestrator host?
- what would be the best way to get all the ports the host was registered with in Orchestrator?
- I guess we could do it in all cases, querying Orchestrator to see if the host is there and removing it if so
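As a rough illustration of the API route, the sketch below searches Orchestrator for the host, collects every port it is registered with, and asks Orchestrator to forget each instance. It assumes the `search/<string>` and `forget/<host>/<port>` API endpoints and a response made of instance objects carrying `Key.Hostname` and `Key.Port`; these should be checked against the deployed Orchestrator version. The base URL and all function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; replace with the real Orchestrator API endpoint.
ORCHESTRATOR_API = "https://orchestrator.example.org/api"


def ports_for_host(instances, hostname):
    """Return the sorted ports of all instances whose Key.Hostname matches."""
    return sorted(
        instance["Key"]["Port"]
        for instance in instances
        if instance.get("Key", {}).get("Hostname") == hostname
    )


def forget_urls(api_base, hostname, ports):
    """Build one forget URL per port (endpoint path is an assumption)."""
    quoted = urllib.parse.quote(hostname)
    return [f"{api_base}/forget/{quoted}/{port}" for port in ports]


def fetch_json(url):
    """GET `url` and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def forget_host(hostname, api_base=ORCHESTRATOR_API):
    """Forget every instance of `hostname`; return the ports that were found.

    An empty list means the host was not known to Orchestrator, so this is
    safe to call for every decommissioned host.
    """
    search_url = f"{api_base}/search/{urllib.parse.quote(hostname)}"
    ports = ports_for_host(fetch_json(search_url), hostname)
    for url in forget_urls(api_base, hostname, ports):
        fetch_json(url)
    return ports
```

The search is a substring match in this sketch, which is why `ports_for_host` filters on the exact hostname before anything is forgotten.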
Setting low priority for now as this is an old task.
I think we can probably decline this. Orchestrator forgets the host by itself after 14 days, a window I should probably reduce a bit via the UnseenInstanceForgetHours variable.