Decide whether decom'ing hosts mgmt DNS entry should stay or not
Closed, ResolvedPublic

Description

This came up in conversation today with @ayounsi, @Volans and @jbond. We're now exporting mgmt hostnames for all network devices (i.e. hosts, switches, etc.) from netbox to puppet hiera. This includes devices in decom'ing status, for which we currently don't generate hostname-based mgmt DNS entries (only the asset tag one). As a result we export unresolvable mgmt hostnames to hiera, which I'd like to avoid (I'm using the data to instruct Prometheus to probe the mgmt interfaces).

My understanding is that:

  • The hostname remains in netbox (i.e. still attached to the asset tag, even when in status decom'ing)
  • Not having hostname mgmt DNS entries comes from the previous (i.e. non cookbook/netbox based) workflow

We'd like (please @ayounsi @Volans and @jbond correct/integrate at will) DC-Ops' opinion on the matter, in particular whether leaving mgmt hostnames in DNS even for decom'ing hosts is something that will cause confusion (or some other breakage). When an asset is repurposed the hostname will change in netbox, and consequently DNS will match the new hostname when the cookbooks run.

What do you think? Thank you!

Event Timeline

So, to make a practical example, wtp1027 is in decommissioning state in Netbox, its only interface with an attached IP is the mgmt one (as it should) and its mgmt IP has the DNS Name field set to wtp1027.mgmt.eqiad.wmnet, BUT, in the DNS there is no hostname record wtp1027.mgmt.eqiad.wmnet and there is only the asset tag record: wmf7046.mgmt.eqiad.wmnet.
This is because the sre.dns.netbox cookbook generates only the asset-tag based management DNS record for hosts in decommissioning state and both of them for all other valid states.
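
As a rough illustration of the rule described above (not the actual sre.dns.netbox or netbox-extras code), the record selection can be sketched like this. The function name `mgmt_record_names` and its parameters are hypothetical, and treating every status other than `decommissioning` as "valid" is a simplification:

```python
def mgmt_record_names(hostname, asset_tag, status, site):
    """Return the mgmt DNS names generated for a device (illustrative only)."""
    zone = f"mgmt.{site}.wmnet"
    # The asset-tag based record is generated in every case.
    names = [f"{asset_tag}.{zone}"]
    # The hostname based record is skipped for decommissioning devices.
    if status != "decommissioning":
        names.append(f"{hostname}.{zone}")
    return names


print(mgmt_record_names("wtp1027", "wmf7046", "decommissioning", "eqiad"))
# ['wmf7046.mgmt.eqiad.wmnet']
print(mgmt_record_names("wtp1027", "wmf7046", "active", "eqiad"))
# ['wmf7046.mgmt.eqiad.wmnet', 'wtp1027.mgmt.eqiad.wmnet']
```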

To have consistent hieradata exported from netbox we need to do one of the following: export the asset-tag names for hosts with decommissioning status, have the correct data in Netbox, or keep generating both DNS records regardless of the Netbox status.

@wiki_willy what do you think of the above? does it seem reasonable to have mgmt hostnames for decom'ing hosts in DNS ?

Hi @fgiunchedi - I appreciate you checking with me. Let me sync up with the rest of team during my staff meeting on Thursday in case there's any additional feedback or thoughts they have, and will circle back to you with an answer then.

@fgiunchedi @Volans the only reason we had in the past to keep the asset tag records on servers was that at some point we used to reclaim and reuse servers, but that is no longer the case. We discussed this in our meeting today and agreed that it makes sense to no longer keep the asset tag records.
So when we run the decommission cookbook we can also remove them at the same time.

Thanks.

@Papaul the task request was actually the opposite. Not to remove the current asset tag management DNS names (wmf1234.mgmt.$DC.wmnet) upon decommission, but actually to leave the currently removed hostname management DNS names (foo1001.mgmt.$DC.wmnet) upon decommissioning.

Removing all DNS management names when decommissioning a host seems a step back to me because:

  • The management console (iDRAC/iLO) will keep having that IP and be reachable on it until unracked, just without any associated DNS name
  • Although the Redfish module in Spicerack uses the IPs and not the DNS name, anyone who wants to ssh into a mgmt interface after decommission will no longer have a DNS name to point to, and will be forced to find and use the IP.
  • The monitoring will need to be adapted to use the IP and will be more complex than it is now.

Thanks @Papaul and @Volans for following up -- my understanding is that since dcops would be OK with removing asset tags records then you'd be ok with having asset tags and hostnames records in DNS mgmt until the host is unracked (?)

Change 849495 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/software/netbox-extras@master] dns: generate HOST.mgmt records in all statuses

https://gerrit.wikimedia.org/r/849495

Summary of the meeting I/F tooling and automation had with @Papaul today:

The PoV of DC-Ops is that the current setup requires an additional run of the sre.dns.netbox cookbook when offlining a host just to remove the asset-tag-based DNS record, for little or no value.
Nowadays the chance of re-purposing a decommissioning server is very low; such cases are rare exceptions. The normal workflow for decommissioned hosts is to be offlined (unracked) and recycled.
So the counter-proposal from DC-Ops with the additional detail we discussed today is:

  • clear the DNS name of the mgmt IP in Netbox at decommissioning time
  • so that when the sre.dns.netbox cookbook is run by the decommissioning cookbook it will be removing both DNS records
  • the netbox-hiera export should not export the hosts in decommissioning status so they will just not be monitored, and it should not be a problem given the rare case of repurposing.
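
The three points above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `Host` class, `decommission()` and `hiera_mgmt_hostnames()` are hypothetical names and not the Netbox, cookbook or netbox-hiera APIs, and `foo1001`/`foo1002` are placeholder hosts.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    status: str
    mgmt_dns_name: str  # stands in for the "DNS Name" field of the mgmt IP


def decommission(host):
    """Mark the host as decommissioning and clear its mgmt DNS name."""
    host.status = "decommissioning"
    host.mgmt_dns_name = ""


def hiera_mgmt_hostnames(hosts):
    """Export mgmt hostnames, leaving out hosts in decommissioning status."""
    return [h.mgmt_dns_name for h in hosts if h.status != "decommissioning"]


hosts = [
    Host("foo1001", "active", "foo1001.mgmt.eqiad.wmnet"),
    Host("foo1002", "active", "foo1002.mgmt.eqiad.wmnet"),
]
decommission(hosts[0])
print(hiera_mgmt_hostnames(hosts))  # ['foo1002.mgmt.eqiad.wmnet']
```

With the DNS name cleared in the source of truth, the export no longer contains a name that fails to resolve, so the monitoring needs no special case for these hosts.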

If for any reason we need to connect to the mgmt interface after the decommission, that will still be possible by IP, which is not removed from Netbox at decommissioning time but only at offlining time (the Netbox offline script).
Also the redfish module in Spicerack uses the IPs and not the hostname (for unrelated reasons) already so it will keep working.

Change 849495 abandoned by Filippo Giunchedi:

[operations/software/netbox-extras@master] dns: generate HOST.mgmt records in all statuses

Reason:

As per reasoning on task

https://gerrit.wikimedia.org/r/849495

I have gone ahead and excluded decommissioning hosts from syncing hiera data, will let @Volans take care of the cookbook bits

Change 852738 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] dns: skip mgmt records for decommissioning devices

https://gerrit.wikimedia.org/r/852738

Change 852739 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.decommission: unset mgmt DNS name

https://gerrit.wikimedia.org/r/852739

Change 852738 merged by jenkins-bot:

[operations/software/netbox-extras@master] dns: skip mgmt records for decommissioning devices

https://gerrit.wikimedia.org/r/852738

Change 852806 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] dns: silence log for decommissioned devices

https://gerrit.wikimedia.org/r/852806

I've merged the DNS patch and run the sre.dns.netbox cookbook to remove the related records from our DNS.
Then I've removed the DNS Name from the mgmt interface's IP in Netbox for the related devices with:

>>> import uuid
>>> request_id = uuid.uuid4()
>>> user = User.objects.get(username='volans')
>>> def update(d):
...     for i in d.interfaces.get(name='mgmt').ip_addresses.all():
...         i.dns_name = ''
...         log = i.to_objectchange('update')
...         log.request_id = request_id
...         log.user = user
...         log.save()
...         i.save()
>>> 
>>> devices = Device.objects.filter(status='decommissioning', device_role__slug='server', tenant__isnull=True)
>>> [d.name for d in devices]
['cloudstore1008', 'cp5001', 'ganeti1008', 'ganeti4003', 'restbase2009', 'restbase2010', 'restbase2011', 'torrelay1001', 'wtp1025', 'wtp1026', 'wtp1027', 'wtp1028', 'wtp1029', 'wtp1030', 'wtp1031', 'wtp1032', 'wtp1033', 'wtp1034', 'wtp1035', 'wtp1036', 'wtp1037', 'wtp1038', 'wtp1039', 'wtp1040', 'wtp1041', 'wtp1042', 'wtp1043', 'wtp1044', 'wtp1045', 'wtp1046', 'wtp1047', 'wtp1048']

This is the Netbox changelog. I handled frauth1001 and frlog1001 manually to make sure there were no other side effects, since they are FR-tech hosts.

Volans triaged this task as Medium priority.

Change 852739 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.decommission: unset mgmt DNS name

https://gerrit.wikimedia.org/r/852739

A side effect of the above changes is that the sre.hosts.decommission cookbook now fails if run on a host that has already been decommissioned. I'll try to find a fix for it.

Change 852903 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] ipmi: clarify that the target can also be an IP

https://gerrit.wikimedia.org/r/852903

Change 852955 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.decommission: use mgmt IP if no DNS

https://gerrit.wikimedia.org/r/852955

Change 852955 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.decommission: use mgmt IP if no DNS

https://gerrit.wikimedia.org/r/852955

The fix for the decommissioning cookbook has been deployed: it uses the IP address when the DNS record is no longer present.
This should complete the task, resolving. Feel free to re-open if anything is missing.
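
For readers following along, the fallback can be sketched roughly as below. This is not the code from the cookbook change: `mgmt_target` is a hypothetical helper, and the address is a documentation-range example. It assumes the address is available in Netbox's usual address/prefix form.

```python
import ipaddress


def mgmt_target(dns_name, address):
    """Prefer the mgmt DNS name; fall back to the bare IP when it is unset."""
    if dns_name:
        return dns_name
    # Addresses carry a prefix length (e.g. "192.0.2.10/24"); keep only the host part.
    return str(ipaddress.ip_interface(address).ip)


print(mgmt_target("foo1001.mgmt.eqiad.wmnet", "192.0.2.10/24"))  # foo1001.mgmt.eqiad.wmnet
print(mgmt_target("", "192.0.2.10/24"))  # 192.0.2.10
```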

Change 852903 merged by jenkins-bot:

[operations/software/spicerack@master] ipmi: clarify that the target can also be an IP

https://gerrit.wikimedia.org/r/852903

Change 852806 merged by jenkins-bot:

[operations/software/netbox-extras@master] dns: silence log for decommissioned devices

https://gerrit.wikimedia.org/r/852806

Change 855048 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/netbox-extras@master] reports: Network ignore empty DNS names

https://gerrit.wikimedia.org/r/855048

Change 855048 merged by jenkins-bot:

[operations/software/netbox-extras@master] reports: Network ignore empty DNS names

https://gerrit.wikimedia.org/r/855048