
af-nb-db-2.automation-framework.eqiad.wmflabs has broken network
Closed, Resolved · Public

Description

For T232429: Create in-cloud, cloud-vps-wide cumin masters I was looking at what hosts are *not* currently responding to cumin. One of them is this host with this error:
ssh: connect to host af-nb-db-2.automation-framework.eqiad.wmflabs port 22: No route to host
Now that's interesting: that should never happen for a host under eqiad.wmflabs, ever. It turns out that the IP in Designate for this host is 172.16.6.245, but openstack-browser reveals something strange about the networking setup on this instance:

[openstack-browser screenshot of the instance's networking details]

It has two internal IP addresses listed, the second of them being 172.16.6.244. It turns out that IP *does* function:

krenair@cloud-cumin-01:~$ ssh 172.16.6.244
Permission denied (publickey).

(keys are broken there but that's a minor thing in comparison)

Why and how does this host have multiple internal IPs? What should happen when a host ends up in this state?
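
For reference, a minimal sketch of the kind of cross-check that surfaced this, assuming openstacksdk is available and a clouds.yaml entry with access to the automation-framework project (the cloud name "cloudvps" below is a placeholder, not actual config from this task):

# Sketch: compare the DNS (Designate) record for an instance against the
# fixed IPs Nova reports for it. "cloudvps" is a hypothetical clouds.yaml entry.
import socket
import openstack

FQDN = "af-nb-db-2.automation-framework.eqiad.wmflabs"

conn = openstack.connect(cloud="cloudvps")
server = conn.compute.find_server("af-nb-db-2", ignore_missing=False)
server = conn.compute.get_server(server.id)  # full detail, including addresses

dns_ip = socket.gethostbyname(FQDN)
fixed_ips = [
    a["addr"]
    for addrs in server.addresses.values()
    for a in addrs
    if a.get("OS-EXT-IPS:type") == "fixed"
]

print(f"DNS: {dns_ip}  Nova fixed IPs: {fixed_ips}")
if len(fixed_ips) > 1:
    print("instance has more than one fixed IP")
if dns_ip not in fixed_ips:
    print("Designate record does not match any Nova fixed IP")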

Event Timeline

Nova seems to associate @crusnov with this server.
Addresses data:

{
    'lan-flat-cloudinstances2b': [
        {
            'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:8e:7d:5e',
            'version': 4,
            'addr': '172.16.6.244',
            'OS-EXT-IPS:type': 'fixed'
        },
        {
            'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:f0:8e:7a',
            'version': 4,
            'addr': '172.16.6.245',
            'OS-EXT-IPS:type': 'fixed'
        }
    ]
}
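
The two distinct MAC addresses above suggest two separate Neutron ports/allocations for the same instance. A rough sketch for listing the ports tied to this server's device ID, to see which allocation (if any) still has a live port, under the same placeholder-credentials assumptions as the sketch in the description:

# Sketch: list Neutron ports whose device_id is this server, to see which
# of the two allocations is actually bound and in what state.
import openstack

conn = openstack.connect(cloud="cloudvps")
server = conn.compute.find_server("af-nb-db-2", ignore_missing=False)

for port in conn.network.ports(device_id=server.id):
    ips = [fip["ip_address"] for fip in port.fixed_ips]
    print(port.id, port.mac_address, port.status, ips)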

@crusnov is this instance working at all? If not, could you please try deleting it and, if needed, re-creating it?

@aborrero it looks like arturo-k8s-test-3.openstack.eqiad.wmflabs has also got this issue
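
With a second instance showing the same symptom, a quick scan for other affected VMs might look like the sketch below (same placeholder cloud name; all_projects requires credentials that can list servers across projects, otherwise drop it):

# Sketch: flag servers that carry more than one fixed IP on the same network,
# which is the symptom described in this task.
import openstack

conn = openstack.connect(cloud="cloudvps")

for server in conn.compute.servers(details=True, all_projects=True):
    for net, addrs in (server.addresses or {}).items():
        fixed = [a["addr"] for a in addrs if a.get("OS-EXT-IPS:type") == "fixed"]
        if len(fixed) > 1:
            print(f"{server.name} ({net}): {fixed}")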

This is an error that sometimes happens during VM creation -- I think it's something like...

  1. VM is scheduled
  2. IP is allocated by Neutron
  3. Scheduled VM fails to come up (possibly due to a cloudvirt being offline)
  4. VM is rescheduled
  5. IP is allocated by Neutron

etc.

As far as I know it only happens to brand new VMs so never damages any actual work in progress. And I don't know if the bug is still present in Newton.

Probably best to just delete the affected VM and wait and see if it happens again.
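
If deleting the VM is the chosen route, it may be worth confirming afterwards that no stale Neutron port is left behind holding one of the IPs. A hedged sketch, with the same placeholder cloud/credentials assumptions as above:

# Sketch: delete the affected server, then check for leftover Neutron ports
# still tied to its device ID.
import openstack

conn = openstack.connect(cloud="cloudvps")
server = conn.compute.find_server("af-nb-db-2", ignore_missing=False)
device_id = server.id

conn.compute.delete_server(server)
conn.compute.wait_for_delete(server)

for port in conn.network.ports(device_id=device_id):
    print("leftover port:", port.id, [fip["ip_address"] for fip in port.fixed_ips])
    # conn.network.delete_port(port)  # only if it is clearly orphaned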


VM no longer exists.