Phabricator

af-nb-db-2.automation-framework.eqiad.wmflabs has broken network
Open, Needs Triage, Public

Description

For T232429: Create in-cloud, cloud-vps-wide cumin masters, I was looking at which hosts are *not* currently responding to cumin. One of them is this host, with this error:

ssh: connect to host af-nb-db-2.automation-framework.eqiad.wmflabs port 22: No route to host

Now that's interesting: that should never happen for a host under eqiad.wmflabs, ever. It turns out that the IP in Designate for this host is 172.16.6.245, but openstack-browser reveals something strange about the networking setup on this instance:


It has two internal IP addresses listed, the second of them being 172.16.6.244. It turns out that IP *does* function:

krenair@cloud-cumin-01:~$ ssh 172.16.6.244
Permission denied (publickey).

(Keys are broken there, but that's a minor issue in comparison.)

Why and how does this host have multiple internal IPs? What should happen if a host ends up with this?

Event Timeline

Krenair created this task. · Sep 11 2019, 9:01 PM
Restricted Application added a subscriber: Aklapper. · Sep 11 2019, 9:01 PM
Krenair added a subscriber: crusnov. (Edited) · Sep 11 2019, 9:11 PM

Nova seems to associate @crusnov with this server.
Addresses data:

{
    'lan-flat-cloudinstances2b': [
        {
            'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:8e:7d:5e',
            'version': 4,
            'addr': '172.16.6.244',
            'OS-EXT-IPS:type': 'fixed'
        },
        {
            'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:f0:8e:7a',
            'version': 4,
            'addr': '172.16.6.245',
            'OS-EXT-IPS:type': 'fixed'
        }
    ]
}
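The double allocation is visible directly in that addresses structure: one network with two `fixed` entries. A minimal sketch of how one might flag affected instances, assuming only the dict shape shown above (the function name `find_duplicate_fixed_ips` is hypothetical, not an existing tool):

```python
def find_duplicate_fixed_ips(addresses):
    """Return {network: [ips]} for networks carrying more than one fixed IP.

    `addresses` has the same shape as the Nova addresses data above:
    {network_name: [{'addr': ..., 'OS-EXT-IPS:type': ...}, ...]}.
    """
    flagged = {}
    for network, ports in addresses.items():
        fixed = [p['addr'] for p in ports if p.get('OS-EXT-IPS:type') == 'fixed']
        if len(fixed) > 1:
            flagged[network] = fixed
    return flagged

# The data pasted above, reduced to the relevant fields:
addresses = {
    'lan-flat-cloudinstances2b': [
        {'addr': '172.16.6.244', 'version': 4, 'OS-EXT-IPS:type': 'fixed'},
        {'addr': '172.16.6.245', 'version': 4, 'OS-EXT-IPS:type': 'fixed'},
    ],
}
print(find_duplicate_fixed_ips(addresses))
# {'lan-flat-cloudinstances2b': ['172.16.6.244', '172.16.6.245']}
```

Run across all servers in a project, a check like this would also have caught the second affected instance mentioned below.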
Krenair updated the task description. · Sep 18 2019, 8:13 PM

@crusnov is this instance working at all? If not, could you please try deleting it and, if needed, re-creating it?

@aborrero it looks like arturo-k8s-test-3.openstack.eqiad.wmflabs also has this issue.

Andrew added a subscriber: Andrew. · Thu, Oct 10, 9:37 PM

This is an error that sometimes happens during VM creation -- I think it's something like:

  1. VM is scheduled
  2. IP is allocated by Neutron
  3. Scheduled VM fails to come up (possibly due to a cloudvirt being offline)
  4. VM is rescheduled
  5. Another IP is allocated by Neutron, without the first being released

etc.
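The sequence above can be sketched as a toy simulation -- this is not real Neutron or Nova code, just an illustration of the suspected failure mode, where a failed scheduling attempt is retried without releasing its earlier allocation:

```python
import itertools

class ToyNeutron:
    """Toy IP allocator standing in for Neutron (illustration only)."""
    def __init__(self):
        self._pool = itertools.count(244)          # hand out .244, .245, ...
        self.allocations = {}                      # vm name -> list of IPs

    def allocate(self, vm):
        ip = f'172.16.6.{next(self._pool)}'
        self.allocations.setdefault(vm, []).append(ip)
        return ip

def schedule_vm(neutron, vm, attempts_until_success=2):
    """Each scheduling attempt allocates an IP; a failed attempt is
    rescheduled WITHOUT releasing the earlier allocation."""
    for attempt in range(attempts_until_success):
        ip = neutron.allocate(vm)
        if attempt == attempts_until_success - 1:  # final attempt comes up
            return ip
        # otherwise: VM failed to come up; loop reschedules, IP leaks

neutron = ToyNeutron()
schedule_vm(neutron, 'af-nb-db-2')
print(neutron.allocations['af-nb-db-2'])
# ['172.16.6.244', '172.16.6.245']
```

One failed attempt followed by one successful one leaves the VM holding two fixed IPs, matching the addresses data earlier in this task.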

As far as I know it only happens to brand-new VMs, so it never damages any actual work in progress. And I don't know whether the bug is still present in Newton.

Probably best to just delete the affected VM and wait and see if it happens again.