
Neutron metadata service failing for all VMs
Closed, Resolved · Public

Description

# curl http://169.254.169.254/openstack
<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>

This isn't a big deal for running machines but is preventing the build of new ones.
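A minimal sketch of how the check above could be scripted from inside a VM, so a 502 is told apart from an unreachable endpoint. The function names probe_metadata and classify_status are hypothetical helpers, not part of any existing tooling; the URL is the one from the description.

```shell
# Hypothetical helper: map an HTTP status code to a short label.
classify_status() {
    case "$1" in
        2??) echo "ok" ;;
        502) echo "bad-gateway" ;;   # proxy answered, backend response invalid
        000) echo "unreachable" ;;   # curl prints 000 when no HTTP response was received
        *)   echo "unexpected" ;;
    esac
}

# Hypothetical helper: print only the HTTP status code of the metadata endpoint.
# The 5-second timeout is an arbitrary choice for illustration.
probe_metadata() {
    curl -s -o /dev/null --max-time 5 -w '%{http_code}' \
        "${1:-http://169.254.169.254/openstack}"
}
```

Running classify_status "$(probe_metadata)" on an affected VM would be expected to print bad-gateway while the problem persists.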

I think this is related to (or the same as?) the issue with metadata agents showing up as 'not alive' in openstack network agent list.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Andrew triaged this task as High priority. May 31 2025, 11:22 PM

Stopping nova-api-metadata and all cloudcontrols doesn't affect the behavior. From this I conclude that the issue is upstream of the actual metadata service, something in neutron.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-05-31T23:24:06Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for service: project,neutron (T395742)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-05-31T23:36:15Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for service: project,neutron (T395742)

Mentioned in SAL (#wikimedia-cloud) [2025-05-31T23:38:06Z] <andrewbogott> failing over from cloudnet1005 to 1006 in hopes of unsticking T395742

That failover seems to have resolved things. There were a lot of rabbitmq connection errors in the logs, so my proposed explanation is that critical bits of neutron on 1005 were unable to learn about new VMs due to rabbit issues and then never reconnected to rabbit, which is an ongoing issue I see with oslo.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-05-31T23:44:06Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.restart_openstack on deployment eqiad1 for all services (T395742)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-05-31T23:51:55Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.restart_openstack (exit_code=0) on deployment eqiad1 for all services (T395742)

This is happening again; going to see if it's resolved the same way.

Actually, restarting the neutron metadata agent seems to have done the trick:

root@cloudnet1006:~# systemctl restart neutron-metadata-agent.service

I just had to restart the service again, in both eqiad1 and codfw1dev.

Andrew renamed this task from Nova metadata service failing for all VMs to Neutron metadata service failing for all VMs. Jun 26 2025, 2:11 PM

I /think/ it is the same bug. But you're right, the bug as described more closely resembles T395255.

The fix for T395255 did not resolve the intermittent crashes here.

Change #1172656 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] neutron metadata agent: restart service daily

https://gerrit.wikimedia.org/r/1172656

Change #1172656 merged by Andrew Bogott:

[operations/puppet@production] neutron metadata agent: restart service daily

https://gerrit.wikimedia.org/r/1172656

Change #1173352 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] openstack.neutron.metadata_agent: increase the number of open files

https://gerrit.wikimedia.org/r/1173352

Change #1173352 merged by David Caro:

[operations/puppet@production] openstack.neutron.metadata_agent: increase the number of open files

https://gerrit.wikimedia.org/r/1173352

Extended the open files limit for neutron-metadata-agent:

root@cloudnet1006:~# systemctl show neutron-metadata-agent.service | grep LimitNOFILE
LimitNOFILE=67107840
LimitNOFILESoft=67107840

Let's see if that helps
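To judge whether the raised limit matters, it may help to compare the number of descriptors the agent actually holds against its soft limit. The sketch below assumes the concern is file descriptor exhaustion, which this task does not confirm, and reads only standard Linux /proc files. The function names are hypothetical, and only the unit's main process is inspected, so any worker processes it forks would need to be checked separately.

```shell
# Count open file descriptors of a process via /proc (Linux only).
open_fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Soft "Max open files" limit currently applied to a process.
soft_nofile_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits"
}

# Print descriptor usage for a systemd unit's main process.
# Needs root (or the same user) to read another process's /proc/<pid>/fd.
report_fd_usage() {
    pid=$(systemctl show -p MainPID --value "$1")
    echo "$1 pid=$pid fds=$(open_fd_count "$pid") soft_limit=$(soft_nofile_limit "$pid")"
}
```

For example, report_fd_usage neutron-metadata-agent.service could be run periodically on the cloudnet host; a steadily climbing fds value would suggest a descriptor leak rather than a limit that was simply too low.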

Change #1173388 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] neutron metadata agent: restart service daily

https://gerrit.wikimedia.org/r/1173388

Change #1178062 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] neutron metadata agent: remove service restart

https://gerrit.wikimedia.org/r/1178062

Change #1173388 abandoned by Andrew Bogott:

[operations/puppet@production] neutron metadata agent: restart service daily

Reason:

not needed

https://gerrit.wikimedia.org/r/1173388

Change #1178062 merged by Andrew Bogott:

[operations/puppet@production] neutron metadata agent: remove service restart

https://gerrit.wikimedia.org/r/1178062

Andrew reassigned this task from Andrew to dcaro.