codfw1dev has seen neutron metadata agents down since epoxy upgrade
Closed, Resolved (Public)

Description

taavi@cloudcontrol2004-dev ~ $ os network agent show 503a6978-1545-47e7-9272-8be3e1140825
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field             | Value                                                                                                                                                                                  |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up    | UP                                                                                                                                                                                     |
| agent_type        | Metadata agent                                                                                                                                                                         |
| alive             | XXX                                                                                                                                                                                    |
| availability_zone | None                                                                                                                                                                                   |
| binary            | neutron-metadata-agent                                                                                                                                                                 |
| configuration     | {'log_agent_heartbeats': True, 'metadata_proxy_socket': '/var/lib/neutron/metadata_proxy', 'nova_metadata_host': 'openstack.codfw1dev.wikimediacloud.org', 'nova_metadata_port': 8775} |
| created_at        | 2022-04-25 21:41:53                                                                                                                                                                    |
| description       | None                                                                                                                                                                                   |
| ha_state          | None                                                                                                                                                                                   |
| host              | cloudnet2005-dev                                                                                                                                                                       |
| id                | 503a6978-1545-47e7-9272-8be3e1140825                                                                                                                                                   |
| last_heartbeat_at | 2025-05-07 19:46:13                                                                                                                                                                    |
| resources_synced  | None                                                                                                                                                                                   |
| started_at        | 2025-05-07 13:13:44                                                                                                                                                                    |
| topic             | N/A                                                                                                                                                                                    |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
taavi@cloudcontrol2004-dev ~ $ os network agent show ac55fc68-6811-43eb-9d1c-f0a22f42eb18
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field             | Value                                                                                                                                                                                  |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up    | UP                                                                                                                                                                                     |
| agent_type        | Metadata agent                                                                                                                                                                         |
| alive             | XXX                                                                                                                                                                                    |
| availability_zone | None                                                                                                                                                                                   |
| binary            | neutron-metadata-agent                                                                                                                                                                 |
| configuration     | {'log_agent_heartbeats': True, 'metadata_proxy_socket': '/var/lib/neutron/metadata_proxy', 'nova_metadata_host': 'openstack.codfw1dev.wikimediacloud.org', 'nova_metadata_port': 8775} |
| created_at        | 2022-04-25 21:41:03                                                                                                                                                                    |
| description       | None                                                                                                                                                                                   |
| ha_state          | None                                                                                                                                                                                   |
| host              | cloudnet2006-dev                                                                                                                                                                       |
| id                | ac55fc68-6811-43eb-9d1c-f0a22f42eb18                                                                                                                                                   |
| last_heartbeat_at | 2025-05-07 19:37:22                                                                                                                                                                    |
| resources_synced  | None                                                                                                                                                                                   |
| started_at        | 2025-05-07 13:13:24                                                                                                                                                                    |
| topic             | N/A                                                                                                                                                                                    |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The timing matches the Epoxy upgrade very closely, so I'm marking this as a blocker for the eqiad1 upgrade.

Already tried restarting the agent etc.

Event Timeline

taavi triaged this task as Medium priority. (May 26 2025, 2:02 PM)
taavi added a project: Upstream.

I think this is a Neutron bug. https://review.opendev.org/c/openstack/neutron/+/942916, first included in Neutron 26.0.0 (i.e. Epoxy), removed the call to self._init_state_reporting() in neutron/agent/metadata/agent.py.
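For anyone following along, here is a rough sketch of why a missing state report would show up as `alive = XXX`. It is not Neutron's actual code. It assumes the server treats an agent as down once its last heartbeat is older than a threshold, and it uses 75 seconds because that is the documented default of Neutron's `agent_down_time` option; the function name `is_agent_alive` is made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: mirrors Neutron's agent_down_time option (documented default: 75 s).
AGENT_DOWN_TIME = timedelta(seconds=75)


def is_agent_alive(last_heartbeat_at: str, now: datetime,
                   down_time: timedelta = AGENT_DOWN_TIME) -> bool:
    """Return True if the last heartbeat is recent enough to count as alive.

    last_heartbeat_at uses the format shown by `os network agent show`,
    e.g. "2025-05-07 19:46:13".
    """
    heartbeat = datetime.strptime(last_heartbeat_at, "%Y-%m-%d %H:%M:%S")
    return now - heartbeat <= down_time


if __name__ == "__main__":
    # Heartbeat from cloudnet2005-dev, checked on the day this was triaged.
    print(is_agent_alive("2025-05-07 19:46:13", datetime(2025, 5, 26, 14, 2)))
```

If the agent never starts its periodic state report, `last_heartbeat_at` stops advancing and a check like this fails no matter how healthy the process itself is.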

taavi removed taavi as the assignee of this task. (May 27 2025, 11:50 AM)

Just adding that call back doesn't quite work. My best guess is that this is because oslo.service in Epoxy doesn't have a non-Eventlet backend, while Neutron has already migrated to something else.

Thank you for noticing and logging this! Do you know what user-facing symptoms result from this issue? Seems like we need to add something to our network test suite.

I'm not sure there are any user-facing symptoms; the metadata server itself is functioning properly. However, the Neutron cookbooks get upset, since an agent being down makes them think the node is unhealthy and thus unsafe for any intrusive operations.
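As a rough idea of what a test-suite check could look like, here is a minimal sketch that flags down agents from the output of `openstack network agent list -f json`. This is not the cookbooks' actual logic. The key names `Alive`, `Host` and `Binary` are my assumption about the JSON column names, and the helper `down_agents` is hypothetical; it accepts either a boolean or the `:-)` / `XXX` strings from the table view to be safe.

```python
import json


def down_agents(agent_list_json: str) -> list[tuple[str, str]]:
    """Return (host, binary) pairs for every agent not reported as alive.

    Assumption: input is the JSON from `openstack network agent list -f json`,
    with "Alive", "Host" and "Binary" keys per agent.
    """
    down = []
    for agent in json.loads(agent_list_json):
        alive = agent.get("Alive")
        # Accept a JSON boolean or the smiley used in the table output.
        if alive is True or alive == ":-)":
            continue
        down.append((agent.get("Host", "?"), agent.get("Binary", "?")))
    return down


if __name__ == "__main__":
    sample = json.dumps([
        {"Host": "cloudnet2005-dev", "Binary": "neutron-metadata-agent", "Alive": False},
        {"Host": "cloudnet2005-dev", "Binary": "neutron-l3-agent", "Alive": True},
    ])
    for host, binary in down_agents(sample):
        print(f"{binary} on {host} is down")
```

A check along these lines would have caught this task's symptom, though it only tests the agent's heartbeat and says nothing about whether the metadata service is actually answering requests.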

After discussion we are moving ahead with the Epoxy upgrade, but this is still of interest!

Zigo has built us some new Epoxy neutron packages (2:26.0.0-9~bpo12+1) which include the upstream fix for this. These packages are running on cloudnet200[56] and seem to resolve the issue. If things are stable there for a while I'll install the same packages in eqiad1.

It remains to be seen if this fixes the random crashes in T395742; seems unlikely, although I'm hopeful that the large threading rewrite in F will fix it.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T16:02:03Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1005.eqiad.wmnet' (T395255)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T16:11:25Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet1005.eqiad.wmnet' (T395255)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T16:16:34Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node on host 'cloudnet1006.eqiad.wmnet' (T395255)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-21T16:25:41Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) on host 'cloudnet1006.eqiad.wmnet' (T395255)

Andrew claimed this task.

I've installed the latest epoxy/neutron packages on cloudnet hosts.

This wasn't applied yet on codfw1dev but now it is.