Page MenuHomePhabricator

neutron-keepalived-state-change debug turned on, filling logs
Closed, ResolvedPublic

Description

The neutron.conf file does not have debug logging turned on. Nonetheless, syslog is full of things like this:

Jan 18 00:16:57 cloudnet1005 neutron-keepalived-state-change: 2023-01-18 00:16:57.922 3355 DEBUG neutron.agent.linux.ip_lib [-] IP monitor qrouter-d93771ba-2711-4f88-804a-8df6fd03978a; Adding IP address: {'family': 2, '__pad': (), 'ifindex': 3, 'state': 16, 'flags': 0, 'ndm_type': 1, 'attrs': [('NDA_DST', '172.16.7.73'), ('NDA_LLADDR', 'fa:16:3e:5f:59:1d'), ('NDA_PROBES', 1), ('NDA_CACHEINFO', {'ndm_confirmed': 4582, 'ndm_used': 241, 'ndm_updated': 0, 'ndm_refcnt': 2})], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'qrouter-d93771ba-2711-4f88-804a-8df6fd03978a', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'} to the queue. read_ip_updates /usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py:1482

So full that the drive nearly filled on on cloudnet1005. I just failed over to 1006 and now the drive is filling up there as well.

I don't think anything is actually going wrong, other than the crazy loglines.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Andrew triaged this task as Unbreak Now! priority.Jan 20 2023, 2:25 AM

I papered over this for the night by marking out the debug log lines in neutron/agent/linux/ip_lib.py

This still needs a real fix; I can't at all tell why debug is turned on and it's getting too late for me to sort out tonight. With the hack in place we won't fill up the drive before I have time to dig deeper.

Some good news is that the same is happening in codfw, so at least the error is consistent across the fleet.

I've reached a couple of conclusions:

  • the flood of logs mostly happen on the l3 agent that is acting as 'active'.
  • I suspect the log messages are legit. Meaning there is something unhealthy thing going on somewhere. It is difficult to pinpoint a given moment in time because journalctl logs have been trimmed.
  • the neutron-keepalived-state-change component runs in its own thread apparently. Some misconfiguration regarding the logging module might be present somewhere. Likely an upstream bug. I see from older logs that every time there is a message from that component, it is a DEBUG message, even in healthy scenarios.
  • the 'standby' node shows mostly rabbitmq connectivity problems in the journalctl logs. So this may as well be rabbitmq at the root of the problem.

Mentioned in SAL (#wikimedia-cloud) [2023-01-20T10:12:26Z] <arturo> [codfw1dev] failover neutron-l3-agent between cloudnet2005-dev/cloudnet2006-dev a couple of times T327463

Regarding why neutron-keepaliaved-state-change is so busy, I think iot is responding to normal ARP/neigbourg network changes.

This can be observed by monitoring the netlink events on the l3 router namespace, and correlating the events with what neutron-keepaliaved-state-change is reporting:

aborrero@cloudnet2005-dev:~ $ sudo journalctl -u neutron-l3-agent -f & sudo ip netns exec qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad ip monitor all
[NEIGH]172.16.128.141 dev qr-21e10025-d4 lladdr fa:16:3e:ec:1b:ee PROBE
[NEIGH]172.16.128.19 dev qr-21e10025-d4 lladdr fa:16:3e:81:56:dc PROBE
[NEIGH]172.16.128.19 dev qr-21e10025-d4 lladdr fa:16:3e:81:56:dc REACHABLE
[NEIGH]172.16.128.141 dev qr-21e10025-d4 lladdr fa:16:3e:ec:1b:ee REACHABLE
[NEIGH]185.15.57.23 dev qg-1290224c-b1  FAILED
Jan 20 11:30:35 cloudnet2005-dev neutron-keepalived-state-change[3397425]: 2023-01-20 11:30:35.074 3397425 DEBUG neutron.agent.linux.ip_lib [-] IP monitor qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad; Adding IP address: {'family': 2, '__pad': (), 'ifindex': 3, 'state': 16, 'flags': 0, 'ndm_type': 1, 'attrs': [('NDA_DST', '172.16.128.141'), ('NDA_LLADDR', 'fa:16:3e:ec:1b:ee'), ('NDA_PROBES', 1), ('NDA_CACHEINFO', {'ndm_confirmed': 3071, 'ndm_used': 316, 'ndm_updated': 0, 'ndm_refcnt': 2})], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'} to the queue. read_ip_updates /usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py:1482
Jan 20 11:30:35 cloudnet2005-dev neutron-keepalived-state-change[3397425]: 2023-01-20 11:30:35.075 3397425 DEBUG neutron.agent.linux.ip_lib [-] IP monitor qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad; Adding IP address: {'family': 2, '__pad': (), 'ifindex': 3, 'state': 16, 'flags': 0, 'ndm_type': 1, 'attrs': [('NDA_DST', '172.16.128.19'), ('NDA_LLADDR', 'fa:16:3e:81:56:dc'), ('NDA_PROBES', 1), ('NDA_CACHEINFO', {'ndm_confirmed': 2176, 'ndm_used': 8, 'ndm_updated': 0, 'ndm_refcnt': 2})], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'} to the queue. read_ip_updates /usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py:1482
Jan 20 11:30:35 cloudnet2005-dev neutron-keepalived-state-change[3397425]: 2023-01-20 11:30:35.076 3397425 DEBUG neutron.agent.linux.ip_lib [-] IP monitor qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad; Adding IP address: {'family': 2, '__pad': (), 'ifindex': 3, 'state': 2, 'flags': 0, 'ndm_type': 1, 'attrs': [('NDA_DST', '172.16.128.19'), ('NDA_LLADDR', 'fa:16:3e:81:56:dc'), ('NDA_PROBES', 1), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 8, 'ndm_updated': 0, 'ndm_refcnt': 2})], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'} to the queue. read_ip_updates /usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py:1482
Jan 20 11:30:35 cloudnet2005-dev neutron-keepalived-state-change[3397425]: 2023-01-20 11:30:35.081 3397425 DEBUG neutron.agent.linux.ip_lib [-] IP monitor qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad; Adding IP address: {'family': 2, '__pad': (), 'ifindex': 3, 'state': 2, 'flags': 0, 'ndm_type': 1, 'attrs': [('NDA_DST', '172.16.128.141'), ('NDA_LLADDR', 'fa:16:3e:ec:1b:ee'), ('NDA_PROBES', 1), ('NDA_CACHEINFO', {'ndm_confirmed': 0, 'ndm_used': 316, 'ndm_updated': 0, 'ndm_refcnt': 2})], 'header': {'length': 76, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'} to the queue. read_ip_updates /usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py:1482
Jan 20 11:30:35 cloudnet2005-dev neutron-keepalived-state-change[3397425]: 2023-01-20 11:30:35.106 3397425 DEBUG neutron.agent.linux.ip_lib [-] IP monitor qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad; Adding IP address: {'family': 2, '__pad': (), 'ifindex': 4, 'state': 32, 'flags': 0, 'ndm_type': 1, 'attrs': [('NDA_DST', '185.15.57.23'), ('NDA_PROBES', 6), ('NDA_CACHEINFO', {'ndm_confirmed': 480686, 'ndm_used': 307, 'ndm_updated': 0, 'ndm_refcnt': 1})], 'header': {'length': 64, 'type': 28, 'flags': 0, 'sequence_number': 0, 'pid': 0, 'error': None, 'target': 'qrouter-5712e22e-134a-40d3-a75a-1c9b441717ad', 'stats': Stats(qsize=0, delta=0, delay=0)}, 'event': 'RTM_NEWNEIGH'} to the queue. read_ip_updates /usr/lib/python3/dist-packages/neutron/agent/linux/ip_lib.py:1482

After reading the code, I think I understand what is going on here:

This seems like normal operations to me. There is no abnormal functions here other than the forced debug log.

Will patch that.

What I can't explain is why this didn't happen until today.

Change 881866 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: neutron: keepalived_state_change.py to avoid debug log flood

https://gerrit.wikimedia.org/r/881866

Change 881866 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: neutron: keepalived_state_change.py to avoid debug log flood

https://gerrit.wikimedia.org/r/881866

Mentioned in SAL (#wikimedia-cloud) [2023-01-20T13:15:51Z] <arturo> reinstall python3-neutron (to reset manual patching) on all cloudnet nodes and patch it via puppet, then restart neutron-l3-agent by hand (T327463)

aborrero lowered the priority of this task from Unbreak Now! to High.

I'm pretty sure that this has been happening since the Z upgrade, and I just didn't notice until the disk-space monitoring triggered.