
cloudnet: prometheus node exporter stopped collecting nf_conntrack_entries metric
Closed, Resolved · Public

Description

On 2020-05-21 we reimaged cloudnet servers from Stretch to Buster.

Since then, the prometheus node exporter has been collecting bogus nf_conntrack metrics, probably because the kernel upgrade changed how these values are shared between network namespaces. The important nf_conntrack values live inside the auto-generated neutron network namespace, and therefore:

aborrero@prometheus1003:~ $ curl cloudnet1003.eqiad.wmnet:9100/metrics | grep nf_conntrack
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 0
[..]

Our grafana panels showed no useful information: https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?panelId=32&fullscreen&orgId=1

A quick check on the server suggests we were indeed having network problems.

aborrero@cloudnet1003:~ $ sudo dmesg -T | tail
[Sat Jul  4 15:35:48 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[Sat Jul  4 15:35:53 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[Sat Jul  4 15:57:29 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[..]

Actionables:

  • improve how we manage these values using puppet. Puppet only checks the files on disk (/etc/sysctl.d/), not the live values in the kernel, so we need some logic to ensure the desired values are actually loaded by the running kernel
  • namespace awareness. Neutron works in auto-generated namespaces. The nf_conntrack_buckets setting applies to all namespaces, while nf_conntrack_max is per-namespace; if not specified, it is computed as nf_conntrack_buckets * 4 (see the sketch after this list)
  • introduce new prometheus metrics, with namespace awareness
  • introduce icinga checks
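
For reference, a quick way to cross-check these relationships from a shell. The namespace name is a placeholder and the buckets value is only an illustration of the * 4 default, not a value taken from our hosts:

$ sudo sysctl net.netfilter.nf_conntrack_buckets              # system-wide, shared by all namespaces
net.netfilter.nf_conntrack_buckets = 262144                   # illustrative: 262144 * 4 = 1048576
$ sudo ip netns exec <qrouter-netns> sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 1048576                      # per-namespace default derived from buckets
$ grep -r nf_conntrack /etc/sysctl.d/                         # what puppet wrote to disk...
$ sudo sysctl net.netfilter.nf_conntrack_max                  # ...versus what the running kernel uses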

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-07-09T09:16:07Z] <arturo> manually increasing sysctl value of net.nf_conntrack_max in cloudnet servers (T257552)
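
For the record, that kind of manual bump is roughly the following; the target value is illustrative and the namespace name is a placeholder:

$ sudo sysctl -w net.nf_conntrack_max=1048576                                            # root namespace
$ sudo ip netns exec <qrouter-netns> sysctl -w net.netfilter.nf_conntrack_max=1048576    # neutron netns has its own, independent limit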

Note that we might only be interested in the conntrack state inside the neutron router netns:

aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a sysctl -a | egrep nf_conntrack_count\|nf_conntrack_max
net.netfilter.nf_conntrack_count = 172198
net.netfilter.nf_conntrack_max = 1048576
aborrero@cloudnet1003:~ $ sudo sysctl -a | egrep nf_conntrack_count\|nf_conntrack_max
net.netfilter.nf_conntrack_count = 0
net.netfilter.nf_conntrack_max = 1048576
net.nf_conntrack_max = 1048576

aborrero triaged this task as High priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

I confirm this started happening when we reimaged the cloudnet servers to buster on 2020-05-21.

Also, the sysctl config we deploy via puppet (modules/openstack/manifests/neutron/l3_agent.pp) only drops the file into the filesystem; it doesn't apply it.
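
A possible way to make the on-disk values take effect without a reboot; the fragment file name below is an assumption, not the real path puppet manages:

$ sudo sysctl --system                                        # re-read every fragment under /etc/sysctl.d/
$ sudo sysctl -p /etc/sysctl.d/99-neutron-l3-agent.conf       # or apply just the neutron fragment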

aborrero renamed this task from prometheus: node exporter stopped collecting nf_conntrack_entries metric to cloudnet: prometheus node exporter stopped collecting nf_conntrack_entries metric. Jul 9 2020, 10:46 AM

Mentioned in SAL (#wikimedia-cloud) [2020-07-09T11:11:53Z] <arturo> [codfw1dev] rebooting cloudnet2003-dev for testing sysctl/puppet behavior (T257552)

Mentioned in SAL (#wikimedia-cloud) [2020-07-09T11:23:47Z] <arturo> [codfw1dev] rebooting cloudnet2003-dev again for testing sysctl/puppet behavior (T257552)

Change 610775 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: network nodes: explicitly load nf_conntrack module on boot

https://gerrit.wikimedia.org/r/610775
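
The change itself is in gerrit and not reproduced here; a minimal sketch of the same idea, with an assumed file name, would be:

$ echo nf_conntrack | sudo tee /etc/modules-load.d/nf_conntrack.conf   # load the module at boot (file name is an assumption)
$ sudo modprobe nf_conntrack                                           # and load it right now, without a reboot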

Change 610775 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: network nodes: explicitly load nf_conntrack module on boot

https://gerrit.wikimedia.org/r/610775

aborrero lowered the priority of this task from High to Medium. Jul 9 2020, 11:56 AM

There are a few other issues I'm suddenly wondering whether this has been affecting.

Yes, this could have been silently affecting other stuff. Which ones do you have in mind? Do we have phab tasks for them?

Change 611262 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: add prometheus neutron conntrack collector

https://gerrit.wikimedia.org/r/611262
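
The merged collector is in gerrit and not reproduced here; purely as an illustration of the idea, a textfile-style collector could look roughly like this (metric names, labels and the textfile directory are all assumptions):

#!/bin/bash
# sketch only, not the code from the gerrit change above:
# export conntrack counters from every neutron qrouter namespace in prometheus exposition format
out=/var/lib/prometheus/node.d/neutron_conntrack.prom        # textfile collector path is an assumption
tmp="${out}.$$"
for ns in $(ip netns list | awk '/^qrouter-/ {print $1}'); do
    count=$(ip netns exec "$ns" sysctl -n net.netfilter.nf_conntrack_count)
    max=$(ip netns exec "$ns" sysctl -n net.netfilter.nf_conntrack_max)
    echo "neutron_nf_conntrack_entries{netns=\"$ns\"} $count"
    echo "neutron_nf_conntrack_entries_limit{netns=\"$ns\"} $max"
done > "$tmp" && mv "$tmp" "$out"                            # write atomically so the exporter never reads a partial file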

Change 611262 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: add prometheus neutron conntrack collector

https://gerrit.wikimedia.org/r/611262

Change 611278 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: prometheus neutron collector: add "" characters for label values

https://gerrit.wikimedia.org/r/611278
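
For context on why the quotes matter: in the prometheus exposition format, label values must be double-quoted strings. Using the hypothetical metric name from the sketch above:

# rejected by prometheus: unquoted label value
neutron_nf_conntrack_entries{netns=qrouter-d93771ba} 172198
# valid: label value is double-quoted
neutron_nf_conntrack_entries{netns="qrouter-d93771ba"} 172198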

Change 611278 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: prometheus neutron collector: add "" characters for label values

https://gerrit.wikimedia.org/r/611278

Change 611285 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: prometheus neutron exporter: cleanup log messages

https://gerrit.wikimedia.org/r/611285

Change 611285 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: prometheus neutron exporter: cleanup log messages

https://gerrit.wikimedia.org/r/611285

I was thinking of T249035: Requests to production are sometimes timing out or giving empty response, because it's mysterious. It's hard to find phab tasks for other bits of strangeness because they have tended to be scattered and possibly transient.

The timeline doesn't match. That ticket was opened on 2020-03-31 and I'm pretty sure the issue here started on 2020-05-21.

Change 612390 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: add NRPE plugin to check nf_conntrack status

https://gerrit.wikimedia.org/r/612390
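
The plugin itself lives in gerrit; a rough sketch of that kind of NRPE check, with assumed thresholds and only the first qrouter namespace checked for brevity, could be:

#!/bin/bash
# sketch only, not the plugin from the gerrit change above:
# warn/crit when conntrack table usage inside a neutron qrouter netns crosses a percentage threshold
WARN=80 CRIT=90
ns=$(ip netns list | awk '/^qrouter-/ {print $1; exit}')
count=$(ip netns exec "$ns" sysctl -n net.netfilter.nf_conntrack_count)
max=$(ip netns exec "$ns" sysctl -n net.netfilter.nf_conntrack_max)
pct=$(( count * 100 / max ))
if   [ "$pct" -ge "$CRIT" ]; then echo "CRITICAL: ${pct}% of nf_conntrack_max used in $ns"; exit 2
elif [ "$pct" -ge "$WARN" ]; then echo "WARNING: ${pct}% of nf_conntrack_max used in $ns"; exit 1
else echo "OK: ${pct}% of nf_conntrack_max used in $ns"; exit 0
fi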

Mentioned in SAL (#wikimedia-cloud) [2020-07-14T10:43:05Z] <arturo> icinga downtime cloudnet* hosts for 30 mins to introduce new check https://gerrit.wikimedia.org/r/c/operations/puppet/+/612390 (T257552)

Change 612390 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: add NRPE plugin to check nf_conntrack status

https://gerrit.wikimedia.org/r/612390

Change 612537 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: node_neutron_namespace: disable explicit monitoring

https://gerrit.wikimedia.org/r/612537

Change 612537 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: node_neutron_namespace: disable explicit monitoring

https://gerrit.wikimedia.org/r/612537

Change 612545 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: monitor: neutron: nf_conntrack: run the NRPE check as root using sudo

https://gerrit.wikimedia.org/r/612545
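
For reference, reading the counters inside the netns requires root, and the usual pattern is a sudoers drop-in for the NRPE user; the file and plugin paths below are assumptions, not the ones from the gerrit change:

# /etc/sudoers.d/nrpe-neutron-conntrack  (file and plugin names are assumptions)
nagios ALL = (root) NOPASSWD: /usr/local/lib/nagios/plugins/check_neutron_conntrack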

Change 612545 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: monitor: neutron: nf_conntrack: run the NRPE check as root using sudo

https://gerrit.wikimedia.org/r/612545

I have some doubts about making the new icinga check critical (and letting it page). I'm just unsure, so I will leave it as is for now (non-critical) and see how things evolve.
@Bstorm might have an opinion.

Other than that, all the work here is done, so closing task now.