On 2020-05-21 we reimaged cloudnet servers from Stretch to Buster.
Since then, the Prometheus node exporter has been collecting bogus nf_conntrack metrics, probably because the upgraded Linux kernel changed how these values are shared between network namespaces. The important nf_conntrack values live inside the auto-generated Neutron network namespace, so the exporter reports zeros:
```lang=shell-session
aborrero@prometheus1003:~ $ curl cloudnet1003.eqiad.wmnet:9100/metrics | grep nf_conntrack
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 0
[..]
```
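The `curl | grep` check above can also be scripted, e.g. to flag when the root-namespace gauge reads zero (which, per the above, signals the exporter is looking outside the Neutron namespace). A minimal sketch; the helper names are ours, and the sample text is the exporter output shown above:

```lang=python
def grep_metrics(text: str, needle: str) -> list[str]:
    """Return exporter output lines containing the given substring,
    equivalent to piping the metrics page through grep."""
    return [line for line in text.splitlines() if needle in line]

# Sample taken from the exporter output shown above.
sample = """# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 0
node_memory_MemFree_bytes 1.0
"""

for line in grep_metrics(sample, "nf_conntrack"):
    print(line)
```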
Our Grafana panels showed no useful information: https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?panelId=32&fullscreen&orgId=1
A quick check on the server confirmed we were indeed having network problems:
```lang=shell-session
aborrero@cloudnet1003:~ $ sudo dmesg -T | tail
[Sat Jul 4 15:35:48 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[Sat Jul 4 15:35:53 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[Sat Jul 4 15:57:29 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[..]
```
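The `table full, dropping packet` messages above appear once the conntrack table is exhausted, i.e. when `nf_conntrack_count` reaches `nf_conntrack_max`. A small sketch of the utilization check one would want a metric or alert on (the function name is ours; on a live host the inputs would come from `/proc/sys/net/netfilter/`):

```lang=python
def conntrack_utilization(count: int, max_entries: int) -> float:
    """Fraction of the conntrack table in use.
    At 1.0 the kernel starts logging 'table full, dropping packet'."""
    return count / max_entries

# Hypothetical values for illustration, not read from a real host.
print(conntrack_utilization(131072, 262144))  # half full
print(conntrack_utilization(262144, 262144))  # table full, packets dropped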
Actionables:
[x] improve how we manage these values with puppet. Puppet only checks the files on disk (/etc/sysctl.d/), not the live values in the kernel, so we need some logic to ensure the desired values are actually loaded by the running kernel
[x] namespace awareness. Neutron works in auto-generated namespaces. The `nf_conntrack_buckets` setting is global across all namespaces. The `nf_conntrack_max` setting is per-namespace; if not set explicitly, it defaults to `nf_conntrack_buckets * 4`.
[x] introduce new prometheus metrics, with namespace awareness
[] introduce Icinga checks
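The namespace-aware metrics from the actionables above could take the shape of a textfile-collector script. A sketch under assumptions: the `_limit` metric name and the `netns` label are ours, not existing node exporter metrics, and real collection would shell out to `ip netns exec <namespace> sysctl ...`, which needs root; the per-namespace default of `nf_conntrack_buckets * 4` is the rule noted above.

```lang=python
def default_conntrack_max(buckets: int) -> int:
    """Per-namespace default when nf_conntrack_max is not set explicitly:
    nf_conntrack_buckets * 4 (nf_conntrack_buckets itself is global)."""
    return buckets * 4

def format_metrics(namespace: str, count: int, maximum: int) -> str:
    """Render Prometheus textfile-collector lines with a namespace label.
    Metric names here are illustrative, not existing exporter metrics."""
    return (
        f'node_nf_conntrack_entries{{netns="{namespace}"}} {count}\n'
        f'node_nf_conntrack_entries_limit{{netns="{namespace}"}} {maximum}\n'
    )

# Hypothetical namespace and values; a real script would enumerate
# namespaces with `ip netns list` and read /proc/sys/net/netfilter/.
print(format_metrics("qrouter-example", 1024, default_conntrack_max(65536)))
```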