On 2020-05-21 we reimaged cloudnet servers from Stretch to Buster.
Since then, the Prometheus node exporter has been collecting bogus nf_conntrack metrics, probably because the upgraded Linux kernel changed how these values are shared between network namespaces. The important nf_conntrack values live inside the auto-generated Neutron network namespace, so the exporter reports zeros:
```lang=shell-session
aborrero@prometheus1003:~ $ curl cloudnet1003.eqiad.wmnet:9100/metrics | grep nf_conntrack
# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 0
[..]
```
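The `curl | grep` check above can also be scripted, e.g. to flag when the root-namespace gauge reads zero (which, per the above, signals the exporter is looking outside the Neutron namespace). A minimal sketch; the helper names are ours, and the sample text is the exporter output shown above:

```lang=python
def grep_metrics(text: str, needle: str) -> list[str]:
    """Return exporter output lines containing the given substring,
    equivalent to piping the metrics page through grep."""
    return [line for line in text.splitlines() if needle in line]

# Sample taken from the exporter output shown above.
sample = """# HELP node_nf_conntrack_entries Number of currently allocated flow entries for connection tracking.
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 0
node_memory_MemFree_bytes 1.0
"""

for line in grep_metrics(sample, "nf_conntrack"):
    print(line)
```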
Our Grafana panels showed no useful information: https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?panelId=32&fullscreen&orgId=1
A quick check on the server confirmed we were indeed having network problems:
```lang=shell-session
aborrero@cloudnet1003:~ $ sudo dmesg -T | tail
[Sat Jul 4 15:35:48 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[Sat Jul 4 15:35:53 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[Sat Jul 4 15:57:29 2020] nf_conntrack: nf_conntrack: table full, dropping packet
[..]
```
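The `table full, dropping packet` messages above appear once the conntrack table is exhausted, i.e. when `nf_conntrack_count` reaches `nf_conntrack_max`. A small sketch of the utilization check one would want a metric or alert on (the function name is ours; on a live host the inputs would come from `/proc/sys/net/netfilter/`):

```lang=python
def conntrack_utilization(count: int, max_entries: int) -> float:
    """Fraction of the conntrack table in use.
    At 1.0 the kernel starts logging 'table full, dropping packet'."""
    return count / max_entries

# Hypothetical values for illustration, not read from a real host.
print(conntrack_utilization(131072, 262144))  # half full
print(conntrack_utilization(262144, 262144))  # table full, packets dropped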
Actionables:
[x] improve how we manage these values with puppet. Puppet only checks the files on disk (/etc/sysctl.d/), not the live values in the kernel, so we need some logic to ensure the desired values are actually loaded by the running kernel
[x] namespace awareness. Neutron works in auto-generated namespaces. The `nf_conntrack_buckets` setting is global across all namespaces. The `nf_conntrack_max` setting is per-namespace; if not set explicitly, it defaults to `nf_conntrack_buckets * 4`.
[x] introduce new prometheus metrics, with namespace awareness
[] introduce Icinga checks
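The namespace-aware metrics from the actionables above could take the shape of a textfile-collector script. A sketch under assumptions: the `_limit` metric name and the `netns` label are ours, not existing node exporter metrics, and real collection would shell out to `ip netns exec <namespace> sysctl ...`, which needs root; the per-namespace default of `nf_conntrack_buckets * 4` is the rule noted above.

```lang=python
def default_conntrack_max(buckets: int) -> int:
    """Per-namespace default when nf_conntrack_max is not set explicitly:
    nf_conntrack_buckets * 4 (nf_conntrack_buckets itself is global)."""
    return buckets * 4

def format_metrics(namespace: str, count: int, maximum: int) -> str:
    """Render Prometheus textfile-collector lines with a namespace label.
    Metric names here are illustrative, not existing exporter metrics."""
    return (
        f'node_nf_conntrack_entries{{netns="{namespace}"}} {count}\n'
        f'node_nf_conntrack_entries_limit{{netns="{namespace}"}} {maximum}\n'
    )

# Hypothetical namespace and values; a real script would enumerate
# namespaces with `ip netns list` and read /proc/sys/net/netfilter/.
print(format_metrics("qrouter-example", 1024, default_conntrack_max(65536)))
```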