
MaxConntrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1043:9100
Closed, Resolved · Public

Description

Common information

  • alertname: MaxConntrack
  • cluster: wmcs
  • instance: cloudvirt1043:9100
  • job: node
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Event Timeline

taavi added a project: Cloud-VPS.
taavi subscribed.

This is a repeat of T355061: MaxConnTrack Netfilter: Maximum number of allowed connection tracking entries alert on cloudvirt1060:9100 and I assume it's one of the migrated VMs that's causing it:

taavi@cloudcontrol1006 ~ $ os server migration list --changes-since 2024-01-15T00:00:00Z --host cloudvirt1043 --status completed
+-------+--------------------------------------+---------------------------+---------------------------+----------------+---------------+-----------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
|    Id | UUID                                 | Source Node               | Dest Node                 | Source Compute | Dest Compute  | Dest Host | Status    | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+-------+--------------------------------------+---------------------------+---------------------------+----------------+---------------+-----------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 35745 | 4f34ee5c-03eb-44af-977b-0de364673bd9 | cloudvirt1060.eqiad.wmnet | cloudvirt1043.eqiad.wmnet | cloudvirt1060  | cloudvirt1043 | None      | completed | 4f193824-85d8-4369-8be3-c8b96abbd71d |        148 |        148 | live-migration | 2024-01-16T11:18:04.000000 | 2024-01-16T11:18:40.000000 |
| 35742 | 811df44e-8d92-4e11-aafc-dcf5187e2b5e | cloudvirt1060.eqiad.wmnet | cloudvirt1043.eqiad.wmnet | cloudvirt1060  | cloudvirt1043 | None      | completed | 8bb7461b-2cb1-4a23-9405-183955a3fb4e |        248 |        248 | live-migration | 2024-01-16T11:17:48.000000 | 2024-01-16T11:18:06.000000 |
| 35739 | 93f2daf2-fb32-4f68-bf07-b04aefaac2c7 | cloudvirt1060.eqiad.wmnet | cloudvirt1043.eqiad.wmnet | cloudvirt1060  | cloudvirt1043 | None      | completed | 0b5efef1-d5d1-402a-882a-7a27b6ced0a3 |        248 |        248 | live-migration | 2024-01-16T11:17:34.000000 | 2024-01-16T11:17:50.000000 |
| 35733 | 8ed82416-f273-4cbd-af85-c07e98e37dc5 | cloudvirt1060.eqiad.wmnet | cloudvirt1043.eqiad.wmnet | cloudvirt1060  | cloudvirt1043 | None      | completed | 9178db63-751e-4727-8b9c-4b58f44cd214 |        251 |        251 | live-migration | 2024-01-16T11:16:49.000000 | 2024-01-16T11:17:14.000000 |
| 35730 | 7ddb7d13-fb07-4902-bd36-b1ed17c370ce | cloudvirt1060.eqiad.wmnet | cloudvirt1043.eqiad.wmnet | cloudvirt1060  | cloudvirt1043 | None      | completed | dba77a95-d482-43af-853a-7b72b309dd9e |        248 |        248 | live-migration | 2024-01-16T11:16:34.000000 | 2024-01-16T11:16:50.000000 |
+-------+--------------------------------------+---------------------------+---------------------------+----------------+---------------+-----------+-----------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+

taavi@cloudcontrol1006 ~ $ os server show 4f193824-85d8-4369-8be3-c8b96abbd71d -c project_id -c hostname
+------------+---------------------+
| Field      | Value               |
+------------+---------------------+
| hostname   | tools-k8s-worker-52 |
| project_id | tools               |
+------------+---------------------+
taavi@cloudcontrol1006 ~ $ os server show 8bb7461b-2cb1-4a23-9405-183955a3fb4e -c project_id -c hostname
+------------+--------------------------+
| Field      | Value                    |
+------------+--------------------------+
| hostname   | toolsbeta-sgegrid-shadow |
| project_id | toolsbeta                |
+------------+--------------------------+
taavi@cloudcontrol1006 ~ $ os server show 0b5efef1-d5d1-402a-882a-7a27b6ced0a3 -c project_id -c hostname
+------------+-----------+
| Field      | Value     |
+------------+-----------+
| hostname   | mailman03 |
| project_id | mailman   |
+------------+-----------+
taavi@cloudcontrol1006 ~ $ os server show 9178db63-751e-4727-8b9c-4b58f44cd214 -c project_id -c hostname
+------------+-------------------------+
| Field      | Value                   |
+------------+-------------------------+
| hostname   | deployment-kafka-main-6 |
| project_id | deployment-prep         |
+------------+-------------------------+
taavi@cloudcontrol1006 ~ $ os server show dba77a95-d482-43af-853a-7b72b309dd9e -c project_id -c hostname
+------------+----------------------+
| Field      | Value                |
+------------+----------------------+
| hostname   | diffscan02           |
| project_id | automation-framework |
+------------+----------------------+

Out of those, my guess would be diffscan02. Anyhow, T139598: Depleted connection tracking table on labvirt1010 had already raised the limits in 2016, and I think it might be reasonable to just increase the limit again.
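To see how close a host actually is to the limit, the live entry count can be compared against the configured maximum via the standard Linux nf_conntrack sysctls (a sketch; the 524288 table size in the example arithmetic is an assumed value, not the actual cloudvirt setting):

```shell
# Live number of tracked connections vs. the configured table maximum.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Fill percentage, e.g. 278055 entries in an assumed 524288-entry table:
awk -v c=278055 -v m=524288 'BEGIN { printf "%.1f%% full\n", 100 * c / m }'
```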

Change 991346 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: nova::compute: increase max conntrack table size

https://gerrit.wikimedia.org/r/991346

Change 991346 merged by Majavah:

[operations/puppet@production] P:openstack: nova::compute: increase max conntrack table size

https://gerrit.wikimedia.org/r/991346
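For reference, the knob such a patch would typically turn is the `nf_conntrack_max` sysctl. A hypothetical fragment (the exact value chosen in change 991346 is in Gerrit, not reproduced here):

```
# /etc/sysctl.d/... (illustrative only; value is an assumption)
net.netfilter.nf_conntrack_max = 1048576
```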

I would expect to see a high number of open connections on one of those hosts, but diffscan02 has just 6 at the moment, though it might have bursts.
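A quick way to count open connections from inside a suspect VM is `ss` (a sketch; counts will understate conntrack usage since conntrack also tracks UDP and recently-closed flows):

```shell
# Per-protocol socket summary.
ss -s

# Rough count of established TCP flows (subtract nothing for the header;
# the "state established" filter prints no header line on modern iproute2).
ss -Htan state established | wc -l
```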

Maybe we can check if any host has some conntrack error in the logs? According to this post, when a packet is dropped it should generate a log line like

nf_conntrack: table full, dropping packet
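Searching for that message could look like the following (a sketch; `|| true` keeps the pipeline from failing when nothing matches):

```shell
# Kernel ring buffer:
dmesg | grep -i 'nf_conntrack: table full' || true

# Or via the journal, kernel messages only:
journalctl -k --since today | grep -i 'table full, dropping packet' || true
```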

I found another way that confirms your guess is correct and most connections are from diffscan02 (172.16.3.44):

root@cloudvirt1043:~# conntrack -L |grep 172.16.3.44 |wc -l
conntrack v1.4.7 (conntrack-tools): 278055 flow entries have been shown.
271683
root@cloudvirt1043:~# conntrack -L |grep -v 172.16.3.44 |wc -l
conntrack v1.4.7 (conntrack-tools): 277132 flow entries have been shown.
6412
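The per-IP grep generalizes to a breakdown over all sources, which avoids guessing which VM to grep for. Note each `conntrack -L` line carries two `src=` fields (original and reply tuple), so only the first per line is counted here (a sketch):

```shell
# Top source IPs by conntrack entry count; take the first src= per flow line.
conntrack -L 2>/dev/null \
  | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^src=/) { print $i; break } }' \
  | sort | uniq -c | sort -rn | head
```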

> I found another way that confirms your guess is correct and most connections are from diffscan02 (172.16.3.44):
>
> root@cloudvirt1043:~# conntrack -L |grep 172.16.3.44 |wc -l
> conntrack v1.4.7 (conntrack-tools): 278055 flow entries have been shown.
> 271683
> root@cloudvirt1043:~# conntrack -L |grep -v 172.16.3.44 |wc -l
> conntrack v1.4.7 (conntrack-tools): 277132 flow entries have been shown.
> 6412

Nice, I was playing with ss.

The fact that the number of flow entries shows is more or less the same even if specifics for diffscan02 means anything?

> Maybe we can check if any host has some conntrack error in the logs? According to this post, when a packet is dropped it should generate a log line like
>
> nf_conntrack: table full, dropping packet

I would suspect the current check for conntrack being full would be enough?

> I would suspect the current check for conntrack being full would be enough?

Yep sorry, I was wrongly assuming that error would show up in the VM, and it would allow us to identify which VM was filling the table.

> The fact that the number of flow entries shows is more or less the same even if specifics for diffscan02 means anything?

Can you rephrase that? I'm not sure I understand what you mean.

> > The fact that the number of flow entries shows is more or less the same even if specifics for diffscan02 means anything?
>
> Can you rephrase that? I'm not sure I understand what you mean.

Sorry, I'm a bit all over the place. I mean that maybe diffscan02 is not the only one adding connections, as the total seems not to have changed much even when diffscan02 went from 270k to 6k.

diffscan02 was never at 6k afaict; the 6k in my previous comment was the total number excluding that host (grep -v).

diffscan02 now went down from 271k to 9 (without a k!) and the total went down accordingly:

root@cloudvirt1043:~# conntrack -L |grep 172.16.3.44 |wc -l
conntrack v1.4.7 (conntrack-tools): 6534 flow entries have been shown.
9
root@cloudvirt1043:~# conntrack -L |wc -l
conntrack v1.4.7 (conntrack-tools): 6310 flow entries have been shown.
6310

Also confirmed by Grafana: https://grafana.wikimedia.org/goto/JIE_C25Ik?orgId=1
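For the record, node_exporter exposes the conntrack fill level as `node_nf_conntrack_entries` and `node_nf_conntrack_entries_limit`, so an alert like MaxConntrack is typically based on a ratio along these lines (assumed; the actual WMF alert rule may differ):

```
node_nf_conntrack_entries{instance="cloudvirt1043:9100"}
  / node_nf_conntrack_entries_limit{instance="cloudvirt1043:9100"}
```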