With the change in the parent task (https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis), we could potentially overflow the NAT, which would result in traffic being dropped.
aborrero@cloudnet1004:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a conntrack -L --dst 184.108.40.206 | wc -l
conntrack v1.4.5 (conntrack-tools): 21527 flow entries have been shown.
21527
At the very least, as a first countermeasure, we should introduce some metrics so we can monitor this situation. Alerts could also be interesting, but they wouldn't be actionable.
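As a rough idea of what such a metric could look like, here is a minimal sketch that counts conntrack flows per destination address and renders them in the Prometheus text exposition format. This is only an illustration: the metric name `cloudvps_nat_conntrack_flows`, the function names and the sample flow lines are made up for this example, and a real collector would read the output of `conntrack -L` from inside the qrouter network namespace instead of a hardcoded string.

```python
import re
from collections import Counter

# Matches the first "dst=" field of a `conntrack -L` line, which is the
# destination of the original (pre-NAT) direction of the flow.
_DST_RE = re.compile(r"\bdst=(\S+)")


def count_flows_by_dst(conntrack_output: str) -> Counter:
    """Count flow entries per original-direction destination address."""
    counts = Counter()
    for line in conntrack_output.splitlines():
        match = _DST_RE.search(line)
        if match:  # lines without a dst= field (e.g. banners) are skipped
            counts[match.group(1)] += 1
    return counts


def render_metrics(counts, metric="cloudvps_nat_conntrack_flows") -> str:
    """Render the counts as Prometheus text-format gauge samples.

    The metric name is hypothetical, chosen only for this sketch.
    """
    lines = [f"# TYPE {metric} gauge"]
    for dst, flows in sorted(counts.items()):
        lines.append(f'{metric}{{dst="{dst}"}} {flows}')
    return "\n".join(lines) + "\n"


# Made-up sample in the shape of `conntrack -L` output (not real data).
SAMPLE = """\
tcp      6 431999 ESTABLISHED src=172.16.0.10 dst=184.108.40.206 sport=40001 dport=443 src=184.108.40.206 dst=203.0.113.1 sport=443 dport=40001 [ASSURED] mark=0 use=1
tcp      6 431998 ESTABLISHED src=172.16.0.11 dst=184.108.40.206 sport=40002 dport=443 src=184.108.40.206 dst=203.0.113.1 sport=443 dport=40002 [ASSURED] mark=0 use=1
tcp      6 100 TIME_WAIT src=172.16.0.12 dst=198.51.100.7 sport=40003 dport=80 src=198.51.100.7 dst=203.0.113.1 sport=80 dport=40003 [ASSURED] mark=0 use=1
"""

if __name__ == "__main__":
    print(render_metrics(count_flows_by_dst(SAMPLE)), end="")
```

Keeping the destination address as a label would let us graph the flows towards the wikis separately from everything else, which is the number that matters for this task.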
Anyway, the root problem here is that something in the network architecture is wrong. There are at least 2 definitive solutions to address this:
- introduce tenant networks, each tenant with its own NAT router. This is something we can't do with our current Neutron setup.
- introduce IPv6, and have all cloud -> wiki traffic be natively IPv6 without NAT.
We were already aware of this, which is why we were working on T270704: cloud: introduce new edge network architecture for eqiad1 and codfw1dev (https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh).