
cloud: drop NAT exception for dumps NFS
Closed, ResolvedPublic

Description

Cloud clients consume read-only NFS data from dumps servers (labstore1006/labstore1007).

There is no file locking or similar stateful access involved, and we don't apply any per-client rate limit, so there is apparently no need for a NAT exception: all clients can simply be seen by the dumps servers as coming from 185.15.56.1.

Event Timeline

aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Change 657152 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] [DONT MERGE] cloud: drop NAT exceptions for dumps NFS

https://gerrit.wikimedia.org/r/657152

hey @Bstorm, when this change is applied, most client connections will fail.

I wonder if we should figure out a way to tell NFS clients to re-initiate their connections to the server, or if the NFS client libraries are smart enough to do that for us.
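For the record, one deterministic way to make a client re-establish its connection, rather than waiting on the NFS client's own retry logic, would be a lazy unmount plus remount. This is only a sketch: the mount point follows the naming seen elsewhere in this task, the commands are printed rather than executed (they need root on a live client), and remounting assumes a corresponding fstab entry.

```shell
# Hypothetical client-side reconnect sketch; commands are printed, not run.
MNT=/mnt/nfs/dumps-labstore1007.wikimedia.org
echo "umount -l $MNT"   # lazy unmount: detach now, tear down when no longer in use
echo "mount $MNT"       # remount; assumes an fstab entry exists for this path
```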

Mentioned in SAL (#wikimedia-cloud) [2021-02-11T11:45:05Z] <arturo> [codfw1dev] create instance tools-codfw1dev-bastion-1 in tools-codfw1dev to test stuff related to T272397

Mentioned in SAL (#wikimedia-cloud) [2021-02-11T11:59:01Z] <arturo> [codfw1dev] create instance tools-codfw1dev-bastion-2 (stretch) in tools-codfw1dev to test stuff related to T272397

Change 663593 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: dumps: allow mounting dumps NFS from the tools-codfw1dev project

https://gerrit.wikimedia.org/r/663593

Change 663593 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: dumps: allow mounting dumps NFS from the tools-codfw1dev project

https://gerrit.wikimedia.org/r/663593

It seems we send the client address to the NFS server?

aborrero@tools-codfw1dev-k8s-worker-1:~$ sudo /usr/local/sbin/nfs-mount-manager mount /mnt/nfs/dumps-labstore1007.wikimedia.org
mounting /mnt/nfs/dumps-labstore1007.wikimedia.org
mount.nfs: trying text-based options 'vers=4.2,bg,intr,sec=sys,proto=tcp,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=208.80.155.106,clientaddr=172.16.128.157'
mount.nfs: mount(2): Permission denied
mount.nfs: access denied by server while mounting labstore1007.wikimedia.org:
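For context: the `clientaddr=` option in the trace is the address the NFSv4 client advertises for callbacks; the server's export access check goes by the source IP of the TCP connection itself (185.15.56.1 once the NAT exception is gone). A hedged sketch of how one might compare the two on the server side, assuming standard nfs-kernel-server tooling on the labstore hosts (commands printed, not executed):

```shell
# Server-side diagnosis sketch; to be run on the labstore host, printed here.
SERVER=labstore1007.wikimedia.org
echo "exportfs -v"   # on $SERVER: list exports and the networks they admit
echo "ss -tn state established '( sport = :2049 )'"   # source IPs of live NFS clients
```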

Change 663603 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dumps: distribution: nfs: allow mounts from cloud public IPv4 networks

https://gerrit.wikimedia.org/r/663603

Change 663603 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dumps: distribution: nfs: allow mounts from cloud public IPv4 networks

https://gerrit.wikimedia.org/r/663603

Change 664231 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dumps: distribution: nfs: allow establishing connections with TCP ports > 1024

https://gerrit.wikimedia.org/r/664231

Change 664231 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024

https://gerrit.wikimedia.org/r/664231
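Background on this one: NFS servers traditionally reject requests arriving from client source ports >= 1024 (the "secure" default), and NATed connections will commonly use high source ports. Whether the patch implemented this at the firewall level or the export level isn't shown here; at the export level, the knob would be the `insecure` option. A hypothetical /etc/exports fragment (the path is illustrative, not the real one):

```
# Hypothetical /etc/exports line: "insecure" accepts source ports >= 1024.
/srv/dumps  185.15.56.1(ro,insecure,sec=sys)
```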

Change 664233 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labstore: allow NFS connections from public cloud networks

https://gerrit.wikimedia.org/r/664233

Change 664233 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labstore: allow NFS connections from public cloud networks

https://gerrit.wikimedia.org/r/664233

The change is ready. I will coordinate with @Bstorm for an operation window soon.

We decided to:

  • send a brief heads up message to community mailing lists about this change
  • schedule an operation window next week (probably 2021-02-23)

Change 657152 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: drop NAT exceptions for dumps NFS

https://gerrit.wikimedia.org/r/657152

aborrero claimed this task.

This is done. We merged the puppet change, plus:

  • we don't reload neutron-l3-agent on puppet changes (on purpose, to avoid failover noise)
  • on the cloudnet servers, we manually deleted the iptables rules that previously implemented the NAT exception:
aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a bash
root@cloudnet1003:~ # iptables -t nat -D neutron-l3-agent-POSTROUTING -s 172.16.0.0/21 -d 208.80.154.7/32 -j ACCEPT
root@cloudnet1003:~ # iptables -t nat -D neutron-l3-agent-POSTROUTING -s 172.16.0.0/21 -d 208.80.155.106/32 -j ACCEPT
  • then deleted the conntrack entries so that existing connections don't reuse the old translation, forcing new connections through the NAT:
aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a bash
root@cloudnet1003:~ # conntrack -L --dst 208.80.154.7 -p tcp --dport 2049
[..]
root@cloudnet1003:~ # conntrack -D --dst 208.80.154.7 -p tcp --dport 2049
[..]
root@cloudnet1003:~ # conntrack -L --dst 208.80.155.106 -p tcp --dport 2049
[..]
root@cloudnet1003:~ # conntrack -D --dst 208.80.155.106 -p tcp --dport 2049
[..]
  • watched new NATed connections with tcpdump on labstore1006/7:
aborrero@labstore1007:~ $ sudo tcpdump -i any -e "tcp port 2049 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0 and host 185.15.56.1"
[..]
  • also watched old, non-NATed connections stop appearing (as clients reconnect):
aborrero@labstore1007:~ $ sudo tcpdump -i any net 172.16.0.0/16
[..]
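The per-server cleanup steps above can be condensed into one loop. A sketch, with the commands printed rather than executed since they need root inside the router namespace; the namespace name is the one from the log above:

```shell
# Sketch of the cleanup: drop the NAT-exception rules and flush the matching
# conntrack entries for both dumps servers. Printed, not executed.
NETNS=qrouter-d93771ba-2711-4f88-804a-8df6fd03978a
for dst in 208.80.154.7 208.80.155.106; do
  echo "ip netns exec $NETNS iptables -t nat -D neutron-l3-agent-POSTROUTING -s 172.16.0.0/21 -d $dst/32 -j ACCEPT"
  echo "ip netns exec $NETNS conntrack -D --dst $dst -p tcp --dport 2049"
done
```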

In case of rollback:

  • Quick and dirty: merge a revert of https://gerrit.wikimedia.org/r/c/operations/puppet/+/657152, run puppet agent on the cloudnet servers, and restart neutron-l3-agent (this will cause a failover and a noisy network hiccup).
  • Slow and smooth: merge a revert of the patch above, but don't restart neutron-l3-agent, to avoid a failover. Instead, create the iptables rules at the right position in the ruleset (you have to work out where) inside the neutron netns.
  • Monitor with tcpdump that things make sense.
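The "slow and smooth" rollback path can be sketched as: list the chain with line numbers first, then re-insert the exception rules at the computed position. The `N` below is a placeholder that must be read off the live ruleset, not a known value; commands are printed rather than executed.

```shell
# Rollback sketch for the no-failover path; printed, not executed.
# N is a placeholder rule position, to be read from the --line-numbers listing.
NETNS=qrouter-d93771ba-2711-4f88-804a-8df6fd03978a
echo "ip netns exec $NETNS iptables -t nat -L neutron-l3-agent-POSTROUTING -n --line-numbers"
for dst in 208.80.154.7 208.80.155.106; do
  echo "ip netns exec $NETNS iptables -t nat -I neutron-l3-agent-POSTROUTING N -s 172.16.0.0/21 -d $dst/32 -j ACCEPT"
done
```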

Many thanks to @Bstorm for assistance during the operation.

Mentioned in SAL (#wikimedia-cloud) [2021-02-23T23:06:54Z] <bstorm> draining tools-k8s-worker-55 to clean up after dumps changes T272397

Mentioned in SAL (#wikimedia-cloud) [2021-02-23T23:11:46Z] <bstorm> draining a bunch of k8s workers to clean up after dumps changes T272397