
cloud: drop NAT exception for dumps NFS
Closed, Resolved (Public)

Description

Cloud clients consume read-only NFS data from dumps servers (labstore1006/labstore1007).

There is no file locking or similar per-client facility involved, and we don't apply any special per-client rate limiting. There is therefore apparently no need for a NAT exception: the dumps servers can see all clients as coming from 185.15.56.1.

Event Timeline

aborrero triaged this task as Medium priority. (Jan 19 2021, 4:16 PM)
aborrero created this task.
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Change 657152 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] [DONT MERGE] cloud: drop NAT exceptions for dumps NFS

https://gerrit.wikimedia.org/r/657152

hey @Bstorm when this change is applied, most client connections will fail.

I wonder if we should figure out a way to tell NFS clients to re-initiate connections to the server or if the NFS client libs would be smart enough to do that for us.

Mentioned in SAL (#wikimedia-cloud) [2021-02-11T11:45:05Z] <arturo> [codfw1dev] create instance tools-codfw1dev-bastion-1 in tools-codfw1dev to test stuff related to T272397

Mentioned in SAL (#wikimedia-cloud) [2021-02-11T11:59:01Z] <arturo> [codfw1dev] create instance tools-codfw1dev-bastion-2 (stretch) in tools-codfw1dev to test stuff related to T272397

Change 663593 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: dumps: allow mounting dumps NFS from the tools-codfw1dev project

https://gerrit.wikimedia.org/r/663593

Change 663593 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: dumps: allow mounting dumps NFS from the tools-codfw1dev project

https://gerrit.wikimedia.org/r/663593

It seems we send the client address to the NFS server?

aborrero@tools-codfw1dev-k8s-worker-1:~$ sudo /usr/local/sbin/nfs-mount-manager mount /mnt/nfs/dumps-labstore1007.wikimedia.org
mounting /mnt/nfs/dumps-labstore1007.wikimedia.org
mount.nfs: trying text-based options 'vers=4.2,bg,intr,sec=sys,proto=tcp,lookupcache=all,nofsc,soft,timeo=300,retrans=3,addr=208.80.155.106,clientaddr=172.16.128.157'
mount.nfs: mount(2): Permission denied
mount.nfs: access denied by server while mounting labstore1007.wikimedia.org:

Change 663603 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dumps: distribution: nfs: allow mounts from cloud public IPv4 networks

https://gerrit.wikimedia.org/r/663603

Change 663603 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dumps: distribution: nfs: allow mounts from cloud public IPv4 networks

https://gerrit.wikimedia.org/r/663603

Change 664231 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] dumps: distribution: nfs: allow establishing connections with TCP ports > 1024

https://gerrit.wikimedia.org/r/664231

Change 664231 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] dumps: distribution: nfs: allow establishing connections with TCP ports >= 1024

https://gerrit.wikimedia.org/r/664231

Change 664233 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labstore: allow NFS connections from public cloud networks

https://gerrit.wikimedia.org/r/664233

Change 664233 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] labstore: allow NFS connections from public cloud networks

https://gerrit.wikimedia.org/r/664233

The change is ready. I will coordinate with @Bstorm for an operation window soon.

We decided to:

  • send a brief heads up message to community mailing lists about this change
  • schedule an operation window next week (probably 2021-02-23)

Change 657152 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: drop NAT exceptions for dumps NFS

https://gerrit.wikimedia.org/r/657152

aborrero claimed this task.

This is done. We merged the puppet change, plus:

  • we don't reload neutron-l3-agent on puppet changes (on purpose, to avoid failover noise), so the remaining steps were done by hand
  • on the cloudnet servers, manually delete the iptables rules that previously implemented the NAT exception:
aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a bash
root@cloudnet1003:~ # iptables -t nat -D neutron-l3-agent-POSTROUTING -s 172.16.0.0/21 -d 208.80.154.7/32 -j ACCEPT
root@cloudnet1003:~ # iptables -t nat -D neutron-l3-agent-POSTROUTING -s 172.16.0.0/21 -d 208.80.155.106/32 -j ACCEPT
  • then delete the conntrack entries so that connections don't reuse the old tracking information, forcing new connections to use the NAT:
aborrero@cloudnet1003:~ $ sudo ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a bash
root@cloudnet1003:~ # conntrack -L --dst 208.80.154.7 -p tcp --dport 2049
[..]
root@cloudnet1003:~ # conntrack -D --dst 208.80.154.7 -p tcp --dport 2049
[..]
root@cloudnet1003:~ # conntrack -L --dst 208.80.155.106 -p tcp --dport 2049
[..]
root@cloudnet1003:~ # conntrack -D --dst 208.80.155.106 -p tcp --dport 2049
[..]
  • watch the new NATed connections with tcpdump on labstore1006/7:
aborrero@labstore1007:~ $ sudo tcpdump -i any -e "tcp port 2049 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0 and host 185.15.56.1"
[..]
  • also confirm that the old un-NATed connections no longer appear (as clients reconnect to the server):
aborrero@labstore1007:~ $ sudo tcpdump -i any net 172.16.0.0/16
[..]
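
The manual steps above could be collected into one dry-run helper. This is only a sketch: the function name `nat_cleanup_cmds` and the variables `NETNS`, `SRC_NET` and `DUMPS_SERVERS` are hypothetical, with values copied from the commands in this comment. It only prints the commands, so they can be reviewed before anyone runs them as root.

```shell
# Values copied from the commands above; the variable names are assumptions.
NETNS="qrouter-d93771ba-2711-4f88-804a-8df6fd03978a"
SRC_NET="172.16.0.0/21"
DUMPS_SERVERS="208.80.154.7 208.80.155.106"

# Print (never execute) the cleanup commands for each dumps server.
nat_cleanup_cmds() {
    local dst
    for dst in $DUMPS_SERVERS; do
        # 1. drop the rule that implemented the NAT exception
        echo "ip netns exec $NETNS iptables -t nat -D neutron-l3-agent-POSTROUTING -s $SRC_NET -d $dst/32 -j ACCEPT"
        # 2. delete tracked NFS connections so that new ones go through the NAT
        echo "ip netns exec $NETNS conntrack -D --dst $dst -p tcp --dport 2049"
    done
}

nat_cleanup_cmds
```

Printing instead of executing keeps the sketch safe to run anywhere; the output would still need to be checked against the live ruleset before use.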

In case of rollback:

  • Quick and dirty: merge a revert of https://gerrit.wikimedia.org/r/c/operations/puppet/+/657152, run puppet agent on the cloudnet servers, and restart neutron-l3-agent (this will cause a failover, a noisy network hiccup).
  • Slow and smooth: merge a revert of the patch above. Don't restart neutron-l3-agent, to avoid a failover. Instead, create the iptables rules in the right position in the ruleset (you have to calculate where) inside the neutron netns. This won't cause a failover.
  • In either case, monitor with tcpdump that things make sense.
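
For the "slow and smooth" path, the position calculation could be sketched as below. The function name `first_match_rulenum` is hypothetical, and this task does not say which rule the exception has to precede, so the pattern passed to it is an assumption to check against the live ruleset. The function reads `iptables -S` output on stdin and prints the 1-based number of the first rule containing the pattern.

```shell
# Print the 1-based number of the first rule (an "-A" line in
# `iptables -S` output) that contains the given literal pattern.
# Returns non-zero if no rule matches.
first_match_rulenum() {
    awk -v pat="$1" '
        /^-A / { n++; if (index($0, pat)) { print n; found = 1; exit } }
        END    { exit (found ? 0 : 1) }
    '
}

# Hypothetical usage inside the neutron netns; PATTERN must identify the
# rule that the exception has to come before (not specified in this task):
#   pos=$(iptables -t nat -S neutron-l3-agent-POSTROUTING | first_match_rulenum "$PATTERN")
#   iptables -t nat -I neutron-l3-agent-POSTROUTING "$pos" -s 172.16.0.0/21 -d 208.80.154.7/32 -j ACCEPT
```

Only `-A` lines are counted because `iptables -S` also prints chain declarations (`-N`), which do not have rule numbers.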

Many thanks to @Bstorm for assistance during the operation.

Mentioned in SAL (#wikimedia-cloud) [2021-02-23T23:06:54Z] <bstorm> draining tools-k8s-worker-55 to clean up after dumps changes T272397

Mentioned in SAL (#wikimedia-cloud) [2021-02-23T23:11:46Z] <bstorm> draining a bunch of k8s workers to clean up after dumps changes T272397