
Virtualize NFS servers used exclusively by Cloud VPS tenants
Closed, Declined · Public

Description

Consider virtualizing NFS servers by converting labstore servers into cloudvirt servers with a single giant VM instance running on them. This will bring the NFS servers themselves into the internal 172.16.x.x address space and increase isolation from Wikimedia production networks and servers.

  • labstore1004 & labstore1005
  • cloudstore1008 & cloudstore1009 (which are the planned replacements for labstore1003)

Event Timeline

bd808 triaged this task as Normal priority. Feb 18 2019, 4:43 PM
bd808 created this task.
GTirloni added a subscriber: GTirloni. Edited Feb 21 2019, 12:15 PM

We need to be careful with huge QCOW2 files because moving them around will be really painful.

This will not be a problem once we have networked block storage, since the NFS servers would then just be acting as app servers. In a way, networked block storage is a blocker for this task.
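To make the pain concrete, here is a hypothetical sketch of what evacuating such a VM would involve; the domain name, target host, and paths are placeholders, not our actual naming:

```shell
# Live-migrating a VM whose disk is a huge local QCOW2 forces libvirt to
# stream the entire multi-TB image over the wire (--copy-storage-all):
virsh migrate --live --persistent --copy-storage-all \
    nfs-server-vm qemu+ssh://cloudvirt-target.example.wmnet/system

# Even an offline copy is gated by the same data volume:
qemu-img convert -p -O qcow2 /var/lib/libvirt/images/nfs-server-vm.qcow2 \
    /mnt/target/nfs-server-vm.qcow2
```

With networked block storage the disk stays put and only the VM state moves, which is why that work unblocks this one.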

There is also a question about network throughput: the hypervisor's NIC would be used both to retrieve data from the distributed storage node and to send it out to the NFS client. A dedicated backend network could alleviate this.

               A                    B
        +--------------+    +---------------+
        |              |    |               |
+-------v-----+    +---+----v-----+    +----+---------+
|             |    |              |    |              |
|  Ceph Node  |    |  NFS Server  |    |  NFS Client  |
|             |    |              |    |              |
+-------+-----+    +---^----+-----+    +----^---------+
        |              |    |               |
        +--------------+    +---------------+
               C                    D

NFS Server: in this diagram, flows A, B, C and D all pass through a single 10GbE NIC, potentially leaving each flow as little as 25% of the bandwidth. And that is for a single VM; the hypervisor is likely to be running others.
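The 25% figure is simple back-of-the-envelope arithmetic; a minimal sketch, assuming all four flows are active at once and share the port fairly:

```shell
#!/bin/sh
# Four flows (A, B, C, D from the diagram) sharing one 10GbE NIC,
# assuming all are active simultaneously and share bandwidth fairly.
NIC_MBPS=10000                      # 10GbE line rate in Mb/s
FLOWS=4                             # A, B, C and D
PER_FLOW=$((NIC_MBPS / FLOWS))      # worst-case share per flow
echo "${PER_FLOW} Mb/s per flow"    # prints: 2500 Mb/s per flow
```

In practice reads and writes load the RX and TX sides differently, so this is a floor estimate, not a model.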

It seems this ticket should be closed in light of the Ceph goal, right?

Bstorm added a subscriber: Bstorm. Feb 21 2019, 3:25 PM

Closed or just listed as blocked.

If listed as blocked, it probably would want to be totally rewritten. Probably closed lol.

GTirloni closed this task as Declined. Feb 21 2019, 3:58 PM

My thought process when writing this was not about using virtualization to turn our pets into cattle. It was about network isolation benefits of putting the NFS servers into the 172.16 network. I agree that evacuating enormous QCOW2 files to another cloudvirt would be functionally impossible.

Converting the base hardware to a cloudvirt and using virtualization for network isolation is the same thing that we are in the process of doing right now for the ToolsDB, OpenStreetMaps, and WikiLabels databases. Maybe we should get more experience with those instances however before we rush into repeating the pattern for other services.

Bstorm reopened this task as Open. Feb 21 2019, 11:44 PM

Opening up for discussion and consideration from that viewpoint.

I would like to give a reminder that we don't need to convert the hardware to a 'cloudvirt' server to have it available in the openstack instance network.
We could just hook an additional NIC to the 172.16 subnet/VLAN and then reserve that address in neutron for that specific NIC. This was already mentioned somewhere by some of you; I'm just refreshing the idea here.

Quick and dirty diagram: (image attachment not preserved)
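A hypothetical sketch of what the Neutron side of this could look like; the network name, subnet name, address, and port name are all placeholders:

```shell
# Reserve a fixed 172.16.x.x address in Neutron for the labstore host's
# extra NIC, so the scheduler never allocates that IP to an instance.
openstack port create \
    --network cloud-instances-net \
    --fixed-ip subnet=cloud-instances-subnet,ip-address=172.16.0.250 \
    labstore1004-eth1
```

The host would then configure its second NIC with that address statically; Neutron's record only prevents address collisions with tenant instances.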

> I would like to give a reminder that we don't need to convert the hardware to a 'cloudvirt' server to have it available in the openstack instance network.
> We could just hook an additional NIC to the 172.16 subnet/VLAN and then reserve that address in neutron for that concrete NIC. This was mentioned already somewhere by some of you, and just refreshing the idea here.

Great point @aborrero. One question to consider if we took this approach: would having a storage server (NFS or otherwise) attached to both the production 10.x network and the cloud tenant 172.16.x network actually provide any isolation or protection to the prod network from possible attacks originating from the cloud tenant network?

Bstorm added a comment. Mar 1 2019, 8:15 PM

That's an interesting thought. Without digging into the security implications here, this would simply change where the management of the rules lies. I'm not sure it actually would be a gain for isolation. I think it would be a gain for convenience in a way that is perhaps not actually good.

If we consider the worst case, an attacker gaining control of the storage server process, then in the cloudvirt case there is still a hypervisor layer between the attacker and the production network.
We could do something similar in the dual-homed case by isolating the storage server daemon/process with network namespaces; that namespace layer would then be the only thing between the compromised process and the production network.
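For illustration, a sketch of the namespace variant; interface names, the address, and the thread count are placeholders, and this needs root:

```shell
# Confine the NFS daemon to a network namespace whose only uplink is one
# leg of a veth pair attached (via bridge/routing, not shown) to 172.16.
ip netns add nfs
ip link add veth-host type veth peer name veth-nfs
ip link set veth-nfs netns nfs
ip netns exec nfs ip addr add 172.16.0.250/21 dev veth-nfs
ip netns exec nfs ip link set lo up
ip netns exec nfs ip link set veth-nfs up
# A daemon started inside the namespace only sees veth-nfs; the host's
# 10.x production interfaces are invisible unless explicitly plumbed in.
ip netns exec nfs rpc.nfsd 8
```

As noted above, this buys isolation only as strong as the namespace boundary itself, versus a full hypervisor boundary in the cloudvirt case.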

Kind of typical security tradeoff.

aborrero closed this task as Declined. Mar 19 2019, 5:11 PM

WMCS meeting result:

  • OpenStack Ironic recommended by Chase, instead of trying to isolate a dual-homed host ourselves
  • Continue to serve NFS as we do today, and postpone this indefinitely.
  • NFS in general: Brooke suspects the nfs-exportd interaction with LDAP may briefly cut connections with SGE nodes