
Virtualize NFS servers used exclusively by Cloud VPS tenants
Closed, Declined · Public

Description

In the interest of preserving history, I'm resurrecting this task and proposing at least part of the design.

NFS services would be provided by VMs on Ceph using an attached volume (cinder). The VMs would implement quotas per project and would likely replicate the current server layout of labstore1004/5. They would need to operate on quiet or dedicated-ish hardware and would merit some stress testing before migration of data and mounts. If cinder isn't used to provide Kubernetes volumes directly, such VMs could also provide that via NFS.
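
For illustration only, the per-project quota piece could look something like the sketch below, assuming the cinder volume is formatted with XFS and mounted with project quotas enabled; the mount point, project name, and limit are hypothetical, not the actual labstore tooling:

```python
# Hypothetical sketch: enforce a per-project block quota on an XFS-formatted,
# cinder-backed volume mounted at MOUNT_POINT with the 'prjquota' option.
# Assumes /etc/projects and /etc/projid already map each Cloud VPS project
# directory to an XFS project ID.
import subprocess

MOUNT_POINT = "/srv/projects"  # assumed mount point of the cinder volume


def set_project_quota(project: str, hard_limit: str) -> None:
    """Initialise the XFS project and apply a hard block limit (e.g. '200g')."""
    subprocess.run(
        ["xfs_quota", "-x", "-c", f"project -s {project}", MOUNT_POINT],
        check=True,
    )
    subprocess.run(
        ["xfs_quota", "-x", "-c", f"limit -p bhard={hard_limit} {project}", MOUNT_POINT],
        check=True,
    )


if __name__ == "__main__":
    set_project_quota("toolsbeta", "200g")  # example values only
```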

Original task description:

Consider virtualizing NFS servers by converting labstore servers into cloudvirt servers with a single giant VM instance running on them. This will bring the NFS servers themselves into the internal 172.16.x.x address space and increase isolation from Wikimedia production networks and servers.

  • labstore1004 & labstore1005
  • cloudstore1008 & cloudstore1009 (which are the planned replacements for labstore1003)

Event Timeline

bd808 triaged this task as Medium priority. Feb 18 2019, 4:43 PM
bd808 created this task.

We need to be careful with huge QCOW2 files because moving them around will be really painful.

This will not be a problem once we have networked block storage, as the NFS servers would be just acting as app servers. In a way, networked block storage is a blocker for this task.

There is also a question about network throughput with the hypervisor's NIC being used to retrieve data from the distributed storage node and send it out to the NFS client. A dedicated backend network could alleviate these issues.

               A                    B
        +--------------+    +---------------+
        |              |    |               |
+-------v-----+    +---+----v-----+    +----+---------+
|             |    |              |    |              |
|  Ceph Node  |    |  NFS Server  |    |  NFS Client  |
|             |    |              |    |              |
+-------+-----+    +---^----+-----+    +----^---------+
        |              |    |               |
        +--------------+    +---------------+
               C                    D

NFS Server: A, B, C, D all flow through a single 10GbE NIC in this diagram, potentially leaving each flow with about 25% of capacity in each direction. And that's for a single VM; the hypervisor is likely to be running others.
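
To make that back-of-the-envelope point concrete, here is a toy calculation treating the NIC as shared across all four flows, as above (illustrative numbers only):

```python
# Toy model of the contention described above: Ceph traffic (A, C) and NFS
# client traffic (B, D) all cross the NFS server's single 10GbE NIC, so in
# the best case each flow gets roughly a quarter of line rate.
NIC_GBPS = 10.0
FLOWS = [
    "A: Ceph node -> NFS server",
    "B: NFS server -> NFS client",
    "C: NFS server -> Ceph node",
    "D: NFS client -> NFS server",
]

per_flow_gbps = NIC_GBPS / len(FLOWS)
for flow in FLOWS:
    print(f"{flow}: ~{per_flow_gbps:.1f} Gbps ({per_flow_gbps / NIC_GBPS:.0%} of the NIC)")
print("...and that is before the hypervisor's other VMs take their share.")
```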

It seems this ticket should be closed in light of the Ceph goal, right?

If listed as blocked, it probably would want to be totally rewritten. Probably closed lol.

My thought process when writing this was not about using virtualization to turn our pets into cattle. It was about the network isolation benefits of putting the NFS servers into the 172.16 network. I agree that evacuating enormous QCOW2 files to another cloudvirt would be functionally impossible.

Converting the base hardware to a cloudvirt and using virtualization for network isolation is the same thing that we are in the process of doing right now for the ToolsDB, OpenStreetMaps, and WikiLabels databases. Maybe we should get more experience with those instances, however, before we rush into repeating the pattern for other services.

Opening up for discussion and consideration from that viewpoint.

I would like to give a reminder that we don't need to convert the hardware to a 'cloudvirt' server to have it available in the openstack instance network.
We could just hook an additional NIC into the 172.16 subnet/VLAN and then reserve that address in neutron for that specific NIC. This was already mentioned somewhere by some of you; I'm just refreshing the idea here.

Quick and dirty diagram:

image.png (518×1 px, 69 KB)
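
For what it's worth, reserving the address in neutron for that NIC could look roughly like this openstacksdk sketch; the cloud, network, subnet, port, and address values are placeholders, not the real WMCS ones:

```python
# Rough sketch (assumed names throughout): create a neutron port with a fixed
# 172.16.x.y address so IPAM never hands that address to an instance, leaving
# it free for the storage host's extra NIC.
import openstack

conn = openstack.connect(cloud="cloudvps")  # assumed clouds.yaml entry

network = conn.network.find_network("cloud-instances-net")   # placeholder name
subnet = conn.network.find_subnet("cloud-instances-subnet")  # placeholder name

port = conn.network.create_port(
    network_id=network.id,
    name="labstore-extra-nic",  # placeholder port name
    fixed_ips=[{"subnet_id": subnet.id, "ip_address": "172.16.0.50"}],
)
print(f"Reserved {port.fixed_ips[0]['ip_address']} on port {port.id}")
```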

Great point @aborrero. One question to consider if we took this approach: would having a storage server (NFS or otherwise) attached to both the production 10.x network and the cloud tenant 172.16.x network actually provide any isolation or protection to the prod network from possible attacks originating from the cloud tenant network?

That's an interesting thought. Without digging into the security implications here, this would simply change where the management of the rules lies. I'm not sure it actually would be a gain for isolation. I think it would be a gain for convenience in a way that is perhaps not actually good.

If we consider the worst case, which is an attacker gaining control of the storage server process: in the cloudvirt case there is still a hypervisor layer between the attacker and the production network.
We could do a similar thing in the shared-network case and isolate the storage server daemon/process using namespaces. That namespace layer would then be the only thing between the compromised process and the production network (see the sketch below).

Kind of typical security tradeoff.
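
A minimal sketch of that namespace idea, purely for illustration; the interface, namespace, and address names are assumptions, not the actual labstore configuration:

```python
# Move the cloud-facing NIC into its own network namespace so a process
# confined there never sees the production-network interface at all.
import subprocess

NETNS = "cloudstorage"         # hypothetical namespace name
CLOUD_NIC = "eno2"             # hypothetical second NIC cabled to the 172.16 VLAN
CLOUD_ADDR = "172.16.0.50/21"  # hypothetical address in the instance network


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


run("ip", "netns", "add", NETNS)
run("ip", "link", "set", CLOUD_NIC, "netns", NETNS)
run("ip", "-n", NETNS, "addr", "add", CLOUD_ADDR, "dev", CLOUD_NIC)
run("ip", "-n", NETNS, "link", "set", CLOUD_NIC, "up")

# Anything started with `ip netns exec cloudstorage ...` (the storage daemon,
# for example) only sees the cloud-facing NIC; the namespace boundary is what
# stands between a compromised process and the production network.
run("ip", "netns", "exec", NETNS, "ip", "addr")
```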

WMCS meeting result:

  • OpenStack Ironic recommended by Chase instead of trying to isolate a dual-homed host ourselves
  • Continue to serve NFS as we do today, and postpone this indefinitely.
  • NFS in general: Brooke suspects the nfs-exportd interaction with LDAP may briefly cut connections with SGE nodes

Bstorm moved this task from Needs discussion to Inbox on the cloud-services-team (Kanban) board.

I'm reopening this task because, with the advent of ceph and what I now consider a pretty stable and good design of NFS storage, I think a VM-based shared-storage cluster might actually make sense, eventually.

Bstorm changed the task status from Open to Stalled. Aug 6 2020, 5:14 PM
Bstorm updated the task description.
Bstorm removed a subscriber: GTirloni.

This task cannot be worked on much until Ceph build-out is done and likely the implementation of cinder.

Ceph is here! T261132
Cinder will be piloted soon and this should be able to move forward when capacity is available.

We now have cinder. We still need additional space to do this, and we also need to be sure we actually want to do it. There are some complications in it.

It doesn't seem like we want to, because the performance characteristics and network flows would not really be good. This would likely need to be scrapped in favor of doing something like hardware management in OpenStack.