
POC: puppet-provision a cinder-backed NFS server in eqiad1
Closed, Resolved (Public)

Description

We're going to move tools-beta into a VM-hosted NFS server and see how it goes.

Event Timeline

Andrew renamed this task from "POC: hand-provision a cinder-backed NFS server in eqiad1" to "POC: puppet-provision a cinder-backed NFS server in eqiad1". Sep 20 2021, 4:58 PM

Change 722418 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] New roles for NFS servers on VMs

https://gerrit.wikimedia.org/r/722418

Change 722420 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: set up a PoC Openstack instance-based nfs server

https://gerrit.wikimedia.org/r/722420

Ok, the general deploy setup I suggested in my puppet patch would get you a server that probably cannot do much on its own. From there, that server would want a cinder volume and a floating IP that we can move between hosts easily. The floating IP should have a DNS name attached that we can use in the client manifests, and we need a security group that opens port 2049 only to the cloud-internal range. That should be enough for a very basic setup and test without needing a special hypervisor or anything like that.
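For concreteness, here is a rough sketch of those pieces via the OpenStack CLI; the instance, volume, and security-group names, the external network, and the internal CIDR below are placeholders rather than the real values:

```
# Hedged sketch of the basic setup described above; names, the external
# network, and the CIDR are placeholders.

# Cinder volume attached to the NFS server instance
openstack volume create --size 10 cloudstore-nfs-data
openstack server add volume cloudstore-nfs-01 cloudstore-nfs-data

# Floating IP that can be re-pointed at a replacement host later
openstack floating ip create <external-network>
openstack server add floating ip cloudstore-nfs-01 <allocated-floating-ip>

# Security group that only exposes NFS (2049/tcp) to the cloud-internal range
openstack security group create nfs-server
openstack security group rule create --protocol tcp --dst-port 2049 \
    --remote-ip <cloud-internal-cidr> nfs-server
openstack server add security group cloudstore-nfs-01 nfs-server
```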

Right now I'm assuming this will look like the "misc" volume and will only actually be something you'd want to mount on toolsbeta once we migrate the data (which would require an NFS outage for toolsbeta, ideally by disabling puppet on labstore1004 and unexporting its share while completing the rsync).
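A rough outline of what that migration window might look like; the paths, the destination hostname, and the export spec are illustrative, not the real values:

```
# Run on labstore1004: disable puppet so the export isn't re-added, drop the
# toolsbeta export, finish the rsync, then re-enable puppet.
sudo puppet agent --disable "toolsbeta NFS migration"
sudo exportfs -u '*:/srv/misc/shared/toolsbeta'
sudo rsync -aHAX --delete /srv/misc/shared/toolsbeta/ \
    <new-nfs-server>:/srv/toolsbeta/
sudo puppet agent --enable
```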

Change 722418 abandoned by Andrew Bogott:

[operations/puppet@production] New roles for NFS servers on VMs

Reason:

dropping in favor of 722420

https://gerrit.wikimedia.org/r/722418

Change 722420 merged by Bstorm:

[operations/puppet@production] cloudnfs: set up a PoC Openstack instance-based nfs server

https://gerrit.wikimedia.org/r/722420

Mentioned in SAL (#wikimedia-cloud) [2021-09-20T22:36:18Z] <bstorm> created cloudstore-nfs-01 with a floating ip and a 10GB cinder volume T291406

Change 722477 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: switch packages to ensure_packages

https://gerrit.wikimedia.org/r/722477

Change 722477 merged by Bstorm:

[operations/puppet@production] cloudnfs: switch packages to ensure_packages

https://gerrit.wikimedia.org/r/722477

Change 722478 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: remove the redundant nfs-common file

https://gerrit.wikimedia.org/r/722478

Change 722478 merged by Bstorm:

[operations/puppet@production] cloudnfs: remove the redundant nfs-common file

https://gerrit.wikimedia.org/r/722478

Change 722479 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: remove deprecated dependency on nfs-manage-binds

https://gerrit.wikimedia.org/r/722479

Change 722479 merged by Bstorm:

[operations/puppet@production] cloudnfs: remove deprecated dependency on nfs-manage-binds

https://gerrit.wikimedia.org/r/722479

Change 722643 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: remove another conflict with the wmcs/instance profile

https://gerrit.wikimedia.org/r/722643

Change 722643 merged by Bstorm:

[operations/puppet@production] cloudnfs: remove another conflict with the wmcs/instance profile

https://gerrit.wikimedia.org/r/722643

Ok, a test server is live now in the cloudstore project. Since I did it the lowest-effort way, it is just using the current version of nfs-exportd, which assumes the server hosts all projects. One problem: I see it uses public IPs for clients that have them. I'm not sure that will work, but there's one way to find out! Time to tinker with the NFS client mounts for toolsbeta.

One thing that could use doing: set up nfs-exportd (or a fork of it) to accept a list of projects hosted on the VM and only export those.
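As a quick sanity check on the exports and the public-IP question, something like this from a toolsbeta client should show what the PoC server is handing out (the hostname and export path are placeholders):

```
# See what the server exports and which client addresses it allows,
# then try a throwaway mount. Hostname and export path are placeholders.
showmount -e <poc-nfs-server>
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs4 -o vers=4.2,noatime <poc-nfs-server>:/srv/toolsbeta /mnt/nfs-test
```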

Change 722647 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: some cleanup in case we want more forking

https://gerrit.wikimedia.org/r/722647

Change 722647 merged by Bstorm:

[operations/puppet@production] cloudnfs: some cleanup in case we want more forking

https://gerrit.wikimedia.org/r/722647

Change 722658 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: connect toolsbeta to the VM-based NFS server for testing

https://gerrit.wikimedia.org/r/722658

Change 722658 merged by Bstorm:

[operations/puppet@production] cloudnfs: connect toolsbeta to the VM-based NFS server for testing

https://gerrit.wikimedia.org/r/722658

Change 722668 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: adding DNS to nfs server

https://gerrit.wikimedia.org/r/722668

Change 722668 merged by Bstorm:

[operations/puppet@production] cloudnfs: adding DNS to nfs server

https://gerrit.wikimedia.org/r/722668

Change 722689 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] cloudnfs: refactor profile to allow hiera to set server vols

https://gerrit.wikimedia.org/r/722689

Change 722689 merged by Bstorm:

[operations/puppet@production] cloudnfs: refactor profile to allow hiera to set server vols

https://gerrit.wikimedia.org/r/722689

Mentioned in SAL (#wikimedia-cloud) [2021-09-22T18:06:33Z] <bstorm> launching tools-nfs-test-client-01 to run a "fair" test battery against T291406

Mentioned in SAL (#wikimedia-cloud) [2021-09-22T18:07:11Z] <bstorm> launching toolsbeta-nfs-test-client-01 to run a "fair" test battery against T291406

tc_setup.sh should have a "clean" option that will remove all local traffic shaping. If that works correctly, then we should be able to run a fair test between our current setup and the current PoC VM NFS server setup.
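For reference, the effect a "clean" option needs to have is roughly this, assuming the shaping sits on eth0 (this is not necessarily how tc_setup.sh implements it):

```
# Remove local traffic shaping so benchmarks aren't rate-limited.
# Interface name is an assumption.
sudo tc qdisc del dev eth0 root    2>/dev/null || true   # egress shaping
sudo tc qdisc del dev eth0 ingress 2>/dev/null || true   # ingress policing
tc qdisc show dev eth0                                   # should be back to the default qdisc
```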

We can run some fio, iotop, and ioping style tests, plus some ad-hoc dds. I'll try the clean option as soon as the build is done.
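Roughly the sort of commands I have in mind, run from inside the NFS mount on the test client (mount point, sizes, and runtimes are illustrative):

```
cd /mnt/nfs-test

# random 4k read/write mix at a modest queue depth
fio --name=randrw --rw=randrw --bs=4k --size=1G --ioengine=libaio \
    --iodepth=8 --direct=1 --runtime=60 --time_based

# round-trip latency to the share
ioping -c 20 .

# crude sequential write with dd
dd if=/dev/zero of=ddtest bs=1M count=1024 conv=fdatasync
rm -f ddtest
```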

Some test results: https://docs.google.com/spreadsheets/d/1rXXxZwwB9yPir3LrgMdwfM0oTl_5Nv0AyrdnC4rpH5E/edit#gid=2062561847 The doc is, unfortunately, not public, but I didn't want it owned by me personally in Gdocs either.

In general, it is clear that the cinder-backed VM has some real performance advantages over our current NFS server, likely because the disks in our current server are pretty cheap compared to the SSDs backing cinder, regardless of replication and networking overhead. On the other hand, latency is a bit squirrelly in some cases, and there are a lot of variables in play. The one thing it shows for sure is that fast disks still matter even when you access them through a large stack of abstractions. Those abstractions are probably why some numbers are not as good as on the rather old labstore1004: direct NFS is bound to be faster than NFS via ceph -> cinder -> VM -> NFS. Still, the VM doesn't seem incapable, though it struggles more at high queue depths. Sequential writes are just not as good, but random ops are nearly always much better on SSD no matter how you use them. I imagine copying large files (dumps) around would be painful.

Interesting results in general!

Mentioned in SAL (#wikimedia-cloud) [2021-10-04T16:35:45Z] <bstorm> deleting cloudstore-nfs-01 as that was the old instance for "nfs-01" T291406

Mentioned in SAL (#wikimedia-cloud) [2021-10-04T17:06:13Z] <bstorm> use cumin to edit fstab to remove old nfs mounts T291406
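That run was along these lines; the cumin host query and the pattern matched in fstab are illustrative, not the exact command used:

```
# Drop fstab lines pointing at the retired test server on the project's VMs.
sudo cumin 'O{project:toolsbeta}' "sed -i '/cloudstore-nfs-01/d' /etc/fstab"
```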