[NFS] Reduce or eliminate bare-metal NFS servers
Closed, Resolved · Public
Assigned To

Authored By

	Andrew
	Sep 20 2021, 4:36 PM

Description

T290602 has inspired some frantic conversation about the future of our NFS servers. The current plan is:

Decisions taken:

we will use regular NFS VMs, one per share
all in the cloudinfra-nfs VPS project (to be created)
volume backups will happen using cinder-backup service, on cloudbackup2001 (codfw datacenter)
will automate the provisioning of the NFS VMs using cookbooks
will do a first run of the migration process and iterate on that

Done:

Tested creating NFS VMs using cinder volumes manually with puppet config and tested mounting it on toolsbeta

Doing:

Set up the cinder-backups service on cloudbackup2001 and link it to the eqiad cluster
Automate with cookbooks the creation of the NFS VMs and volumes
Do a test run of the migration procedure with one of the less busy shares (scratch/misc)

To define:

How/what to monitor/alert on for this system
Iterate on the migration procedure on how to migrate the rest of the shares
Add a script to trigger the volume backups on clouddb on a weekly basis

Notes:
CephFS use is not in our immediate plans because that opens complicated networking/DC questions that we're not ready to think about

Details

Subject	Repo	Branch	Lines +/-
wmcs: Remove unused role wmcs::nfs::secondary	operations/puppet	production	+1 -230
Move cloudstore1008/1009 to role::spare	operations/puppet	production	+1 -1
profile::wmcs::nfs::standalone: keep the nfs service running	operations/puppet	production	+1 -1
cloudnfs: Add a hiera key to switch scratch hosting on or off	operations/puppet	production	+5 -0
nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs	operations/puppet	production	+4 -0
cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server	operations/puppet	production	+1 -1
profile::wmcs::nfs::standalone: bind service IP to VM	operations/puppet	production	+9 -0

Related Objects

Status	Subtype	Assigned	Task
Open		None	T272395 Cloud: reduce NAT exceptions from cloud to production
Resolved		Andrew	T291405 [NFS] Reduce or eliminate bare-metal NFS servers
Resolved		aborrero	T291257 Cloud: NFS: PoC: manila with generic driver using DHSS=true
Resolved		Andrew	T291406 POC: puppet-provision a cinder-backed NFS server in eqiad1
Duplicate		None	T291409 [NFS] Update maintain_dbusers so it can run on a VM
Resolved		Andrew	T292546 cloud NFS: figure out backups for cinder volumes
Resolved		aborrero	T293752 cloud ceph: refactor rbd client puppet profiles
Duplicate		None	T294429 cinder-backups: figure out automation
Resolved		aborrero	T295584 eqiad: 2 VMs for cloudbackup-dev
Resolved		aborrero	T296413 cinder: get victoria point release in the bpo repo
Resolved		aborrero	T299708 network access to eqiad ceph cluster from cloudbackup2002
Resolved		Andrew	T339830 cinder-backup getting OOM-killed for large volumes
Resolved		Andrew	T344065 Replace cinder-backup process with backy2
Resolved		Andrew	T358855 Use cloudbackup100[12]-dev for cinder backup test/dev
Resolved		Andrew	T366071 'backy2 cleanup' not getting called properly on cloudbackup hosts
Resolved		Andrew	T293800 [NFS] Automate creation of the NFS VM and the cinder volume to attach to it
Resolved		Andrew	T293801 [NFS] Test the share migration process to an NFS VM
Open		dcaro	T293804 [NFS] Add monitoring and alerting to the new NFS system
Resolved		Andrew	T293805 [NFS] Create script to automate the cinder volume backups on cloudbackup2001
Resolved		Andrew	T295592 deploy cinder-backups for eqiad1
Resolved		Andrew	T306200 CRITICAL: Status of the systemd unit backup_cinder_volumes
Resolved		• nskaggs	T312847 SystemdUnitDownForLong cloudcontrol1005:9100 Unit backup_cinder_volumes.service on node cloudcontrol1005 has been down for long.
Open		None	T301279 NFS-on-ceph: monitoring
Resolved		Andrew	T301280 Move project-specific NFS mounts onto project-local NFS servers
Resolved		Andrew	T301294 Does account-creation-assistance really need NFS?
Resolved		Andrew	T301295 Does the cloud-vps UTRS project need NFS?
Invalid		Andrew	T301297 Does the cloud-vps Video project need NFS?
Resolved		Andrew	T301298 Does the cloud-vps WikiPathways project need NFS?
Declined		Andrew	T301299 Does the cloud-vps wmde-templates-alpha project need NFS?
Resolved		Andrew	T301300 Does the 'math' project need NFS?
Declined		Andrew	T301301 Does the cloud-vps 'huggle' project need NFS?
Resolved		Andrew	T300694 Move cloud-vps Maps nfs share to vm-hosted NFS
Declined		Andrew	T301620 Does the 'dumps' project need NFS?
Resolved		Andrew	T301646 Does the 'wikilink' project really need NFS?
Resolved		jsn.sherman	T302401 migrate wikilink backups from nfs to cinder
Resolved		Andrew	T301715 Stop using NFS for the project-proxy cloud-vps project
Resolved		dcaro	T303663 Split maintain-dbusers.py into two parts, one to run on cloudcontrol nodes and one to run on an NFS server VM
Resolved		dcaro	T304040 REST api service to manage toolforge replica.my.cnf
Open		None	T214541 python3-ldap3 mixed versions and future traps
Resolved		• taavi	T329377 [bug] Server does not start
Resolved		• Marostegui	T330697 labsdbaccounts database grant for cloudcontrol1005
Resolved		None	T330916 Ensure that labstore1004.eqiad.wmnet accept http requests from cloudcontrol1005.wikimedia.org
Resolved		• Marostegui	T331014 Add GRANT access to cloudcontrol1005 for labsadmin to wikireplicas
Resolved		rook	T331056 PAWS cluster for nfs cutover
Resolved	BUG REPORT	dcaro	T332762 New tool not allowed to connect to toolsdb
Invalid	BUG REPORT	dcaro	T332789 [maintain-dbusers] when filtering by a tool we use inconsistent filters
Resolved	BUG REPORT	dcaro	T332798 [maintain-dbusers] When creating accounts, the script bails out processing other accounts if one of them fails in an unexpected way
Open		None	T332955 [maintain-dbusers] Generate prometheus metrics
Resolved		dcaro	T332954 [maintain-dbusers] allow filtering by account type when running maintain
Resolved		None	T325012 Is paws-nfs-1 still used
Resolved		Andrew	T333477 Migrate tools nfs from labstore1004 server to a ceph-backed VM
Resolved		Andrew	T322219 Refactor tool deletion code for nfs-on-cinder

Event Timeline

Andrew created this task. Sep 20 2021, 4:36 PM

Restricted Application added a subscriber: Aklapper. Sep 20 2021, 4:36 PM

Notes from our just-completed meeting (for future reference):

Our NFS server pools are:

Tools: using 6 of 8 TB
Maps: using 5 of 8 TB
Other projects (neither tools nor maps): using 2 of 5 TB
Scratch: using 2 of 4 TB

Total: using 15 of the 25 TB allocated
(Ceph currently has ~35 TB available)
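As a quick sanity check of the totals above, here is a minimal sketch that sums the per-pool figures and compares them with the Ceph number. It assumes the ~35 TB is usable space and ignores replication overhead, which the notes do not specify; the variable names are illustrative only.

```python
# Pool figures copied from the notes above: (used TB, allocated TB).
pools = {
    "tools": (6, 8),
    "maps": (5, 8),
    "other projects": (2, 5),
    "scratch": (2, 4),
}

used_tb = sum(used for used, _ in pools.values())
allocated_tb = sum(alloc for _, alloc in pools.values())

# Approximate figure from the notes. ASSUMPTION: treated as usable space,
# with no allowance for replication overhead.
ceph_available_tb = 35

# Ceph space left over if only the data in use moved, or if the full
# current allocations were recreated as volumes.
headroom_if_used = ceph_available_tb - used_tb
headroom_if_allocated = ceph_available_tb - allocated_tb

print(f"used={used_tb} TB, allocated={allocated_tb} TB")
print(f"headroom: {headroom_if_used} TB (used), {headroom_if_allocated} TB (allocated)")
```

Under those assumptions the shares fit either way, but recreating the full allocations would leave much less slack than moving only the data currently in use.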

Dumps (read-only and SO BIG that we aren't talking about this today) (dumps is worked on by Ariel and also some analytics/Data Engineering folks)

Status Quo
- Pros: It's the status quo, using DRBD
- Cons: Clunky, requires domain-specific knowledge, Violates network separation rules

Status Quo but with rsync backup instead of drbd
- Pros: not needing to understand drbd
- Cons: Potential data loss between backups; violates network separation, clunky

Existing server model but on VMs (no openstack manila)
- Pros: fewer kinds of hardware, fewer kinds of networks, we could start doing this today!
- Cons: possible network congestion, heavy Ceph usage, possibly difficult migration

- The Backup Plan *****
Some different server model on VMs (e.g. more servers but no automatic provisioning)
- Pros: roughly the same as above but possibly with better load/risk distribution, we could start doing this today!
- Cons: roughly the same as above

- The WINNER ******
Proper openstack-native share management via Manila
- Pros: builds VMs with nova, cinder volumes, etc. More or less automates the VM model? Also supports quotas. Could flip later to cephfs easily
- Cons: Tools would still be its own project (WMCS would have to manage); less flexibility to configure NFS since Manila will want us to treat it as a black box

CephFS
- Pros: quotas?, supported by Manila
- Cons: new/unknown, requires network proxy. How can you authenticate? (Ceph is in the production realm)

Open questions:

Do we want to put NFS data into Ceph?
- Ceph is the only scalable performant solution.
- What about backups? Could use backy2, reusing existing backup servers and jobs. Not everything can/will be 100% backed up.

What about HA?
- DRBD'd NFS servers are in the same rack, given the direct cable. Limits physical setup
- Don't auto failover as-is

What about network traffic?
- Think carefully about network setup and flows between racks
- One reason NFS in VMs won't work is bandwidth constraints / concerns, at least as we build VMs now
- i.e., create a dedicated cloud-virt to host NFS VMs, etc.

Which of those scenarios requires us to re-learn all of the performance throttling that we've learned with our existing setup?

DON'T DO NFS soft-mounts. Once they time out, they won't recover and you need to reboot the VM.
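One way to audit clients for this is to scan the mount table for NFS entries carrying the `soft` option. The sketch below is only an illustration, not something taken from our puppet code: the function name and sample hostnames are made up, and it assumes the standard `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options).

```python
def soft_nfs_mounts(mounts_text):
    """Return the mount points of NFS filesystems mounted with 'soft'."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        # Only client-side NFS mounts; this skips e.g. the 'nfsd' pseudo-fs.
        if fs_type in ("nfs", "nfs4") and "soft" in options.split(","):
            flagged.append(mount_point)
    return flagged


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = """\
nfs.example.invalid:/srv/scratch /mnt/nfs/scratch nfs4 rw,soft,timeo=300 0 0
nfs.example.invalid:/srv/tools /mnt/nfs/tools nfs4 rw,hard,timeo=300 0 0
/dev/sda1 / ext4 rw,relatime 0 0
"""

print(soft_nfs_mounts(SAMPLE))
```

On a real client the same function could be fed the contents of `/proc/mounts`, for example `soft_nfs_mounts(open("/proc/mounts").read())`.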

How can we further separate the tools share? Making separate shares for quota / performance reasons

Andrew added a subtask: T291257: Cloud: NFS: PoC: manila with generic driver using DHSS=true. Sep 20 2021, 4:37 PM

Mentioned in SAL (#wikimedia-cloud) [2021-09-20T21:57:03Z] <andrewbogott> moving cloudvirt1043 into the 'nfs' aggregate for T291405

aborrero changed the status of subtask T291257: Cloud: NFS: PoC: manila with generic driver using DHSS=true from In Progress to Open. Oct 5 2021, 9:34 AM

aborrero closed subtask T291257: Cloud: NFS: PoC: manila with generic driver using DHSS=true as Resolved. Oct 5 2021, 10:13 AM

aborrero mentioned this in T270071: SVC DNS zonefiles and source of truth. Oct 5 2021, 2:44 PM

aborrero mentioned this in T291065: cloudstore1008/1009: buster upgrade -- nfs secondary: maps, scratch, misc shares. Oct 5 2021, 2:58 PM

aborrero mentioned this in T291068: cloud NFS: try out stretch <-> buster DRBD replication and other migration stuff.

dcaro closed subtask T291406: POC: puppet-provision a cinder-backed NFS server in eqiad1 as Resolved. Oct 19 2021, 3:31 PM

dcaro added a subtask: T293800: [NFS] Automate creation of the NFS VM and the cinder volume to attach to it. Oct 19 2021, 3:34 PM

dcaro renamed this task from Reduce or eliminate bare-metal NFS servers to [NFS] Reduce or eliminate bare-metal NFS servers. Oct 19 2021, 3:43 PM

dcaro added a subtask: T293801: [NFS] Test the share migration process to an NFS VM.

dcaro triaged this task as High priority. Oct 19 2021, 3:50 PM

dcaro updated the task description.

dcaro added subscribers: dcaro, aborrero.

dcaro updated the task description. Oct 19 2021, 4:00 PM

aborrero added a parent task: T272395: Cloud: reduce NAT exceptions from cloud to production. Oct 21 2021, 10:25 AM

dcaro added a subtask: T294432: [ceph] Enable encrypted client traffic for the ceph clusters. Oct 27 2021, 1:30 PM

dcaro removed a subtask: T294432: [ceph] Enable encrypted client traffic for the ceph clusters. Oct 27 2021, 4:07 PM

Change 753100 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::wmcs::nfs::standalone: bind service IP to VM

https://gerrit.wikimedia.org/r/753100

gerritbot added a project: Patch-For-Review. Jan 11 2022, 4:59 PM

Change 753100 merged by Andrew Bogott:

[operations/puppet@production] profile::wmcs::nfs::standalone: bind service IP to VM

https://gerrit.wikimedia.org/r/753100

Maintenance_bot removed a project: Patch-For-Review. Jan 11 2022, 6:10 PM

Andrew claimed this task. Jan 14 2022, 9:54 PM

Change 754043 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server

https://gerrit.wikimedia.org/r/754043

gerritbot added a project: Patch-For-Review. Jan 14 2022, 10:49 PM

Benjavalero mentioned this in T299469: 400 Bad Request Error when using Replacer tool on eswiki. Jan 19 2022, 7:11 AM

Change 754043 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server

https://gerrit.wikimedia.org/r/754043

Maintenance_bot removed a project: Patch-For-Review. Jan 19 2022, 5:10 PM

Agusbou2015 subscribed. Jan 19 2022, 7:28 PM

Change 758998 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs

https://gerrit.wikimedia.org/r/758998

Change 758998 merged by Andrew Bogott:

[operations/puppet@production] nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs

https://gerrit.wikimedia.org/r/758998

Maintenance_bot removed a project: Patch-For-Review. Feb 2 2022, 2:10 AM

Change 761438 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudnfs: Add a hiera key to switch scratch hosting on or off

https://gerrit.wikimedia.org/r/761438

gerritbot added a project: Patch-For-Review. Feb 9 2022, 6:42 PM

Change 761438 merged by Andrew Bogott:

[operations/puppet@production] cloudnfs: Add a hiera key to switch scratch hosting on or off

https://gerrit.wikimedia.org/r/761438

Andrew closed subtask T300694: Move cloud-vps Maps nfs share to vm-hosted NFS as Resolved. Feb 9 2022, 7:00 PM

Change 761981 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::wmcs::nfs::standalone: keep the nfs service running

https://gerrit.wikimedia.org/r/761981

Change 761981 merged by Andrew Bogott:

[operations/puppet@production] profile::wmcs::nfs::standalone: keep the nfs service running

https://gerrit.wikimedia.org/r/761981

Change 773819 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Move cloudstore1008/1009 to role::spare

https://gerrit.wikimedia.org/r/773819

Change 773819 merged by Andrew Bogott:

[operations/puppet@production] Move cloudstore1008/1009 to role::spare

https://gerrit.wikimedia.org/r/773819

Andrew closed subtask T293800: [NFS] Automate creation of the NFS VM and the cinder volume to attach to it as Resolved. Apr 4 2022, 3:19 PM

Change 779446 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs: Remove unused role wmcs::nfs::secondary

https://gerrit.wikimedia.org/r/779446

Change 779446 merged by David Caro:

[operations/puppet@production] wmcs: Remove unused role wmcs::nfs::secondary

https://gerrit.wikimedia.org/r/779446

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Sep 27 2022, 9:31 PM

Andrew closed subtask T295592: deploy cinder-backups for eqiad1 as Resolved. Sep 27 2022, 10:28 PM

Andrew closed subtask T292546: cloud NFS: figure out backups for cinder volumes as Resolved. Dec 15 2022, 10:11 PM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 7:25 PM

fnegri moved this task from Kanban to Doing? (legacy column) on the cloud-services-team board.

fnegri moved this task from Doing? (legacy column) to Inbox on the cloud-services-team board. Jan 19 2023, 1:02 PM

Andrew closed subtask T322219: Refactor tool deletion code for nfs-on-cinder as Resolved. Apr 14 2023, 12:22 AM

Maintenance_bot removed a project: Patch-For-Review. Apr 14 2023, 12:29 AM

I think we're now down to the minimum -- just dumps (which are huge) are on metal and everything else is on VMs.

Andrew closed subtask T301280: Move project-specific NFS mounts onto project-local NFS servers as Resolved. May 23 2023, 1:46 PM

Andrew closed subtask T293805: [NFS] Create script to automate the cinder volume backups on cloudbackup2001 as Resolved. Jun 19 2023, 2:55 AM

Andrew closed subtask T293801: [NFS] Test the share migration process to an NFS VM as Resolved. Wed, Jul 24, 2:34 PM
