Migrate tools nfs from labstore1004 server to a ceph-backed VM
Closed, Resolved, Public

Description

Both user $home directories and tool home directories are stored on the bare-metal NFS server labstore1004. Rather than refresh this hardware, we're moving NFS into the cloud realm, where the data can be served by a VM NFS server.

The new NFS volume is 'tools-nfs' in the tools project, 10TB. It is attached to tools-nfs-2.tools.eqiad1.wikimedia.cloud, which will serve as the NFS server. I've already rsync'd from labstore1004 to the new volume, but tools NFS is very busy and the copy falls out of sync quickly.

Migration steps:

  • switch the mount on labstore1004 to read-only *toolforge outage begins*
# puppet agent --disable "switching nfs exports to read-only, forever"
# systemctl stop nfs-exportd.service 
# vi /etc/exports.d/tools.exports  # and replace -rw with -ro
# /usr/sbin/exportfs -ra
  • reduce our workload a bit by truncating log files on labstore1004
root@labstore1004:/srv/tools/shared/tools# find . -name '*.out' -size +100M -exec truncate --size=10M {} \;
root@labstore1004:/srv/tools/shared/tools# find . -name '*.err' -size +100M -exec truncate --size=10M {} \;
root@labstore1004:/srv/tools/shared/tools# find . -name '*.log' -size +100M -exec truncate --size=10M {} \;
  • run an rsync refresh from labstore1004 on tools-nfs-2:
rsync -av --delete --sparse /mnt/nfs/labstore-secondary-tools-home/ /srv/tools/home/ # should take 30-40 minutes
rsync -av --delete --sparse /mnt/nfs/labstore-secondary-tools-project/ /srv/tools/project/ # should take 10-12 hours
  • check the immutability of replica.my.cnf, fix if necessary
root@tools-nfs-2:/srv/tools/project# find . -name replica.my.cnf -exec chattr +i {} \;
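The command above only applies the flag; it does not report which files were missing it. A minimal sketch of the "check" half, assuming `lsattr` (from e2fsprogs) is available on tools-nfs-2 and that the volume's filesystem supports the immutable attribute. The helper name `missing_immutable` is hypothetical and not part of the original runbook:

```shell
# Hypothetical helper: reads `lsattr` output on stdin and prints the
# paths whose flag field does not contain 'i' (immutable).
missing_immutable() {
    # $1 is the flag field, e.g. "----i---------e-------". Only that field
    # is tested, because the path itself ("replica.my.cnf") contains an 'i'.
    # The sub() strips the flag field and separator so only the path prints.
    awk '$1 !~ /i/ { sub(/^[^ ]+ +/, ""); print }'
}

# Assumed usage, run from /srv/tools/project on the new server:
#   find . -name replica.my.cnf -exec lsattr {} + | missing_immutable
```

An empty result would suggest every replica.my.cnf already carries the flag; any path printed is a candidate for `chattr +i`.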

Follow-up tasks:

  • find a reliable backup solution for the new nfs volume
  • ensure tool deletion still works (T170355)
  • prune out obsolete nfs-on-metal puppet code
  • move or replace nfs monitoring (read/write, disk space)

Event Timeline

Have we considered separating the user homes and tool homes into 2 different NFS servers with 2 different cinder volumes? It could be beneficial for workload separation, and would also give us slightly smaller volumes with easier data management, etc.

I did consider it, although not seriously. You're right about the advantages, although the disadvantage is that it gives us one more server to maintain. One other advantage of splitting is that ssh would still work without the tool volume as long as the user home volume is still running.

The user homes share is much smaller, only about 500GB. So splitting it off wouldn't make much of a dent in the other share which would be 6.5TB.

Change 904562 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Toolforge: move to new VM-hosted NFS server

https://gerrit.wikimedia.org/r/904562

Ok thanks! +1 to keeping them together.

Need to research traffic-shaping (modules/labstore/manifests/traffic_shaping.pp)

Change 904627 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc

https://gerrit.wikimedia.org/r/904627

Andrew updated the task description.

Change 904630 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] labstore1004: park in an 'insetup' role until we're ready to decom

https://gerrit.wikimedia.org/r/904630

Mentioned in SAL (#wikimedia-cloud) [2023-04-03T06:52:52Z] <taavi> stop jobs-framework-emailer to prevent spam due to NFS being read-only T333477

Mentioned in SAL (#wikimedia-cloud) [2023-04-03T08:28:20Z] <taavi> stop exim4.service on tools-sgecron-2 T333477

Change 904562 merged by Andrew Bogott:

[operations/puppet@production] Toolforge: move to new VM-hosted NFS server

https://gerrit.wikimedia.org/r/904562

Change 904627 merged by Andrew Bogott:

[operations/puppet@production] nfs traffic_shaping: replace labstore1004 rules with rules for tools-nfs.svc

https://gerrit.wikimedia.org/r/904627

Change 905229 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::wmcs::nfsclient: move toolforge nodes to the new NFS server

https://gerrit.wikimedia.org/r/905229

Change 905229 merged by Andrew Bogott:

[operations/puppet@production] profile::wmcs::nfsclient: move toolforge nodes to the new NFS server

https://gerrit.wikimedia.org/r/905229

Change 905237 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: move tools nfs to new VM

https://gerrit.wikimedia.org/r/905237

Change 905237 merged by David Caro:

[operations/puppet@production] maintain_dbusers: move tools nfs to new VM

https://gerrit.wikimedia.org/r/905237

Change 905252 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] replica_cnf: update the tools paths

https://gerrit.wikimedia.org/r/905252

Change 905252 merged by David Caro:

[operations/puppet@production] replica_cnf: update the tools paths

https://gerrit.wikimedia.org/r/905252

Just a note that this major event could have been much better publicised. I wasted a significant amount of time yesterday trying to debug my tool to find out why it wasn't working properly, before finally stumbling across this planned outage. I saw no notice on affected projects. I found nothing searching the toolforge Wiki. I saw nothing in the MOTD.

You should subscribe to the cloud-announce mailing list, where it was announced.

Change 904630 merged by Andrew Bogott:

[operations/puppet@production] labstore1004: park in an 'insetup' role until we're ready to decom

https://gerrit.wikimedia.org/r/904630

Hallo. Re: the User-notice that was added, I'm uncertain whether, and if so how, to include this in the upcoming Tech News.
From my non-dev glance, it appears that it might be too late (and thus not useful) to announce this after the 2 events have already occurred, but perhaps there is still something that onwiki folks who don't read cloud-announce could usefully do on/after Monday when the next issue is delivered?
If you believe that it needs to be included, please could you suggest a draft summary, before it is frozen in ~22 hours? (1-3 short and simple sentences, ideally with 1 link to further details). Thank you!

@Quiddity I'll try drafting a summary, it's not essential to include any of this, but it might be useful for folks who don't read cloud-announce as you mentioned.

  • The Toolforge NFS server was running on aging hardware and lacked a straightforward path for maintenance or upgrading. To improve this we moved it to a Cinder+VM platform which should support easier upgrades, migrations, and expansions in the future. Link: https://phabricator.wikimedia.org/T333477

Thanks @fnegri ! My apologies, I should've been clearer. We don't usually include purely informational items (except for widespread/high-impact problems/outages), but instead focus on things like action-items, or highlighting new usable features for editors.
IIUC: there are no potential end-user action items for the 1st entry; but there is a potential action item for the 2nd entry.
I can imagine something like this, but I'm not sure of the details, and more importantly I'm not sure if many thousands of people need to see this outside of cloud-announce@:

I'll hesitantly not include it, and hence remove the User-notice, unless someone strongly recommends otherwise. Thanks again.

Change 907133 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-cloud: remove labstore term

https://gerrit.wikimedia.org/r/907133

Change 907135 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: openstack: drop NAT exceptions for nfs-tools-project

https://gerrit.wikimedia.org/r/907135

Change 907136 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] wmnet: Remove nfs-tools-project.svc.eqiad

https://gerrit.wikimedia.org/r/907136

That's fair, but for a disruption of this magnitude, I would have expected to find information in the three other places I listed.

Change 907133 merged by jenkins-bot:

[operations/homer/public@master] cr-cloud: remove labstore term

https://gerrit.wikimedia.org/r/907133

fnegri changed the task status from Open to In Progress. Apr 12 2023, 4:45 PM
fnegri triaged this task as High priority.

Change 909612 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] O:wmcs::nfs: remove primary_backup::tools and related classes

https://gerrit.wikimedia.org/r/909612

Change 907135 merged by Andrew Bogott:

[operations/puppet@production] hieradata: openstack: drop NAT exceptions for nfs-tools-project

https://gerrit.wikimedia.org/r/907135

Change 909612 merged by Andrew Bogott:

[operations/puppet@production] O:wmcs::nfs: remove primary_backup::tools and related classes

https://gerrit.wikimedia.org/r/909612

Change 911424 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Remove unused role and profile for wmcs project- and home- nfs servers

https://gerrit.wikimedia.org/r/911424

Change 911424 merged by Andrew Bogott:

[operations/puppet@production] Remove unused role and profile for wmcs project- and home- nfs servers

https://gerrit.wikimedia.org/r/911424

Change 907136 merged by FNegri:

[operations/dns@master] wmnet: Remove nfs-tools-project.svc.eqiad

https://gerrit.wikimedia.org/r/907136