
Set up scratch and maps NFS services on cloudstore1008/9
Closed, Resolved (Public)

Description

Now that these stretch cloudstore servers are imaged and racked, we need to construct a puppetization that provides replication (DRBD is not needed, just periodic data replication), some sense of failover, and NFS services.

The current provider of these services is labstore1003, which is to be decommissioned as soon as this setup is live for those purposes.

Note that part of the existing storage on these new boxes is intended to enable a future sync service for data backup, etc., which will not be on NFS.
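The replication piece described above (periodic rsync rather than DRBD) was eventually delivered as a puppetized sync script. As a rough illustration of the approach only, here is a minimal Python sketch; the paths, host name, and throttle value are hypothetical, not the actual production script:

```python
import subprocess


def build_rsync_cmd(src, dest_host, dest_path, bwlimit_kbps=40000):
    """Build the rsync command for one periodic replication pass.

    The bandwidth cap throttles the transfer so the sync does not
    starve live NFS traffic on the source server.
    """
    return [
        "/usr/bin/rsync", "-a", "--delete",
        f"--bwlimit={bwlimit_kbps}",
        src + "/",                       # trailing slash: copy contents
        f"{dest_host}:{dest_path}/",
    ]


def replicate(src, dest_host, dest_path, bwlimit_kbps=40000):
    """Run a single sync pass; intended to be driven by a systemd timer."""
    cmd = build_rsync_cmd(src, dest_host, dest_path, bwlimit_kbps)
    return subprocess.run(cmd, check=True)
```

The real setup also coordinated with failover state so only the active server exported NFS while syncing to the standby.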

Details

Related Gerrit Patches:
operations/puppet@production: cloudstore: scratch is a rw mount
operations/puppet@production: cloudstore: switch maps mounts from labstore1003 to cloudstore1008
operations/puppet@production: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008
operations/puppet@production: cloudstore: increase sync speed for data migration
operations/puppet@production: labstore: treat maps mounts differently for efficiency
operations/puppet@production: rsync: equal sign was removed from quickdatacopy bwlimit
operations/puppet@production: labstore: fix a couple errors and tidy mounts for migration
operations/puppet@production: labstore: shut down old rsync and export materials for migration
operations/puppet@production: cloudstore: start syncing data off labstore1003
operations/puppet@production: rsync: add a bwlimit option for quickdatacopy
operations/puppet@production: cloudstore: fix one more mistake in syncserver
operations/puppet@production: cloudstore: add to role for the syncing
operations/puppet@production: cloudstore: add to the script for syncserver
operations/puppet@production: cloudstore: touch up the script a bit from testing
operations/puppet@production: cloudstore: correct python syntax
operations/puppet@production: cloudstore: the cluster ip must be passed through
operations/puppet@production: cloudstore: cleanup extraneous bits
operations/puppet@production: cloudstore: finish up the script for sync
operations/puppet@production: cloudstore: change direction a bit on the rsync methods
operations/puppet@production: cloudstore: edit ferm rules a bit more
operations/puppet@production: cloudstore: correct problems in ferm rules
operations/puppet@production: cloudstore: add ferm rules for rsync on the scratch/maps cluster
operations/puppet@production: cloudstore: introduce rsync framework for secondary cluster
operations/puppet@production: cloudstore: test failover of cloudstore1008 to cloudstore1009
operations/puppet@production: cloudstore: add ping check for ip conflict
operations/puppet@production: cloudstore: fix the interface name and add a comment
operations/puppet@production: cloudstore: fail over ip address via hiera for scratch/maps cloudstore
operations/puppet@production: cloudstore: in stretch, the location of default nsswitch is different
operations/puppet@production: cloudstore: refactor nfsclient role into profile
operations/puppet@production: cloudstore: add python3 clientpackages for all
operations/puppet@production: cloudstore: change version to newton for cloudstore1008/9
operations/puppet@production: cloudstore: fix up some params around the rsync jobs
operations/puppet@production: cloudstore: deploy maps/scratch cluster as nfs::secondary
operations/puppet@production: labstore: refactor the backup roles so they will match the main roles
operations/puppet@production: labstore: a touch more cleanup of the secondary modules
operations/puppet@production: cloudstore: A bit more cleanup
operations/puppet@production: labstore: Adapt nfs-exportd to be used on more than one cluster
operations/puppet@production: labstore: Adapt nfs-exportd to be used on more than one cluster
operations/puppet@production: cloudstore: add extension and get nfs-manage-binds passing linter
operations/puppet@production: labstore: cleanup the remaining files after Icc89332f0e779
operations/puppet@production: labstore: fix mistake in maintain_dbusers service
operations/puppet@production: cloudstore: start refactor for role switch up around the labstores
operations/puppet@production: cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere
operations/puppet@production: wmcs::nfs::misc - Backup for misc server (cloudstore1008)
operations/puppet@production: wmcs::nfs::misc - Fix ferm rule
operations/puppet@production: wmcs::nfs::misc - Allow incoming SSH from backup server
operations/puppet@production: wmcs::nfs - Refactor snapshot_manager
operations/puppet@production: wmcs::nfs::misc - Fix roles
operations/puppet@production: block_sync - Adjust SSH private key filename
operations/puppet@production: wmcs::nfs::misc - Fix sshd config
operations/puppet@production: wmcs::nfs::misc - Fixes and backup role
operations/puppet@production: cloudstore1008/9: reimage with buster
operations/puppet@production: wmcs::nfs::misc - Disable notifications temporarily
operations/puppet@production: labstore - Allow multiple bdsync jobs per host
labs/private@master: labstore: Add id_cloudstore
operations/puppet@production: labstore::device_backup - Expose systemd OnCalendar syntax
operations/puppet@production: wmcs::nfs::misc - Remove unused /srv/* exports
operations/debs/nfsd-ldap@master: Rebuild for Stretch and add .gitreview
operations/puppet@production: wmcs::nfs::misc - Add nfsd-ldap package back
operations/puppet@production: Revert "wmcs::nfs::misc - Remove wmcs-root from admin groups"
operations/puppet@production: wmcs::nfs::misc - Remove wmcs-root from admin groups
operations/puppet@production: wmcs::nfs::misc - Second attempt to fix nsswitch.conf
operations/puppet@production: wmcs::nfs::misc - Configure nsswitch.conf
operations/puppet@production: wmcs::nfs::misc - Fix typo and nsswitch.conf file
operations/puppet@production: wmcs::nfs::misc - Refactor into profile/role
operations/puppet@production: Set spare role for cloustore1008/1009


Event Timeline


Change 503123 merged by Bstorm:
[operations/puppet@production] labstore: refactor the backup roles so they will match the main roles

https://gerrit.wikimedia.org/r/503123

Change 502342 merged by Bstorm:
[operations/puppet@production] cloudstore: deploy maps/scratch cluster as nfs::secondary

https://gerrit.wikimedia.org/r/502342

Bstorm removed a subscriber: aborrero. Apr 19 2019, 7:38 PM

Change 505325 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix up some params around the rsync jobs

https://gerrit.wikimedia.org/r/505325

Change 505325 merged by Bstorm:
[operations/puppet@production] cloudstore: fix up some params around the rsync jobs

https://gerrit.wikimedia.org/r/505325

Change 505333 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505333 merged by Bstorm:
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505339 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: try setting the openstack version differently

https://gerrit.wikimedia.org/r/505339

Change 505339 merged by Bstorm:
[operations/puppet@production] cloudstore: add python3 clientpackages for all

https://gerrit.wikimedia.org/r/505339

Change 506319 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

For the hard mounts, going to use 208.80.155.119/2620:0:861:4:208:80:155:119/nfs-maps.wikimedia.org as a floating IP.
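Because the hard mounts will reference the service by a name that follows this floating IP rather than a specific server, it can be useful to sanity-check that the name resolves to the expected floating address before a migration. A hedged sketch of such a check (a hypothetical helper, and DNS answers can of course change):

```python
import socket


def resolves_to(name, expected_ips):
    """Return True if `name` currently resolves to one of `expected_ips`.

    Hard NFS mounts pin the service name, so the name must track the
    floating IP; this is a pre-migration sanity check, not a monitor.
    """
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(name, None)}
    except socket.gaierror:
        return False
    return bool(addrs & set(expected_ips))
```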

Change 506319 merged by Bstorm:
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

Change 506472 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506472 merged by Bstorm:
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506714 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

This is nearing completion in terms of the functional components. I am sorry to say that travel next week is going to stall the data migration and cutover piece.

To give an accurate picture of the status: the rsync jobs are not quite ready yet either.

Change 506714 merged by Bstorm:
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

Change 506721 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506721 merged by Bstorm:
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506738 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506751 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751

Change 506751 merged by Bstorm:
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751
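The ping check in this change guards against IP conflicts during failover: before the standby brings up the floating service IP, it verifies that nothing else still answers at that address. A minimal sketch of the idea (not the actual production check; command path and timeout are assumptions):

```python
import subprocess


def ping_cmd(ip, timeout_s=2):
    """Command line for a single probe of `ip` with a hard deadline."""
    return ["/bin/ping", "-c", "1", "-w", str(timeout_s), ip]


def ip_is_free(ip, timeout_s=2):
    """Return True if nothing answers at `ip`.

    If the address still responds, the active server has not released
    it, and claiming it here would create a duplicate-IP conflict.
    """
    result = subprocess.run(
        ping_cmd(ip, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0  # ping fails => address is unclaimed
```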

Change 506738 merged by Bstorm:
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506847 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 506847 merged by Bstorm:
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 507094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507094 merged by Bstorm:
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507097 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507097 merged by Bstorm:
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507104 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507104 merged by Bstorm:
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507206 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507206 merged by Bstorm:
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507212 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507212 merged by Bstorm:
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507213 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507213 merged by Bstorm:
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507216 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507216 merged by Bstorm:
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507220 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507220 merged by Bstorm:
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507222 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507222 merged by Bstorm:
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507227 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507227 merged by Bstorm:
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507229 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507229 merged by Bstorm:
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507232 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Change 507232 merged by Bstorm:
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Ok. This all seems to work now. I'm prepared to set up a patch to change the client mounts and start sync jobs to migrate the data. That will wait until I get back, I imagine.

Change 509458 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 509469 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

Change 509470 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

Change 509458 merged by Dzahn:
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 510185 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

@aude @Awjrichards @Chippyy @cmarqu @coren @dschwen @jeremyb @Kolossos @MaxSem @Multichill @Nosy @TheDJ -- Just a heads up that I'm looking to begin data migration now for maps /home and /project. To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

It's a lot of data, so if the copy starts thrashing performance too much, let me know and I can try to reduce speed or something.

Change 510185 merged by Bstorm:
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

Change 510259 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510259 merged by Bstorm:
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510262 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510262 merged by Bstorm:
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510264 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264

Change 510264 merged by Bstorm:
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264
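For context on the equal-sign fix above: rsync's bandwidth cap is the `--bwlimit` option, and when a templated command line drops the `=`, the flag and its value can end up fused into an unrecognized token. A small sketch of rendering the flag defensively (a hypothetical helper, not the actual puppet template):

```python
def bwlimit_flag(kbps):
    """Render rsync's bandwidth cap as a single '--bwlimit=<KBps>' token.

    Joining flag and value with '=' keeps the option intact even when
    the command line is assembled from template fragments, which is
    where a missing '=' can silently break the throttle.
    """
    if kbps is None:
        return []          # unlimited: omit the flag entirely
    return [f"--bwlimit={int(kbps)}"]
```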

TheDJ added a comment. May 15 2019, 7:08 PM

To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

NFS sessions, you mean?

@TheDJ ssh sessions and possibly processes that run out of home directories or the project directory on NFS. Because it has NFS home directories, you'd want to make sure you re-opened your home directory after the symlink to /home is changed to the new mounts.

Change 510761 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 510761 merged by Bstorm:
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 511445 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Change 511445 merged by Bstorm:
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Scratch has now had one successful sync. Setting the patch to review and finding a reasonable date for it. Theoretically, since scratch shouldn't have a lot of open filehandles, it shouldn't be too bad as long as everything is working right.

Scheduled scratch migration for 2019-05-28@1800 UTC

The maps share finished its first rsync this weekend. It can be scheduled for pretty much any time now, as long as I keep it syncing.

Maps folks: since tearing down this server is a big priority, would you be terribly upset if I tried to swap the mounts as soon as this Wednesday? It would mean:
https://phabricator.wikimedia.org/T209527#5187505

Sorry, it's early: Wednesday is quite soon to even get an answer, but as soon as possible would be my preference. The new servers are more stable, redundant, and faster in every way. More importantly, labstore1003 needs to be turned off.

@Bstorm I run the maps-warper3 instance.

Okay with me for Wednesday, can't speak for others. Thank you for keeping us up to date with the progress and letting us know when the change is due to happen!
I'll be sure to have a look once the changes are complete and see if everything is working or not (and report back if things are not!)

Change 509469 merged by Bstorm:
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

TheDJ added a comment. May 28 2019, 6:04 PM

@Bstorm i'm super busy with work and no time to analyse potential impact. Suggest crossing fingers and pray.

Mentioned in SAL (#wikimedia-cloud) [2019-05-28T18:14:18Z] <bstorm_> T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

@Bstorm i'm super busy with work and no time to analyse potential impact. Suggest crossing fingers and pray.

I'd mostly expect anything running directly out of NFS to require a restart. It seems most of the NFS for maps is dumps and more static content outside of what's in /home? If you aren't logged in, and you don't run much out of /data/project/maps directly, you'll probably not notice :)

In most cases I'd try to give at least a week's notice, but I'm blocking other people's work quite a bit with this migration (and, honestly, your data will be far safer when this is complete).

Well, I think I might try to merge and switch over the mounts for maps tomorrow at 1800 UTC. Things went fairly smoothly for the scratch migration.

The one thing that is giving me pause on that idea is that there is a significant delta in /data/project/tiles/bw-mapnik. That may be a service running directly on the NFS? If so, that'll have some issues when I do the switch.

I can stop apache on the maps-tiles1 server during the maintenance, if that is helpful/if that's what is running in the project folder. @TheDJ

Hmm. Just confirmed that is true. Apache should be stopped there before the switch and then started after. I can take care of doing that if the maps group isn't available to.

The switch itself should be very quick, and I'll be running puppet manually on all nodes together. I will also do a final rsync at the end. Some things might bite the dust when I unmount the old thing if they are holding the NFS open, that's all. Apache will at the very least require a restart on that server. maps-wma doesn't seem to use the NFS directly for apache.

I'll also try to be available to help with any remediation after this change. So far, the tile server is the main thing I'm able to identify as an issue. I apologize for moving so fast now that the data is transferring fully. That old server is a huge risk.

Chippyy added a comment (edited). May 29 2019, 11:01 AM

maps-warper3 which runs https://warper.wmflabs.org/ uses /mnt/nfs/labstore1003-maps/project/warper/uploads/ and /mnt/nfs/labstore1003-maps/home/warperdata/ (but referred to via /home and /data/project in the application) for storing a fair bit of data too. I'll turn off the webserver during the move, thanks.

Status update on maps: the delta between the last sync and the one I started yesterday was too great to finish in time.
Obviously, at a certain point, cutting over and then catching up with another sync is the right idea, but a large delta doesn't help.
How about we aim for Friday, May 31 @ 1800 UTC?

@Chippyy will that work for you? I can do my best to handle the service on the tiles server.

Heh. The sync just completed. I restarted it only to find it has another large delta. I think this just needs to be cut over, and then final sync done with no throttling. I can wait for Friday to make sure folks are at least prepared, though, since I had to miss the time for today.

I am going to begin the switch over. I'll check that apache is off for the tiles and warper instances once the patch has started the trickle-down. Worst comes to worst, a reboot of the instances will be needed to ensure services are restored, and I can help with that as well.

Change 509470 merged by Bstorm:
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

slippymap seems to be working (something I know how to check at least). @Chippyy Could you check on warper3? I'm not sure how to verify that I got it right.

Change 513666 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: scratch is a rw mount

https://gerrit.wikimedia.org/r/513666

Change 513666 merged by Bstorm:
[operations/puppet@production] cloudstore: scratch is a rw mount

https://gerrit.wikimedia.org/r/513666

Bstorm closed this task as Resolved. May 31 2019, 6:42 PM

Ok. At this point, anything still using the old mount on maps will most likely see a "stale NFS file handle" error (and basically needs a restart). Apache is not among those in my testing, but maps folks may find their home directory is stale in long-running SSH sessions, screens, etc.
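For anyone scripting cleanup after the cutover: the "stale NFS file handle" condition surfaces as `ESTALE` from system calls, so it can be detected programmatically. A minimal, hedged sketch (a hypothetical helper, not part of the migration tooling):

```python
import errno
import os


def is_stale(path):
    """Return True if `path` hits a stale NFS file handle.

    After the server behind an NFS mount changes, processes still
    holding the old mount get ESTALE on access; the usual fix is to
    restart the process (or re-login) so paths are reopened on the
    new mount.
    """
    try:
        os.stat(path)
        return False
    except OSError as e:
        return e.errno == errno.ESTALE
```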

I have found an issue with the failover mechanism on the new server that may require service restarts or maps server reboots when the NFS is failed over, so I'm going to make a subtask to deal with that in the future; I would rather have a more seamless failover.

Please reach out to me, or create a task and assign it to me, if there are any further issues not resolved by simply restarting something or logging in again.

@Bstorm everything looks fine on maps-warper3, thanks, and warper.wmflabs.org is running OK.