Set up scratch and maps NFS services on cloudstore1008/9
Closed, Resolved · Public

Description

Now that these stretch cloudstore servers are imaged and racked, we need to build puppetization that provides replication (not DRBD, just periodic data replication), some form of failover, and NFS services.

The current provider of these services is labstore1003, which is to be decommissioned as soon as this setup is live for those purposes.

Note that part of the existing storage on these new boxes is meant to enable a future sync service (for data backup, etc.) that will not be on NFS.
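
To make the replication idea concrete: a minimal Puppet sketch of periodic file-level replication (instead of block-level DRBD) might look like the following. The class name, host, path, and schedule are all illustrative assumptions, not the actual manifest.

```puppet
# Minimal sketch only: periodic file-level replication from the active
# cloudstore host to its standby, in place of block-level DRBD.
# Every name here (class, host, path, schedule) is a placeholder.
class profile::cloudstore::replication (
  String $standby_host = 'cloudstore1009.wikimedia.org',
  String $data_path    = '/srv/scratch',
) {
  cron { 'replicate-scratch-to-standby':
    ensure  => present,
    user    => 'root',
    minute  => 0,
    hour    => '*/6',
    command => "/usr/bin/rsync -a --delete ${data_path}/ rsync://${standby_host}/scratch/",
  }
}
```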

Details

Repo | Branch | Lines +/-
operations/puppet | production | +1 -1
operations/puppet | production | +8 -9
operations/puppet | production | +19 -39
operations/puppet | production | +2 -2
operations/puppet | production | +7 -1
operations/puppet | production | +1 -1
operations/puppet | production | +18 -18
operations/puppet | production | +22 -23
operations/puppet | production | +16 -1
operations/puppet | production | +4 -1
operations/puppet | production | +0 -1
operations/puppet | production | +1 -0
operations/puppet | production | +17 -3
operations/puppet | production | +3 -1
operations/puppet | production | +1 -1
operations/puppet | production | +3 -0
operations/puppet | production | +0 -2
operations/puppet | production | +12 -11
operations/puppet | production | +2 -57
operations/puppet | production | +2 -1
operations/puppet | production | +3 -9
operations/puppet | production | +21 -1
operations/puppet | production | +62 -11
operations/puppet | production | +1 -1
operations/puppet | production | +4 -1
operations/puppet | production | +2 -1
operations/puppet | production | +27 -4
operations/puppet | production | +1 -1
operations/puppet | production | +42 -7
operations/puppet | production | +14 -1
operations/puppet | production | +1 -1
operations/puppet | production | +10 -4
operations/puppet | production | +355 -632
operations/puppet | production | +45 -31
operations/puppet | production | +1 -1
operations/puppet | production | +3 -15
operations/puppet | production | +20 -12
operations/puppet | production | +9 -5
operations/puppet | production | +19 -18
operations/puppet | production | +0 -972
operations/puppet | production | +3 -4
operations/puppet | production | +988 -1
operations/puppet | production | +3 -6
operations/puppet | production | +82 -2
operations/puppet | production | +1 -1
operations/puppet | production | +10 -0
operations/puppet | production | +29 -13
operations/puppet | production | +6 -0
operations/puppet | production | +3 -3
operations/puppet | production | +1 -1
operations/puppet | production | +272 -11
operations/puppet | production | +4 -0
operations/puppet | production | +2 -0
operations/puppet | production | +5 -5
labs/private | master | +0 -0
operations/puppet | production | +12 -16
operations/puppet | production | +0 -2
operations/debs/nfsd-ldap | master | +11 -0
operations/puppet | production | +1 -0
operations/puppet | production | +3 -0
operations/puppet | production | +0 -3
operations/puppet | production | +9 -1
operations/puppet | production | +21 -0
operations/puppet | production | +1 -3
operations/puppet | production | +544 -1
operations/puppet | production | +5 -0

Event Timeline

Change 505333 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505333 merged by Bstorm:
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505339 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: try setting the openstack version differently

https://gerrit.wikimedia.org/r/505339

Change 505339 merged by Bstorm:
[operations/puppet@production] cloudstore: add python3 clientpackages for all

https://gerrit.wikimedia.org/r/505339

Change 506319 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

For the hard mounts, I'm going to use 208.80.155.119 / 2620:0:861:4:208:80:155:119 (nfs-maps.wikimedia.org) as a floating IP.
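
For illustration, a client-side hard mount against that floating address could be expressed with Puppet's built-in mount type roughly as below; the mount point, export path, and option set are assumptions, not the real client profile.

```puppet
# Illustrative sketch: clients hard-mount via the floating service name,
# so the mount follows the address when it moves during a failover.
# Mount point, export path, and options are assumed for the example.
mount { '/mnt/nfs/maps':
  ensure  => mounted,
  device  => 'nfs-maps.wikimedia.org:/srv/maps',
  fstype  => 'nfs',
  options => 'vers=4,hard,noatime,sec=sys',
  atboot  => true,
}
```

Hard mounts block rather than error out when the server goes away, which is exactly why they need an address that survives failover.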

Change 506319 merged by Bstorm:
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

Change 506472 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506472 merged by Bstorm:
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506714 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

This is nearing completion in terms of the functional components. I am sorry to say that travel next week is going to stall the data migration and cutover piece.

To be accurate about status: the rsync jobs are not quite ready yet either.

Change 506714 merged by Bstorm:
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714
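
The rough shape of that change, as a sketch: the floating address comes out of hiera, so which host binds it is a data change rather than a code change. The hiera key, interface name, prefix length, and the use of the interface::ip define here are assumptions, not the merged manifest.

```puppet
# Hypothetical sketch of the 506714 idea: the cluster's floating IP is
# looked up from hiera, and only the host designated active binds it.
class profile::cloudstore::nfs (
  Stdlib::IP::Address $cluster_ip    = lookup('profile::cloudstore::cluster_ip'),
  String              $cluster_iface = 'eth0',
) {
  interface::ip { 'cloudstore-cluster-ip':
    interface => $cluster_iface,
    address   => $cluster_ip,
    prefixlen => '27',
  }
}
```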

Change 506721 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506721 merged by Bstorm:
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506738 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506751 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751

Change 506751 merged by Bstorm:
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751
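
A plausible shape for that check, purely as a sketch (it reuses the hypothetical names from the failover sketch above): ping the floating IP before binding it, and fail the run if anything answers.

```puppet
# Sketch of the 506751 idea: if the floating IP already answers pings,
# another host still holds it, so abort instead of creating a conflict.
exec { 'cloudstore-cluster-ip-conflict-check':
  command => '/bin/false',
  onlyif  => "/bin/ping -c 1 -w 2 ${cluster_ip}",
  before  => Interface::Ip['cloudstore-cluster-ip'],
}
```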

Change 506738 merged by Bstorm:
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506847 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 506847 merged by Bstorm:
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 507094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507094 merged by Bstorm:
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094
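
Assuming the usual ferm::service define from operations/puppet, the rule is presumably something like this sketch, which opens rsyncd (tcp/873) only to the cluster peer; $peer_host is a placeholder for the other cloudstore node.

```puppet
# Minimal sketch: allow rsync replication traffic from the peer only,
# resolving both its IPv4 (A) and IPv6 (AAAA) addresses for the srange.
ferm::service { 'cloudstore-rsync':
  proto  => 'tcp',
  port   => '873',
  srange => "(@resolve((${peer_host})) @resolve((${peer_host}), AAAA))",
}
```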

Change 507097 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507097 merged by Bstorm:
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507104 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507104 merged by Bstorm:
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507206 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507206 merged by Bstorm:
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507212 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507212 merged by Bstorm:
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507213 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507213 merged by Bstorm:
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507216 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507216 merged by Bstorm:
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507220 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507220 merged by Bstorm:
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507222 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507222 merged by Bstorm:
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507227 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507227 merged by Bstorm:
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507229 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507229 merged by Bstorm:
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507232 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Change 507232 merged by Bstorm:
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Ok. This all seems to work now. I'm prepared to set up a patch to change the client mounts and start sync jobs to migrate the data. That will wait until I get back, I imagine.

Change 509458 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 509469 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

Change 509470 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

Change 509458 merged by Dzahn:
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458
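
For context, rsync::quickdatacopy is the operations/puppet helper used for these bulk copies. With the new parameter, a throttled copy off labstore1003 might be declared roughly like this; the hostnames, module path, limit value, and exact parameter set are assumptions for illustration.

```puppet
# Hedged sketch: a bandwidth-limited quickdatacopy for the migration.
# bwlimit feeds rsync's --bwlimit option, keeping the initial bulk copy
# from saturating labstore1003 while it is still serving NFS.
rsync::quickdatacopy { 'labstore1003-maps':
  source_host => 'labstore1003.eqiad.wmnet',
  dest_host   => 'cloudstore1008.wikimedia.org',
  module_path => '/srv/maps',
  auto_sync   => true,
  bwlimit     => 80000,
}
```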

Change 510185 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

@aude @Awjrichards @Chippyy @cmarqu @coren @dschwen @jeremyb @Kolossos @MaxSem @Multichill @Nosy @TheDJ -- Just a heads up that I'm looking to begin data migration now for maps /home and /project. To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

It's a lot of data, so if the copy starts thrashing performance too much, let me know and I can try to reduce speed or something.

Change 510185 merged by Bstorm:
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

Change 510259 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510259 merged by Bstorm:
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510262 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510262 merged by Bstorm:
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510264 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264

Change 510264 merged by Bstorm:
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264

> To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

NFS sessions, you mean?

@TheDJ ssh sessions and possibly processes that run out of home directories or the project directory on NFS. Because the project has NFS home directories, you'll want to make sure you re-open your home directory after the symlink to /home is switched to the new mounts.

Change 510761 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 510761 merged by Bstorm:
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 511445 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Change 511445 merged by Bstorm:
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Scratch has now had one successful sync. I'm setting the patch to review and finding a reasonable date for it. In theory, since scratch shouldn't have many open filehandles, the cutover shouldn't be too disruptive as long as everything is working right.

Scheduled scratch migration for 2019-05-28@1800 UTC

The maps share finished its first rsync this weekend. It can be scheduled pretty much any time now, as long as I keep it syncing.

Maps folks: since tearing down this server is a big priority, would you be terribly upset if I tried to swap the mounts as soon as this Wednesday? It would mean:
https://phabricator.wikimedia.org/T209527#5187505

Sorry, it's early: Wednesday is quite soon to even get an answer, but as soon as possible would be my preference. The new servers are more stable, redundant, and faster in every way. More importantly, labstore1003 needs to be turned off.

@Bstorm I run the maps-warper3 instance.

Okay with me for Wednesday; I can't speak for others. Thank you for keeping us up to date with the progress and letting us know when the change is due to happen!
I'll be sure to have a look once the changes are complete and see if everything is working or not (and report back if things are not!)

Change 509469 merged by Bstorm:
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

@Bstorm i'm super busy with work and no time to analyse potential impact. Suggest crossing fingers and pray.

Mentioned in SAL (#wikimedia-cloud) [2019-05-28T18:14:18Z] <bstorm_> T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

> @Bstorm i'm super busy with work and no time to analyse potential impact. Suggest crossing fingers and pray.

I'd mostly expect anything running directly out of NFS to require a restart. It seems most of the NFS for maps is dumps and more static content outside of what's in /home? If you aren't logged in, and you don't run much out of /data/project/maps directly, you'll probably not notice :)

In most cases I'd try to give at least a week's notice, but I'm blocking other people's work quite a bit with this migration (and, honestly, your data will be far safer when this is complete).

Well, I think I might try to merge and switch over the mounts for maps tomorrow at 1800 UTC. Things went fairly smoothly for the scratch migration.

The one thing giving me pause on that idea is the significant delta in /data/project/tiles/bw-mapnik. That may be a service running directly on the NFS? If so, it'll have some issues when I do the switch.

I can stop apache on the maps-tiles1 server during the maintenance, if that is helpful and if that's what is running in the project folder. @TheDJ

Hmm. Just confirmed that is true. Apache should be stopped there before the switch and started again after; I can take care of that if the maps group isn't available to. The switch itself should be very quick, and I'll be running puppet manually on all nodes together. I will also do a final rsync at the end. Some things might bite the dust when I unmount the old share if they are holding the NFS open, that's all. Apache will at the very least require a restart on that server; maps-wma doesn't seem to use the NFS directly for apache.

I'll also try to be available to help with any remediation after this change. So far, the tile server is the main thing I've identified as an issue. I apologize for moving so fast now that the data is transferring fully, but that old server is a huge risk.

maps-warper3, which runs https://warper.wmflabs.org/, uses /mnt/nfs/labstore1003-maps/project/warper/uploads/ and /mnt/nfs/labstore1003-maps/home/warperdata/ (referred to via /home and /data/project in the application) for storing a fair bit of data too. I'll turn off the webserver during the move, thanks.

Status update on maps: the delta between the last sync and the one I started yesterday was too great to finish by this time.
Obviously, at a certain point, cutting over and then catching up with another sync is the right idea, but a large delta doesn't help.
How about we aim for Friday, May 31 @ 1800 UTC?

@Chippyy will that work for you? I can do my best to handle the service on the tiles server.

Heh. The sync just completed. I restarted it, only to find it has another large delta. I think this just needs to be cut over, and then the final sync done with no throttling. I can wait for Friday to make sure folks are at least prepared, though, since I had to miss the time for today.

I am going to begin the switchover. I'll check that apache is off for the tiles and warper instances once the patch has started to trickle down. Worst comes to worst, a reboot of the instances will be needed to ensure services are restored, and I can help with that as well.

Change 509470 merged by Bstorm:
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

slippymap seems to be working (something I know how to check, at least). @Chippyy, could you check on warper3? I'm not sure how to verify that I got it right.

Change 513666 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: scratch is a rw mount

https://gerrit.wikimedia.org/r/513666

Change 513666 merged by Bstorm:
[operations/puppet@production] cloudstore: scratch is a rw mount

https://gerrit.wikimedia.org/r/513666

Ok. At this point, anything still using the old mount on maps will most likely see a "stale NFS filehandle" error (and basically needs a restart). Apache is not among those in my testing, but maps folks may find their home directory is stale in long-running ssh sessions, screens, etc.

I have found an issue with the failover mechanism on the new server that may require service restarts or maps server reboots when the NFS is failed over, so I think I'm going to make a subtask to deal with that in the future; I would rather have a more seamless failover.

Please reach out to me, or create a task and assign it to me, if there are any further issues not resolved by simply restarting something or logging in again.

@Bstorm everything looks fine on maps-warper3, thanks, and warper.wmflabs.org is running OK.