
Set up scratch and maps NFS services on cloudstore1008/9
Closed, Resolved (Public)

Description

Now that these stretch cloudstore servers are imaged and racked, we need to construct a puppetization that provides replication (DRBD is not needed, just periodic data replication), some sense of failover, and NFS services.

The current provider of these services is labstore1003, which is to be decommissioned as soon as this setup is live for those purposes.

Note that part of the existing storage on these new boxes is intended to enable a future sync service for data backup, etc., which will not be on NFS.
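The replication piece described above (periodic rsync rather than DRBD) was eventually delivered as a puppetized sync script. As a rough illustration of the approach only, here is a minimal Python sketch; the paths, host name, and throttle value are hypothetical, not the actual production script:

```python
import subprocess


def build_rsync_cmd(src, dest_host, dest_path, bwlimit_kbps=40000):
    """Build the rsync command for one periodic replication pass.

    The bandwidth cap throttles the transfer so the sync does not
    starve live NFS traffic on the source server.
    """
    return [
        "/usr/bin/rsync", "-a", "--delete",
        f"--bwlimit={bwlimit_kbps}",
        src + "/",                       # trailing slash: copy contents
        f"{dest_host}:{dest_path}/",
    ]


def replicate(src, dest_host, dest_path, bwlimit_kbps=40000):
    """Run a single sync pass; intended to be driven by a systemd timer."""
    cmd = build_rsync_cmd(src, dest_host, dest_path, bwlimit_kbps)
    return subprocess.run(cmd, check=True)
```

The real setup also coordinated with failover state so only the active server exported NFS while syncing to the standby.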

Details

Related Gerrit Patches:
operations/puppet@production: cloudstore: scratch is a rw mount
operations/puppet@production: cloudstore: switch maps mounts from labstore1003 to cloudstore1008
operations/puppet@production: cloudstore: switch scratch mounts from labstore1003 to cloudstore1008
operations/puppet@production: cloudstore: increase sync speed for data migration
operations/puppet@production: labstore: treat maps mounts differently for efficiency
operations/puppet@production: rsync: equal sign was removed from quickdatacopy bwlimit
operations/puppet@production: labstore: fix a couple errors and tidy mounts for migration
operations/puppet@production: labstore: shut down old rsync and export materials for migration
operations/puppet@production: cloudstore: start syncing data off labstore1003
operations/puppet@production: rsync: add a bwlimit option for quickdatacopy
operations/puppet@production: cloudstore: fix one more mistake in syncserver
operations/puppet@production: cloudstore: add to role for the syncing
operations/puppet@production: cloudstore: add to the script for syncserver
operations/puppet@production: cloudstore: touch up the script a bit from testing
operations/puppet@production: cloudstore: correct python syntax
operations/puppet@production: cloudstore: the cluster ip must be passed through
operations/puppet@production: cloudstore: cleanup extraneous bits
operations/puppet@production: cloudstore: finish up the script for sync
operations/puppet@production: cloudstore: change direction a bit on the rsync methods
operations/puppet@production: cloudstore: edit ferm rules a bit more
operations/puppet@production: cloudstore: correct problems in ferm rules
operations/puppet@production: cloudstore: add ferm rules for rsync on the scratch/maps cluster
operations/puppet@production: cloudstore: introduce rsync framework for secondary cluster
operations/puppet@production: cloudstore: test failover of cloudstore1008 to cloudstore1009
operations/puppet@production: cloudstore: add ping check for ip conflict
operations/puppet@production: cloudstore: fix the interface name and add a comment
operations/puppet@production: cloudstore: fail over ip address via hiera for scratch/maps cloudstore
operations/puppet@production: cloudstore: in stretch, the location of default nsswitch is different
operations/puppet@production: cloudstore: refactor nfsclient role into profile
operations/puppet@production: cloudstore: add python3 clientpackages for all
operations/puppet@production: cloudstore: change version to newton for cloudstore1008/9
operations/puppet@production: cloudstore: fix up some params around the rsync jobs
operations/puppet@production: cloudstore: deploy maps/scratch cluster as nfs::secondary
operations/puppet@production: labstore: refactor the backup roles so they will match the main roles
operations/puppet@production: labstore: a touch more cleanup of the secondary modules
operations/puppet@production: cloudstore: A bit more cleanup
operations/puppet@production: labstore: Adapt nfs-exportd to be used on more than one cluster
operations/puppet@production: labstore: Adapt nfs-exportd to be used on more than one cluster
operations/puppet@production: cloudstore: add extension and get nfs-manage-binds passing linter
operations/puppet@production: labstore: cleanup the remaining files after Icc89332f0e779
operations/puppet@production: labstore: fix mistake in maintain_dbusers service
operations/puppet@production: cloudstore: start refactor for role switch up around the labstores
operations/puppet@production: cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere
operations/puppet@production: wmcs::nfs::misc - Backup for misc server (cloudstore1008)
operations/puppet@production: wmcs::nfs::misc - Fix ferm rule
operations/puppet@production: wmcs::nfs::misc - Allow incoming SSH from backup server
operations/puppet@production: wmcs::nfs - Refactor snapshot_manager
operations/puppet@production: wmcs::nfs::misc - Fix roles
operations/puppet@production: block_sync - Adjust SSH private key filename
operations/puppet@production: wmcs::nfs::misc - Fix sshd config
operations/puppet@production: wmcs::nfs::misc - Fixes and backup role
operations/puppet@production: cloudstore1008/9: reimage with buster
operations/puppet@production: wmcs::nfs::misc - Disable notifications temporarily
operations/puppet@production: labstore - Allow multiple bdsync jobs per host
labs/private@master: labstore: Add id_cloudstore
operations/puppet@production: labstore::device_backup - Expose systemd OnCalendar syntax
operations/puppet@production: wmcs::nfs::misc - Remove unused /srv/* exports
operations/debs/nfsd-ldap@master: Rebuild for Stretch and add .gitreview
operations/puppet@production: wmcs::nfs::misc - Add nfsd-ldap package back
operations/puppet@production: Revert "wmcs::nfs::misc - Remove wmcs-root from admin groups"
operations/puppet@production: wmcs::nfs::misc - Remove wmcs-root from admin groups
operations/puppet@production: wmcs::nfs::misc - Second attempt to fix nsswitch.conf
operations/puppet@production: wmcs::nfs::misc - Configure nsswitch.conf
operations/puppet@production: wmcs::nfs::misc - Fix typo and nsswitch.conf file
operations/puppet@production: wmcs::nfs::misc - Refactor into profile/role
operations/puppet@production: Set spare role for cloustore1008/1009


Event Timeline


Change 503123 merged by Bstorm:
[operations/puppet@production] labstore: refactor the backup roles so they will match the main roles

https://gerrit.wikimedia.org/r/503123

Change 502342 merged by Bstorm:
[operations/puppet@production] cloudstore: deploy maps/scratch cluster as nfs::secondary

https://gerrit.wikimedia.org/r/502342

Bstorm removed a subscriber: aborrero. Apr 19 2019, 7:38 PM

Change 505325 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix up some params around the rsync jobs

https://gerrit.wikimedia.org/r/505325

Change 505325 merged by Bstorm:
[operations/puppet@production] cloudstore: fix up some params around the rsync jobs

https://gerrit.wikimedia.org/r/505325

Change 505333 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505333 merged by Bstorm:
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505339 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: try setting the openstack version differently

https://gerrit.wikimedia.org/r/505339

Change 505339 merged by Bstorm:
[operations/puppet@production] cloudstore: add python3 clientpackages for all

https://gerrit.wikimedia.org/r/505339

Change 506319 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

For the hard mounts, going to use 208.80.155.119/2620:0:861:4:208:80:155:119/nfs-maps.wikimedia.org as a floating IP.
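Because the hard mounts will reference the service by a name that follows this floating IP rather than a specific server, it can be useful to sanity-check that the name resolves to the expected floating address before a migration. A hedged sketch of such a check (a hypothetical helper, and DNS answers can of course change):

```python
import socket


def resolves_to(name, expected_ips):
    """Return True if `name` currently resolves to one of `expected_ips`.

    Hard NFS mounts pin the service name, so the name must track the
    floating IP; this is a pre-migration sanity check, not a monitor.
    """
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(name, None)}
    except socket.gaierror:
        return False
    return bool(addrs & set(expected_ips))
```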

Change 506319 merged by Bstorm:
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

Change 506472 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506472 merged by Bstorm:
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506714 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

This is nearing completion in terms of the functional components. I am sorry to say that travel next week is going to stall the data migration and cutover piece.

To give an accurate picture of the status: the rsync jobs are not quite ready yet either.

Change 506714 merged by Bstorm:
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

Change 506721 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506721 merged by Bstorm:
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506738 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506751 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751

Change 506751 merged by Bstorm:
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751
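The ping check in this change guards against IP conflicts during failover: before the standby brings up the floating service IP, it verifies that nothing else still answers at that address. A minimal sketch of the idea (not the actual production check; command path and timeout are assumptions):

```python
import subprocess


def ping_cmd(ip, timeout_s=2):
    """Command line for a single probe of `ip` with a hard deadline."""
    return ["/bin/ping", "-c", "1", "-w", str(timeout_s), ip]


def ip_is_free(ip, timeout_s=2):
    """Return True if nothing answers at `ip`.

    If the address still responds, the active server has not released
    it, and claiming it here would create a duplicate-IP conflict.
    """
    result = subprocess.run(
        ping_cmd(ip, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0  # ping fails => address is unclaimed
```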

Change 506738 merged by Bstorm:
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506847 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 506847 merged by Bstorm:
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 507094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507094 merged by Bstorm:
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507097 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507097 merged by Bstorm:
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507104 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507104 merged by Bstorm:
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507206 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507206 merged by Bstorm:
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507212 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507212 merged by Bstorm:
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507213 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507213 merged by Bstorm:
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507216 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507216 merged by Bstorm:
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507220 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507220 merged by Bstorm:
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507222 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507222 merged by Bstorm:
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507227 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507227 merged by Bstorm:
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507229 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507229 merged by Bstorm:
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507232 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Change 507232 merged by Bstorm:
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Ok. This all seems to work now. I'm prepared to set up a patch to change the client mounts and start sync jobs to migrate the data. That will wait until I get back, I imagine.

Change 509458 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 509469 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

Change 509470 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

Change 509458 merged by Dzahn:
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 510185 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

@aude @Awjrichards @Chippyy @cmarqu @coren @dschwen @jeremyb @Kolossos @MaxSem @Multichill @Nosy @TheDJ -- Just a heads up that I'm looking to begin data migration now for maps /home and /project. To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

It's a lot of data, so if the copy starts thrashing performance too much, let me know and I can try to reduce speed or something.

Change 510185 merged by Bstorm:
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

Change 510259 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510259 merged by Bstorm:
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510262 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510262 merged by Bstorm:
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510264 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264

Change 510264 merged by Bstorm:
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264
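For context on the equal-sign fix above: rsync's bandwidth cap is the `--bwlimit` option, and when a templated command line drops the `=`, the flag and its value can end up fused into an unrecognized token. A small sketch of rendering the flag defensively (a hypothetical helper, not the actual puppet template):

```python
def bwlimit_flag(kbps):
    """Render rsync's bandwidth cap as a single '--bwlimit=<KBps>' token.

    Joining flag and value with '=' keeps the option intact even when
    the command line is assembled from template fragments, which is
    where a missing '=' can silently break the throttle.
    """
    if kbps is None:
        return []          # unlimited: omit the flag entirely
    return [f"--bwlimit={int(kbps)}"]
```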

TheDJ added a comment. May 15 2019, 7:08 PM

To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

NFS sessions, you mean?

@TheDJ ssh sessions and possibly processes that run out of home directories or the project directory on NFS. Because it has NFS home directories, you'd want to make sure you re-opened your home directory after the symlink to /home is changed to the new mounts.

Change 510761 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 510761 merged by Bstorm:
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 511445 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Change 511445 merged by Bstorm:
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Scratch has now had one successful sync. Setting the patch to review and finding a reasonable date for it. Theoretically, since scratch shouldn't have a lot of open filehandles, it shouldn't be too bad as long as everything is working right.

Scheduled scratch migration for 2019-05-28@1800 UTC

The maps share finished its first rsync this weekend. It can be scheduled for pretty much any time now, as long as I keep it syncing.

Maps folks: since tearing down this server is a big priority, would you be terribly upset if I tried to swap the mounts as soon as this Wednesday? It would mean:
https://phabricator.wikimedia.org/T209527#5187505

Sorry, it's early: Wednesday is quite soon to even get an answer, but as soon as possible would be my preference. The new servers are more stable, redundant, and faster in every way. More importantly, labstore1003 needs to be turned off.

@Bstorm I run the maps-warper3 instance.

Okay with me for Wednesday, can't speak for others. Thank you for keeping us up to date with the progress and letting us know when the change is due to happen!
I'll be sure to have a look once the changes are complete and see if everything is working or not (and report back if things are not!)

Change 509469 merged by Bstorm:
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

TheDJ added a comment. May 28 2019, 6:04 PM

@Bstorm i'm super busy with work and no time to analyse potential impact. Suggest crossing fingers and pray.

Mentioned in SAL (#wikimedia-cloud) [2019-05-28T18:14:18Z] <bstorm_> T209527 switched mounts from labstore1003 to cloudstore1008 for scratch

@Bstorm i'm super busy with work and no time to analyse potential impact. Suggest crossing fingers and pray.

I'd mostly expect anything running directly out of NFS to require a restart. It seems most of the NFS for maps is dumps and more static content outside of what's in /home? If you aren't logged in, and you don't run much out of /data/project/maps directly, you'll probably not notice :)

In most cases I'd try to give at least a week's notice, but I'm blocking other people's work quite a bit with this migration (and, honestly, your data will be far safer when this is complete).

Well, I think I might try to merge and switch over the mounts for maps tomorrow at 1800 UTC. Things went fairly smoothly for the scratch migration.

The one thing that is giving me pause on that idea is that there is a significant delta in /data/project/tiles/bw-mapnik. That may be a service running directly on the NFS? If so, that'll have some issues when I do the switch.

I can stop apache on the maps-tiles1 server during the maintenance, if that is helpful/if that's what is running in the project folder. @TheDJ

Hmm. Just confirmed that is true. Apache should be stopped there before the switch and then started after. I can take care of doing that if the maps group isn't available to.

The switch itself should be very quick, and I'll be running puppet manually on all nodes together. I will also do a final rsync at the end. Some things might bite the dust when I unmount the old thing if they are holding the NFS open, that's all. Apache will at the very least require a restart on that server. maps-wma doesn't seem to use the NFS directly for apache.

I'll also try to be available to help with any remediation after this change. So far, the tile server is the main thing I'm able to identify as an issue. I apologize for moving so fast now that the data is transferring fully. That old server is a huge risk.

Chippyy added a comment (edited). May 29 2019, 11:01 AM

maps-warper3 which runs https://warper.wmflabs.org/ uses /mnt/nfs/labstore1003-maps/project/warper/uploads/ and /mnt/nfs/labstore1003-maps/home/warperdata/ (but referred to via /home and /data/project in the application) for storing a fair bit of data too. I'll turn off the webserver during the move, thanks.

Status update on maps: the delta between the last sync and the one I started yesterday was too great to finish in time.
Obviously, at a certain point, cutting over and then catching up with another sync is the right idea, but a large delta doesn't help.
How about we aim for Friday, May 31 @ 1800 UTC?

@Chippyy will that work for you? I can do my best to handle the service on the tiles server.

Heh. The sync just completed. I restarted it only to find it has another large delta. I think this just needs to be cut over, and then final sync done with no throttling. I can wait for Friday to make sure folks are at least prepared, though, since I had to miss the time for today.

I am going to begin the switch over. I'll check that apache is off for the tiles and warper instances once the patch has started the trickle-down. Worst comes to worst, a reboot of the instances will be needed to ensure services are restored, and I can help with that as well.

Change 509470 merged by Bstorm:
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

slippymap seems to be working (something I know how to check at least). @Chippyy Could you check on warper3? I'm not sure how to verify that I got it right.

Change 513666 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: scratch is a rw mount

https://gerrit.wikimedia.org/r/513666

Change 513666 merged by Bstorm:
[operations/puppet@production] cloudstore: scratch is a rw mount

https://gerrit.wikimedia.org/r/513666

Bstorm closed this task as Resolved. May 31 2019, 6:42 PM

Ok. At this point, anything still using the old mount on maps will most likely see a "stale NFS file handle" error (and basically needs a restart). Apache is not among those in my testing, but maps folks may find their home directory is stale in long-running SSH sessions, screens, etc.
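For anyone scripting cleanup after the cutover: the "stale NFS file handle" condition surfaces as `ESTALE` from system calls, so it can be detected programmatically. A minimal, hedged sketch (a hypothetical helper, not part of the migration tooling):

```python
import errno
import os


def is_stale(path):
    """Return True if `path` hits a stale NFS file handle.

    After the server behind an NFS mount changes, processes still
    holding the old mount get ESTALE on access; the usual fix is to
    restart the process (or re-login) so paths are reopened on the
    new mount.
    """
    try:
        os.stat(path)
        return False
    except OSError as e:
        return e.errno == errno.ESTALE
```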

I have found an issue with the failover mechanism on the new server that may require service restarts or maps server reboots when the NFS is failed over, so I'm going to make a subtask to deal with that in the future; I would rather have a more seamless failover.

Please reach out to me, or create a task and assign it to me, if there are any further issues not resolved by simply restarting something or logging in again.

@Bstorm everything looks fine on maps-warper3, thanks, and warper.wmflabs.org is running OK.