Set up scratch and maps NFS services on cloudstore1008/9
Open, Normal, Public

Description

Now that these stretch cloudstore servers are imaged and racked, we need to construct a puppetization that involves replication (DRBD is not needed, just periodic data replication), some sense of failover, and NFS services.

The current provider of these services is labstore1003, which is to be decommissioned as soon as this setup is live for those purposes.

Note that part of the existing storage on these new boxes is reserved for a future sync service for data backup, etc., that will not be on NFS.

Event Timeline

It might be that the only confusion was backup jobs anyway. We can talk more when you are back in general. Don't worry too much. I hope the brain-dump above helps, though.

Change 485375 abandoned by GTirloni:
wmcs::nfs::misc - Backup for misc server (cloudstore1008)

Reason:
Misunderstanding of project requirements in T209527. I'll have to start from scratch with a new solution.

https://gerrit.wikimedia.org/r/485375

Bstorm claimed this task. Apr 1 2019, 11:51 PM
Bstorm removed a subscriber: GTirloni.

Change 500635 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere

https://gerrit.wikimedia.org/r/500635

Change 500635 merged by Bstorm:
[operations/puppet@production] cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere

https://gerrit.wikimedia.org/r/500635

Change 500801 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: start refactor for role switch up around the labstores

https://gerrit.wikimedia.org/r/500801

Change 500801 merged by Bstorm:
[operations/puppet@production] cloudstore: start refactor for role switch up around the labstores

https://gerrit.wikimedia.org/r/500801

Change 501066 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix mistake in maintain_dbusers service

https://gerrit.wikimedia.org/r/501066

Change 501066 merged by Bstorm:
[operations/puppet@production] labstore: fix mistake in maintain_dbusers service

https://gerrit.wikimedia.org/r/501066

Change 501070 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: cleanup the remaining files after Icc89332f0e779

https://gerrit.wikimedia.org/r/501070

Change 501070 merged by Bstorm:
[operations/puppet@production] labstore: cleanup the remaining files after Icc89332f0e779

https://gerrit.wikimedia.org/r/501070

Change 501434 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add extension and get nfs-manage-binds passing linter

https://gerrit.wikimedia.org/r/501434

Change 501434 merged by Bstorm:
[operations/puppet@production] cloudstore: add extension and get nfs-manage-binds passing linter

https://gerrit.wikimedia.org/r/501434

Change 501446 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: A bit more cleanup

https://gerrit.wikimedia.org/r/501446

Change 501451 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: Adapt nfs-exportd to be used on more than one cluster

https://gerrit.wikimedia.org/r/501451

Change 501451 merged by Bstorm:
[operations/puppet@production] labstore: Adapt nfs-exportd to be used on more than one cluster

https://gerrit.wikimedia.org/r/501451

Change 501694 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: Adapt nfs-exportd to be used on more than one cluster

https://gerrit.wikimedia.org/r/501694

Change 501694 merged by Bstorm:
[operations/puppet@production] labstore: Adapt nfs-exportd to be used on more than one cluster

https://gerrit.wikimedia.org/r/501694

Change 501446 merged by Bstorm:
[operations/puppet@production] cloudstore: A bit more cleanup

https://gerrit.wikimedia.org/r/501446

Change 502342 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: deploy maps/scratch cluster as nfs::secondary

https://gerrit.wikimedia.org/r/502342

Change 502344 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: a touch more cleanup of the secondary modules

https://gerrit.wikimedia.org/r/502344

Change 502344 merged by Bstorm:
[operations/puppet@production] labstore: a touch more cleanup of the secondary modules

https://gerrit.wikimedia.org/r/502344

Change 503123 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: refactor the backup roles so they will match the main roles

https://gerrit.wikimedia.org/r/503123

Change 503123 merged by Bstorm:
[operations/puppet@production] labstore: refactor the backup roles so they will match the main roles

https://gerrit.wikimedia.org/r/503123

Change 502342 merged by Bstorm:
[operations/puppet@production] cloudstore: deploy maps/scratch cluster as nfs::secondary

https://gerrit.wikimedia.org/r/502342

Bstorm removed a subscriber: aborrero. Apr 19 2019, 7:38 PM

Change 505325 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix up some params around the rsync jobs

https://gerrit.wikimedia.org/r/505325

Change 505325 merged by Bstorm:
[operations/puppet@production] cloudstore: fix up some params around the rsync jobs

https://gerrit.wikimedia.org/r/505325

Change 505333 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505333 merged by Bstorm:
[operations/puppet@production] cloudstore: change version to newton for cloudstore1008/9

https://gerrit.wikimedia.org/r/505333

Change 505339 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: try setting the openstack version differently

https://gerrit.wikimedia.org/r/505339

Change 505339 merged by Bstorm:
[operations/puppet@production] cloudstore: add python3 clientpackages for all

https://gerrit.wikimedia.org/r/505339

Change 506319 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

For the hard mounts, I am going to use 208.80.155.119 (IPv6: 2620:0:861:4:208:80:155:119, hostname: nfs-maps.wikimedia.org) as a floating IP.
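A later change on this task adds a ping check for IP conflicts before the floating address is used. The following is only a minimal sketch of that idea, not the Puppet code from the patch: the function names and the injectable `run` parameter are hypothetical, and it assumes a Linux `ping` that accepts `-c` (count) and `-W` (timeout in seconds).

```python
import subprocess


def address_answers_ping(address, run=subprocess.run):
    """Return True if a single ICMP echo request to `address` gets a reply.

    `run` is injectable (hypothetical design choice) so the check can be
    exercised without touching the network.
    """
    result = run(
        ["ping", "-c", "1", "-W", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def safe_to_claim(address, run=subprocess.run):
    """Treat the floating IP as claimable only if nothing answers on it."""
    return not address_answers_ping(address, run=run)
```

A host about to take over the service address could call `safe_to_claim("208.80.155.119")` and refuse to configure the address if it returns False. A silent host is not proof that the address is free (ICMP may be filtered), so this is a guard against the obvious conflict rather than a guarantee.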

Change 506319 merged by Bstorm:
[operations/puppet@production] cloudstore: refactor nfsclient role into profile

https://gerrit.wikimedia.org/r/506319

Change 506472 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506472 merged by Bstorm:
[operations/puppet@production] cloudstore: in stretch, the location of default nsswitch is different

https://gerrit.wikimedia.org/r/506472

Change 506714 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

This is nearing completion in terms of the functional components. I am sorry to say that travel next week is going to stall the data migration and cutover piece.

To be accurate about the status: the rsync jobs are not ready yet either.

Change 506714 merged by Bstorm:
[operations/puppet@production] cloudstore: fail over ip address via hiera for scratch/maps cloudstore

https://gerrit.wikimedia.org/r/506714

Change 506721 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506721 merged by Bstorm:
[operations/puppet@production] cloudstore: fix the interface name and add a comment

https://gerrit.wikimedia.org/r/506721

Change 506738 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506751 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751

Change 506751 merged by Bstorm:
[operations/puppet@production] cloudstore: add ping check for ip conflict

https://gerrit.wikimedia.org/r/506751

Change 506738 merged by Bstorm:
[operations/puppet@production] cloudstore: test failover of cloudstore1008 to cloudstore1009

https://gerrit.wikimedia.org/r/506738

Change 506847 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 506847 merged by Bstorm:
[operations/puppet@production] cloudstore: introduce rsync framework for secondary cluster

https://gerrit.wikimedia.org/r/506847

Change 507094 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507094 merged by Bstorm:
[operations/puppet@production] cloudstore: add ferm rules for rsync on the scratch/maps cluster

https://gerrit.wikimedia.org/r/507094

Change 507097 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507097 merged by Bstorm:
[operations/puppet@production] cloudstore: correct problems in ferm rules

https://gerrit.wikimedia.org/r/507097

Change 507104 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507104 merged by Bstorm:
[operations/puppet@production] cloudstore: edit ferm rules a bit more

https://gerrit.wikimedia.org/r/507104

Change 507206 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507206 merged by Bstorm:
[operations/puppet@production] cloudstore: change direction a bit on the rsync methods

https://gerrit.wikimedia.org/r/507206

Change 507212 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507212 merged by Bstorm:
[operations/puppet@production] cloudstore: finish up the script for sync

https://gerrit.wikimedia.org/r/507212

Change 507213 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507213 merged by Bstorm:
[operations/puppet@production] cloudstore: cleanup extraneous bits

https://gerrit.wikimedia.org/r/507213

Change 507216 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507216 merged by Bstorm:
[operations/puppet@production] cloudstore: the cluster ip must be passed through

https://gerrit.wikimedia.org/r/507216

Change 507220 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507220 merged by Bstorm:
[operations/puppet@production] cloudstore: correct python syntax

https://gerrit.wikimedia.org/r/507220

Change 507222 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507222 merged by Bstorm:
[operations/puppet@production] cloudstore: touch up the script a bit from testing

https://gerrit.wikimedia.org/r/507222

Change 507227 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507227 merged by Bstorm:
[operations/puppet@production] cloudstore: add to the script for syncserver

https://gerrit.wikimedia.org/r/507227

Change 507229 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507229 merged by Bstorm:
[operations/puppet@production] cloudstore: add to role for the syncing

https://gerrit.wikimedia.org/r/507229

Change 507232 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Change 507232 merged by Bstorm:
[operations/puppet@production] cloudstore: fix one more mistake in syncserver

https://gerrit.wikimedia.org/r/507232

Ok. This all seems to work now. I'm prepared to set up a patch to change the client mounts and start sync jobs to migrate the data. That will wait until I get back, I imagine.

Change 509458 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 509469 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch scratch mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509469

Change 509470 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: switch maps mounts from labstore1003 to cloudstore1008

https://gerrit.wikimedia.org/r/509470

Change 509458 merged by Dzahn:
[operations/puppet@production] rsync: add a bwlimit option for quickdatacopy

https://gerrit.wikimedia.org/r/509458

Change 510185 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

@aude @Awjrichards @Chippyy @cmarqu @coren @dschwen @jeremyb @Kolossos @MaxSem @Multichill @Nosy @TheDJ -- Just a heads up that I'm looking to begin data migration now for maps /home and /project. To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled.

It's a lot of data, so if the copy starts thrashing performance too much, let me know and I can try to reduce speed or something.
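Reducing the copy speed comes down to rsync's `--bwlimit` option, which the quickdatacopy changes on this task expose. As a rough illustration only, and not the actual quickdatacopy code, the sketch below builds an rsync command line with an optional limit. The function name and the example paths are hypothetical; `--archive` and `--bwlimit=RATE` are standard rsync options, with RATE in rsync's default unit (roughly kilobytes per second).

```python
def build_rsync_command(source, dest, bwlimit=None):
    """Return an rsync argument list, optionally throttled with --bwlimit.

    `bwlimit` is passed through in rsync's default unit. The option and its
    value are joined with an equals sign into a single argument.
    """
    cmd = ["rsync", "--archive"]
    if bwlimit is not None:
        # rsync treats 0 as "no limit"; this sketch rejects it so that a
        # limit is either clearly set or clearly absent.
        if bwlimit <= 0:
            raise ValueError("bwlimit must be a positive number")
        cmd.append(f"--bwlimit={bwlimit}")
    cmd.extend([source, dest])
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(" ".join(build_rsync_command("/srv/old/scratch/", "/srv/new/scratch/", bwlimit=40000)))
```

Keeping the limit as an optional parameter means the same job definition can run unthrottled on a quiet system and be slowed down later by changing one value.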

Change 510185 merged by Bstorm:
[operations/puppet@production] cloudstore: start syncing data off labstore1003

https://gerrit.wikimedia.org/r/510185

Change 510259 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510259 merged by Bstorm:
[operations/puppet@production] labstore: shut down old rsync and export materials for migration

https://gerrit.wikimedia.org/r/510259

Change 510262 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510262 merged by Bstorm:
[operations/puppet@production] labstore: fix a couple errors and tidy mounts for migration

https://gerrit.wikimedia.org/r/510262

Change 510264 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264

Change 510264 merged by Bstorm:
[operations/puppet@production] rsync: equal sign was removed from quickdatacopy bwlimit

https://gerrit.wikimedia.org/r/510264

TheDJ added a comment. Wed, May 15, 7:08 PM
"To do the final cutover, things logged into the maps project servers will probably need to close sessions, so I wanted to be in touch for that step as well since that needs to be scheduled."

NFS sessions, you mean?

@TheDJ SSH sessions, and possibly processes that run out of home directories or the project directory on NFS. Because the maps project has NFS home directories, you'd want to make sure you re-open your home directory after the /home symlink is changed to point at the new mounts.

Change 510761 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 510761 merged by Bstorm:
[operations/puppet@production] labstore: treat maps mounts differently for efficiency

https://gerrit.wikimedia.org/r/510761

Change 511445 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Change 511445 merged by Bstorm:
[operations/puppet@production] cloudstore: increase sync speed for data migration

https://gerrit.wikimedia.org/r/511445

Scratch has now had one successful sync. Setting the patch to review and finding a reasonable date for it. Theoretically, since scratch shouldn't have a lot of open filehandles, it shouldn't be too bad as long as everything is working right.

Scheduled scratch migration for 2019-05-28 at 18:00 UTC.