Maintain-dbusers maintains the rights of users to access several different databases. It also generates access credentials and stores them in users' home directories on NFS (in replica.my.cnf).
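As a rough illustration of the credential file involved, the sketch below renders a MySQL option file with a `[client]` section and writes it so that only the owner can read it. The function names, the example account name, and the 0400 permission mode are assumptions made for illustration; they are not taken from the maintain-dbusers source.

```python
import configparser
import io
import os
import stat
import tempfile


def render_replica_cnf(user: str, password: str) -> str:
    """Render a MySQL option file with a [client] section (assumed layout)."""
    # interpolation=None so a '%' in a generated password is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser["client"] = {"user": user, "password": password}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


def write_replica_cnf(home_dir: str, user: str, password: str) -> str:
    """Write replica.my.cnf into home_dir, readable only by its owner."""
    path = os.path.join(home_dir, "replica.my.cnf")
    # Create the file with mode 0400 from the start (assumed mode) so the
    # password is never briefly readable by other users. O_EXCL refuses to
    # overwrite an existing file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, stat.S_IRUSR)
    with os.fdopen(fd, "w") as handle:
        handle.write(render_replica_cnf(user, password))
    return path


if __name__ == "__main__":
    # Hypothetical account name and password, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as home:
        cnf_path = write_replica_cnf(home, "s12345", "example-password")
        with open(cnf_path) as handle:
            print(handle.read())
```

The real script additionally has to handle ownership on NFS and the immutable flag (chattr), which this sketch leaves out.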
Right now maintain-dbusers only runs on labstore1004. Soon it will need to run on several different cloud-vps-hosted NFS servers instead.
The script is currently installed by profile::wmcs::nfs::maintain_dbusers, which does two things that rely on running on wikiland/production hardware:
- Aggregate database uses from puppetdb with query_nodes
- include passwords::mysql::labsdb and passwords::labsdbaccounts
I can think of a few ways to move forward with this:
- Split out the part of this tool that writes replica.my.cnf into an API hosted on the nfs server; run the rest of maintain-dbusers on a different prod/wikiland host (e.g. a cloudcontrol host) and call out to nfs servers for the little nfs piece.
- Adapt the puppet setup so it can run on a VM: duplicate db creds into a project-local puppetmaster, and manually maintain a list of db servers in hiera.
- Run the existing maintain-dbusers script on a cloudcontrol; mount paws and tools nfs volumes on that cloudcontrol for replica.my.cnf maintenance; drop the chattr -i feature.
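To make the first option more concrete, here is a minimal sketch of how a centrally hosted maintain-dbusers could prepare its calls to an API on each NFS server. The endpoint paths and account types are the ones named in the rollout plan below; the base URL, the JSON field names, and the helper names are hypothetical, and nothing is actually sent over the network.

```python
import json
import urllib.request
from typing import List

ACCOUNT_TYPES = ("tool", "user", "paws")
ENDPOINTS = ("/fetch-replica-path", "/write-replica-cnf", "/read-replica-cnf")


def build_request(
    base_url: str, endpoint: str, account_type: str, account_id: str
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one endpoint."""
    # Hypothetical payload: the real API's field names may differ.
    payload = {"account_type": account_type, "account_id": account_id}
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_test_matrix(base_url: str, account_id: str) -> List[urllib.request.Request]:
    """One request per (account type, endpoint) pair, as in a manual test pass."""
    return [
        build_request(base_url, endpoint, account_type, account_id)
        for account_type in ACCOUNT_TYPES
        for endpoint in ENDPOINTS
    ]


if __name__ == "__main__":
    # Placeholder host; a real run would point at the NFS server's API.
    for req in build_test_matrix("https://nfs.example.invalid/v1", "example-account"):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to review the full test matrix before anything touches a real server.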
Rollout plan (Copied from comment T303663#8500586, with extra T303663#8621555 added)
Rollout plan for T304040: REST api service to manage toolforge replica.my.cnf:
- Deploy and test the REST API in a testing instance
- For account_type in [tool, user, paws]:
- POST /fetch-replica-path
- POST /write-replica-cnf
- POST /read-replica-cnf
- Deploy and test the REST API on labstore1004.eqiad.wmnet
- Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary
- Run puppet
- Run manual API tests
- Rebuild paws-nfs-1 VM with bullseye
- Create a new VM with wmcs.nfs.add_server, using project paws, without creating a new volume or a service IP
- Move the volume from the old nfs vm to the new
- Deploy and test the REST API on paws-nfs-2.paws.eqiad1.wikimedia.cloud (the new nfs server)
- Add ::profile::wmcs::services::toolsdb_replica_cnf via the instance's Horizon's puppet tab
- Run puppet
- Run manual API tests
- Prepare database grants and firewall rules beforehand for maintain-dbusers being in cloudcontrol1005.
- Grants on the accounts db (m5-master.eqiad.wmnet)
- Grants on the wikireplicas (see https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas#Step_4:_setting_up_GRANTs, T331014)
- Firewall access to wikireplicas (https://gerrit.wikimedia.org/r/c/operations/puppet/+/892446)
- Manually (or with the functional tests if possible) test that cloudcontrol1005 can correctly reach the APIs on labstore1004 and paws-nfs-1
- Test thoroughly on cloudcontrol1005
- Get the script config changes merged in puppet, and deploy in cloudcontrol1005
- Set profile::wmcs::services::toolsdb_replica_cnf::* hiera variables to point to labstore1004 and paws-nfs-1 as default and PAWS API backends
- Manually copy the script to cloudcontrol1005
- Run in dry-run mode
- Run in single-account mode
- tool
- user
- paws
- Work with @Vivian to get https://github.com/toolforge/paws/pull/264 running on a new cluster and test that everything works
- Announce downtime window for maintain-dbusers. This should only affect new tools and maintainers getting their replica.my.cnf file, so the announcement doesn't need a lot of lead time. We can start preparing folks for the bigger NFS downtime change at the same time if we are confident about timelines too, but that is not 100% necessary.
- Move paws to the new paws-nfs-1 VM
- Do a last rsync
- Sync with @Vivian to deploy the changes/restart the pods
- Remove maintain-dbusers from labstore1004.eqiad.wmnet:
- Remove ::profile::wmcs::nfs::maintain_dbusers from role::wmcs::nfs::primary
- Run puppet
- Manually clean up the 'maintain-dbusers' systemd service, /etc/dbusers.yaml, and /usr/local/sbin/maintain-dbusers
- Verify that NFS works on paws
- Add maintain-dbusers to cloudcontrol1005:
- Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/809921 to make maintain-dbusers talk to the new REST API
- Add ::profile::wmcs::nfs::maintain_dbusers to ::role::wmcs::openstack::eqiad1::control
- Set hiera for profile::wmcs::nfs::primary::cluster_ip to the IP of cloudcontrol1005 (is this needed?)
- Set the profile::wmcs::services::toolsdb_replica_cnf::htpassword and htpassword_salt secrets, then test to verify that the test authentication details are no longer in use.
- Verify that maintain-dbusers works on cloudcontrol1005
- Notify folks that the maintain-dbusers maintenance is over
- Clean up puppet code leftovers from maintain_dbusers being under nfs
- Clean up service file from labstore1004
- Icinga alerts should not happen on non-primary cloudcontrols
- Clean up old icinga alerts from labstore1004/1005
- Move maintain_dbusers out of nfs profile (probably under services is better) (https://gerrit.wikimedia.org/r/c/operations/puppet/+/899662)
- Remove unused variables (https://gerrit.wikimedia.org/r/c/operations/puppet/+/899663)
- Refactor the maintain-dbusers code a bit
- Avoid doing one request to mediawiki for each paws user if --only-user is passed
- Create functions for repeated chunks of code
- Minimize nested indentations
- Use the cursor.execute query parameters instead of string.format to avoid escaping issues
- Update the documentation on wikitech https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage#maintain-dbusers
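The refactoring item about cursor.execute can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in so that it runs anywhere; the table and column names are hypothetical. SQLite uses `?` as its placeholder, whereas MySQL drivers such as PyMySQL use `%s`, but the principle is the same: pass the values as a separate argument and let the driver bind them.

```python
import sqlite3


def find_account(conn: sqlite3.Connection, username: str):
    """Look up an account by name with a bound parameter, not string formatting."""
    cur = conn.cursor()
    # The driver binds `username` itself, so quotes in the value cannot
    # change the structure of the SQL statement.
    cur.execute("SELECT id, username FROM account WHERE username = ?", (username,))
    return cur.fetchone()


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one row whose name contains a quote."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute("INSERT INTO account (username) VALUES (?)", ("o'brien",))
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    # A name with an apostrophe is matched without any manual escaping.
    print(find_account(db, "o'brien"))
    # An injection-style value is treated as plain data and matches nothing.
    print(find_account(db, "x' OR '1'='1"))
    db.close()
```

With string.format the first lookup would produce invalid SQL and the second would match every row, which is the class of escaping issue the refactor is meant to remove.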
At this point we will be ready to start on the larger steps of building and switching to a new NFS service hosted within the tools project:
- Build NFS server in tools project with a similar setup as the PAWS project's NFS server
- Initial NFS content sync
- Announce Toolforge wide NFS migration downtime window with an estimate of read-only time based on initial sync
- Wait for NFS migration downtime window
- Stop maintain-dbusers on cloudcontrol1005
- Set labstore1004 to read-only
- Re-sync NFS content
- Update mounts across Toolforge to use the new NFS server
- Update hiera to point maintain-dbusers on cloudcontrol1005 at new NFS server instead of labstore1004
- Start maintain-dbusers on cloudcontrol1005
- Announce end of NFS migration downtime window
- Clean up old labstore1004/1005 grants from the labsdbaccounts database on m5-master
- Profit!!!