
Split maintain-dbusers.py into two parts, one to run on cloudcontrol nodes and one to run on an NFS server VM
Closed, ResolvedPublic

Description

Maintain-dbusers maintains the rights of users to access several different databases. It also generates access credentials and stores them in users' home directories on NFS (in replica.my.cnf).
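The replica.my.cnf piece is conceptually small: generate a password, render a MySQL option file into the account's home directory on NFS, and lock it down. A minimal sketch of that step, assuming the file layout and password scheme (this is not the actual maintain-dbusers code):

import configparser
import os
import secrets


def write_replica_cnf(home_dir: str, mysql_username: str) -> str:
    """Generate a password and write <home>/replica.my.cnf, returning the password."""
    password = secrets.token_hex(16)  # assumed scheme; the real generator may differ
    cnf = configparser.ConfigParser()
    cnf["client"] = {"user": mysql_username, "password": password}
    path = os.path.join(home_dir, "replica.my.cnf")
    with open(path, "w") as fd:
        cnf.write(fd)
    # Owner read-only; the real tool additionally protects the file (see the
    # chattr -i mention below).
    os.chmod(path, 0o400)
    return password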

Right now maintain-dbusers only runs on labstore1004. Soon it will need to run on several different cloud-vps-hosted NFS servers instead.

Right now the script is installed by profile::wmcs::nfs::maintain_dbusers, which does a couple of fancy things that rely on running on wikiland/production hardware:

  • Aggregate database hosts from puppetdb with query_nodes
  • include passwords::mysql::labsdb and passwords::labsdbaccounts

I can think of a few ways to move forward with this:

  1. Split out the part of this tool that writes replica.my.cnf into an API hosted on the NFS server; run the rest of maintain-dbusers on a different prod/wikiland host (e.g. a cloudcontrol host) and call out to the NFS servers for the small NFS piece (a rough sketch follows this list).
  2. Adapt the puppet setup so it can run on a VM: duplicate db creds into a project-local puppetmaster, and manually maintain a list of db servers in hiera.
  3. Run the existing maintain-dbusers script on a cloudcontrol; mount the paws and tools NFS volumes on that cloudcontrol for replica.my.cnf maintenance; drop the chattr -i feature.
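To make option 1 concrete, a rough sketch of the cloudcontrol side of the split could look like the snippet below. Only the endpoint name matches the plan further down; the URL, auth, and payload fields are illustrative assumptions, not a final API:

import requests


def write_replica_cnf_via_api(api_url, auth, account_id, account_type, mysql_username, password):
    """Ask the small service on the NFS server to write the replica.my.cnf file."""
    response = requests.post(
        f"{api_url}/write-replica-cnf",
        json={  # field names are assumptions for illustration only
            "account_id": account_id,
            "account_type": account_type,  # "tool", "user" or "paws"
            "username": mysql_username,
            "password": password,
        },
        auth=auth,      # the service is expected to sit behind basic auth (htpasswd)
        timeout=10,
    )
    response.raise_for_status()

maintain-dbusers keeps all of the account and database grant logic; only the "write the file onto NFS" step becomes an HTTP call.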

Rollout plan (copied from comment T303663#8500586, with extra steps from T303663#8621555 added)

Rollout plan for T304040: REST api service to manage toolforge replica.my.cnf:

  • Deploy and test the REST API in a testing instance (see the smoke-test sketch after this plan)
    • For account_type in [tool, user, paws]:
      • POST /fetch-replica-path
      • POST /write-replica-cnf
      • POST /read-replica-cnf
  • Deploy and test the REST API on labstore1004.eqiad.wmnet
    • Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary
    • Run puppet
    • Run manual API tests
  • Rebuild paws-nfs-1 VM with bullseye
    • create a new vm with wmcs.nfs.add_server, using project paws, and not creating a new volume nor a service ip (see this)
    • Move the volume from the old nfs vm to the new
  • Deploy and test the REST API on paws-nfs-2.paws.eqiad1.wikimedia.cloud (the new nfs server)
    • Add ::profile::wmcs::services::toolsdb_replica_cnf via the instance's puppet tab in Horizon
    • Run puppet
    • Run manual API tests
  • Manually (or with the functional tests if possible) test that cloudcontrol1005 can correctly reach the APIs on labstore1004 and paws-nfs-1
  • Test thoroughly on cloudcontrol1005
    • Get the script config changes merged in puppet, and deploy in cloudcontrol1005
    • Set profile::wmcs::services::toolsdb_replica_cnf::* hiera variables to point to labstore1004 and paws-nfs-1 as default and PAWS API backends
    • Manually copy the script to cloudcontrol1005
    • Run in dry-run mode
    • Run in single-account mode
      • tool
      • user
      • paws
    • Work with @Vivian to get https://github.com/toolforge/paws/pull/264 running on a new cluster and test that everything works
  • Announce downtime window for maintain-dbusers. This should only affect new tools and maintainers getting their replica.my.cnf file, so the announcement doesn't need a lot of lead time. We can start preparing folks for the bigger NFS downtime change at the same time if we are confident about timelines too, but that is not 100% necessary.
  • Move paws to the new paws-nfs-1 VM
    • Do a last rsync
    • Sync with @Vivian to deploy the changes/restart the pods
  • Remove maintain-dbusers from labstore1004.eqiad.wmnet:
    • Remove ::profile::wmcs::nfs::maintain_dbusers from role::wmcs::nfs::primary
    • Run puppet
    • Manually clean up the 'maintain-dbusers' systemd service, /etc/dbusers.yaml, and /usr/local/sbin/maintain-dbusers
  • Verify that NFS works on paws
  • Add maintain-dbusers to cloudcontrol1005:
    • Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/809921 to make maintain-dbusers talk to the new REST API
    • Add ::profile::wmcs::nfs::maintain_dbusers to ::role::wmcs::openstack::eqiad1::control
    • Set hiera for profile::wmcs::nfs::primary::cluster_ip to the ip of cloudcontrol1005 <- is this needed?
    • Set the profile::wmcs::services::toolsdb_replica_cnf::htpassword and htpassword_salt secrets, and test to verify that the test authentication details are no longer in use.
  • Verify that maintain-dbusers works on cloudcontrol1005
  • Notify folks that the maintain-dbusers maintenance is over
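The "run manual API tests" steps above boil down to exercising the three endpoints for each account type. A throwaway smoke-test loop along these lines (host, credentials, and payload are placeholders, not the real request schema) is enough for the manual checks:

import requests

API = "https://replica-cnf-test.example.wikimedia.cloud"  # placeholder test instance
AUTH = ("api-user", "api-password")                       # the API is htpasswd-protected

for account_type in ("tool", "user", "paws"):
    payload = {"account_type": account_type, "account_id": "maintain-dbusers-smoke-test"}
    for endpoint in ("/fetch-replica-path", "/write-replica-cnf", "/read-replica-cnf"):
        resp = requests.post(f"{API}{endpoint}", json=payload, auth=AUTH, timeout=10)
        print(account_type, endpoint, resp.status_code, resp.text[:200])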

At this point we will be ready to start working on the bigger steps of building and switching to a new NFS service hosted within the tools project:

  • Build NFS server in tools project with a similar setup as the PAWS project's NFS server
  • Initial NFS content sync
  • Announce Toolforge wide NFS migration downtime window with an estimate of read-only time based on initial sync
  • Wait for NFS migration downtime window
  • Stop maintain-dbusers on cloudcontrol1005
  • Set labstore1004 to read-only
  • Re-sync NFS content
  • Update mounts across Toolforge to use the new NFS server
  • Update hiera to point maintain-dbusers on cloudcontrol1005 at new NFS server instead of labstore1004
  • Start maintain-dbusers on cloudcontrol1005
  • Announce end of NFS migration downtime window
  • Clean up old labstore1004/1005 grants from the m5-master db labsdbaccounts
  • Profit!!!

Details

Repo | Branch | Lines +/-
operations/homer/public | master | +0 -8
operations/puppet | production | +226 -176
operations/puppet | production | +46 -40
operations/puppet | production | +1 -1
operations/puppet | production | +0 -4
operations/puppet | production | +32 -37
operations/puppet | production | +27 -27
operations/puppet | production | +0 -4
operations/puppet | production | +23 -5
operations/puppet | production | +7 -3
operations/homer/public | master | +1 -0
operations/puppet | production | +17 -1
operations/puppet | production | +48 -0
operations/puppet | production | +3 -3
operations/puppet | production | +6 -6
operations/puppet | production | +14 -1
operations/puppet | production | +2 -1
operations/puppet | production | +15 -3
operations/puppet | production | +2 -0
operations/puppet | production | +83 -27
operations/puppet | production | +1 -0

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Thanks @Andrew for helping rebuild the paws-nfs1 host!

Change 892446 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] puppet: update firewall rules for cloudcontrol1005

https://gerrit.wikimedia.org/r/892446

Change 892446 merged by David Caro:

[operations/puppet@production] puppet: update firewall rules for cloudcontrol1005

https://gerrit.wikimedia.org/r/892446

I know this isn't realistic for the grants, but for firewall rules and puppetization I suggest that this be set up on all cloudcontrol nodes, with an active/passive setup. That will make future hardware refreshes a lot easier.

Change 893760 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolsdb_replica_cnf: configure if we want to redirect to https

https://gerrit.wikimedia.org/r/893760

Change 893760 merged by David Caro:

[operations/puppet@production] toolsdb_replica_cnf: configure if we want to redirect to https

https://gerrit.wikimedia.org/r/893760

dcaro updated the task description.
dcaro added subscribers: Vivian, komla.
dcaro updated the task description.

Change 894225 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers

https://gerrit.wikimedia.org/r/894225

Change 894227 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] wmcs:nfs:replica_cnf_api_service: update PAWS_REPLICA_CNF_PATH

https://gerrit.wikimedia.org/r/894227

Change 894225 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers

https://gerrit.wikimedia.org/r/894225

Change 894227 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs:nfs:replica_cnf_api_service: update PAWS_REPLICA_CNF_PATH

https://gerrit.wikimedia.org/r/894227

Reporting on the issues I noticed while trying to test the refactored maintain-dbusers.py script on the cloudcontrol1005 server.
There were a number of minor issues around module versions, but those have been fixed.
The two most important issues are:

  • pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '172.16.7.153' (timed out)") (172.16.7.153 is the ip of clouddb1001.clouddb-services.eqiad1.wikimedia.cloud)
  • The usefulness of the is_active_nfs function in the script.

For the pymysql.err.OperationalError, I had this discussion with Bryan:

Hello andrewbogott, while testing "maintain-dbusers harvest ......" on cloudcontrol1005, pymysql throws a "timed out" error, "can't connect to mysql server on 172.16.7.153" (just found out from Bryan that this ip belongs to clouddb1001.clouddb-services.eqiad1.wikimedia.cloud), and when I try to ping the ip 172.16.7.153 it isn't successful (which is ok because of the "not on the same realm" thing). The interesting thing here is that I also can't ping the ip 172.16.7.153 from the labstore1004 server (sure, also "not on the same realm"), even though the labstore1004 server hosts the "working" copy of the old maintain-dbusers.py script. If labstore1004 can't connect to 172.16.7.153, doesn't that mean that something is failing in the "running" copy of the maintain-dbusers.py script? More specifically, in the "harvest_replica_accounts" function.

<bd808> Bryan Davis: We probably have not tried to harvest accounts since ToolsDB moved inside the Cloud VPS realm. The `harvest_replica_accounts` functionality is basically a fallback to recover information that we would typically expect to be in the credential provisioning database if it is somehow lost or corrupted.

The chat above helped to explain things. But if we are experiencing this issue in the old maintain-dbusers.py (which we are), maybe we should investigate it?

For the is_active_nfs function, here is the conversation I had with Bryan about it:

andrewbogott: around? Can you help look at this function "is_active_nfs" https://gerrit.wikimedia.org/r/c/operations/puppet/+/809921/35/modules/profile/files/wmcs/nfs/maintain-dbusers.py#1006 ? Do we expect the cloudcontrol1005 server to become the "active nfs server" when all this is deployed? If not, then maybe we no longer need this function?

<bd808> Bryan Davis: Raymond_Ndibe: no, a cloudcontrol will never be the active NFS server. The point of that check was to have the software actively running on multiple servers and have some way to check to see if a particular copy of it should be active. I would assume that we would want some replacement for running on HA pairs like the cloudcontrol* servers, but it is going to somehow look different. It probably turns into a check against some "active" flag in hiera data (which is really what that check is in disguise).

Looking at this discussion, we might need a replacement for the is_active_nfs function, one tailored to our new setup.

So far I've only tested the dry-run option. @Andrew and @dcaro, it would be nice if we could test the --only-users option together, since we need someone with better knowledge of how the old maintain-dbusers script works to be certain that the new changes meet the requirements.

Change 895818 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] replica_cnf_api: skip tool account that don't have a home

https://gerrit.wikimedia.org/r/895818

Change 895756 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: add nicer logging with dry run prefix

https://gerrit.wikimedia.org/r/895756

Change 895814 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: fix systemd service description

https://gerrit.wikimedia.org/r/895814

Change 895838 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: skip tool accounts that are not ready

https://gerrit.wikimedia.org/r/895838

Change 895818 merged by David Caro:

[operations/puppet@production] replica_cnf_api: skip tool account that don't have a home

https://gerrit.wikimedia.org/r/895818

The chat above helped to explain things. But if we are experiencing this issue in the old maintain-dbusers.py (which we are), maybe we should investigate it?

Yep, it actually connects from labstore1004, but only via mysql:

root@labstore1004:~# mysql -h 172.16.7.153 -u labsdbadmin -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8905170
Server version: 10.1.44-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:labsdbadmin@172.16.7.153 [(none)]>

So that should be fixed, yes. Looking.

I have also found https://gerrit.wikimedia.org/r/c/operations/puppet/+/895838, not a major issue either (it will crash the run, but the next run will move on to the next missing user).

Looking at this discussion, we might need a replacement for the is_active_nfs function, one tailored to our new setup.

Agreed, I think it does not make sense anymore as it is. We can instead enable the service only on the primary host (probably with a new hiera entry like primary_maintaindb_host: cloudcontrol1005), so when we reimage/remove 1005 we just change puppet at the same time.
It can be a parameter to the maintain_dbusers.pp profile.
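As a sketch of what that replacement could look like in the script (the hiera key name is just the suggestion above, and the config plumbing is assumed; nothing like this is merged yet):

import socket

import yaml


def should_run(config_path: str = "/etc/dbusers.yaml") -> bool:
    """Only act when running on the host configured as the maintain-dbusers primary."""
    with open(config_path) as fd:
        config = yaml.safe_load(fd)
    # primary_maintaindb_host would be rendered into the config by puppet from hiera,
    # e.g. primary_maintaindb_host: cloudcontrol1005
    primary = config.get("primary_maintaindb_host", "")
    return socket.getfqdn().split(".")[0] == primary.split(".")[0]

That would replace is_active_nfs, and switching the active host becomes a one-line hiera change.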

About the connectivity issues:

root@labstore1004:~# traceroute --tcp -p 3306 172.16.7.153
traceroute to 172.16.7.153 (172.16.7.153), 30 hops max, 60 byte packets
 1  ae3-1119.cr2-eqiad.wikimedia.org (10.64.37.3)  0.468 ms  0.437 ms  0.427 ms
 2  * * *
 3  cloudgw1001.eqiad1.wikimediacloud.org (185.15.56.245)  0.256 ms  0.266 ms  0.257 ms
 4  cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (185.15.56.238)  0.288 ms  0.295 ms  0.264 ms
 5  172.16.7.153 (172.16.7.153)  0.355 ms  0.358 ms  0.357 ms

--- from cloudcontrol1005
root@cloudcontrol1005:~# traceroute --tcp -p 3306 172.16.7.153
traceroute to 172.16.7.153 (172.16.7.153), 30 hops max, 60 byte packets
 1  ae3-1003.cr2-eqiad.wikimedia.org (208.80.154.67)  0.349 ms  0.352 ms  0.329 ms
 2  * * *
 3  cloudgw1001.eqiad1.wikimediacloud.org (185.15.56.245)  0.173 ms  0.209 ms  0.195 ms
 4  cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (185.15.56.238)  0.241 ms  0.268 ms  0.253 ms
 5  * * *
...
30  * * *

looking

I think it might be more complex than just the firewall: it seems that packets from cloudcontrol1005 are not reaching clouddb1001 at all, packets from labstore1004 reach it and come back, and packets from cloudcephmon1001 arrive but the replies are lost (that last one was for testing the prod net without the wikimedia.org domain).

Tested doing a tcpdump of the source ip on clouddb1001:

root@clouddb1001:~# tcpdump -nvvi any host 10.64.20.67
root@clouddb1001:~# tcpdump -nvvi any host 208.80.154.85
root@clouddb1001:~# tcpdump -nvvi any host 10.64.37.19

And then running the above traceroute on tcp port 3306 while watching the traffic in tcpdump; the traceroute output:

root@cloudcephmon1001:~# traceroute --tcp -p 3306 172.16.7.153
traceroute to 172.16.7.153 (172.16.7.153), 30 hops max, 60 byte packets
 1  * * *
 2  xe-3-0-4-1100.cr2-eqiad.eqiad.wmnet (10.64.147.14)  0.310 ms  0.313 ms  0.299 ms
 3  * * *
 4  cloudgw1001.eqiad1.wikimediacloud.org (185.15.56.245)  0.171 ms  0.170 ms  0.148 ms
 5  cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (185.15.56.238)  0.160 ms  0.179 ms  0.164 ms
 6  * * *
...
30  * * *

Maybe @cmooney or @ayounsi can help here? (tl;dr: we want cloudcontrol1005/1006/1007 to have the same access to 172.16.7.153 that labstore1004 has)
Or @aborrero might know if it's on the cloudgw side or similar.

I think it might be vlan/routing related.

  • labstore1004 is in the vlan cloud-support -> works
  • cloudcontrol1005 is in the public vlan -> does not work, nothing reaches clouddb1001
  • cloudcephmon1001 is in the cloud-hosts vlan -> does not work, but packets reach clouddb1001, just don't come back

Change 896051 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-cloud: permit toolsdb return traffic to cloudcontrols

https://gerrit.wikimedia.org/r/896051

For the connectivity, a floating IP has been created for clouddb1001 (185.15.56.15); we will use that from now on, which also makes us respect the current network standards.

Change 895838 abandoned by David Caro:

[operations/puppet@production] maintain-dbusers: skip tool accounts that are not ready

Reason:

Merged into I1c586de78c2e81fa09843f963dbf9b591c7b0628

https://gerrit.wikimedia.org/r/895838

Change 896051 abandoned by Majavah:

[operations/homer/public@master] cr-cloud: permit toolsdb return traffic to cloudcontrols

Reason:

https://gerrit.wikimedia.org/r/896051

Change 895756 merged by David Caro:

[operations/puppet@production] maintain-dbusers: add nicer logging with dry run prefix

https://gerrit.wikimedia.org/r/895756

dcaro updated the task description.

Change 898852 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] replica_cnf: skip toolforge users without a home

https://gerrit.wikimedia.org/r/898852

Change 898852 merged by David Caro:

[operations/puppet@production] replica_cnf: skip toolforge users without a home

https://gerrit.wikimedia.org/r/898852

dcaro updated the task description.

Change 899532 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: replace icinga check with prometheus one

https://gerrit.wikimedia.org/r/899532

Change 899532 merged by David Caro:

[operations/puppet@production] maintain_dbusers: remove icinga alert, we'll use the default one

https://gerrit.wikimedia.org/r/899532

Change 899662 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: move out of nfs to services

https://gerrit.wikimedia.org/r/899662

Change 899663 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: Remove unused param and adapt to best practices

https://gerrit.wikimedia.org/r/899663

dcaro updated the task description.

Change 902815 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: run isort and black and use pep563 types

https://gerrit.wikimedia.org/r/902815

Change 902816 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintaint-dbusers: refactor

https://gerrit.wikimedia.org/r/902816

Change 904161 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs::nfs::primary: remove unused mysql_variances hiera

https://gerrit.wikimedia.org/r/904161

Change 899662 merged by David Caro:

[operations/puppet@production] maintain_dbusers: move out of nfs to services

https://gerrit.wikimedia.org/r/899662

Change 899663 merged by David Caro:

[operations/puppet@production] maintain_dbusers: Remove unused param and adapt to best practices

https://gerrit.wikimedia.org/r/899663

Change 904161 merged by David Caro:

[operations/puppet@production] wmcs::nfs::primary: remove unused mysql_variances hiera

https://gerrit.wikimedia.org/r/904161

dcaro updated the task description.

Change 895814 merged by David Caro:

[operations/puppet@production] maintain-dbusers: fix systemd service description

https://gerrit.wikimedia.org/r/895814

Change 902815 merged by David Caro:

[operations/puppet@production] maintain-dbusers: run isort and black and use pep563 types

https://gerrit.wikimedia.org/r/902815

Change 902816 merged by David Caro:

[operations/puppet@production] maintain-dbusers: refactor

https://gerrit.wikimedia.org/r/902816

Change 907132 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-cloud: remove clouddb_return term

https://gerrit.wikimedia.org/r/907132

Change 907132 merged by jenkins-bot:

[operations/homer/public@master] cr-cloud: remove clouddb_return term

https://gerrit.wikimedia.org/r/907132

dcaro changed the task status from Open to In Progress. Apr 26 2023, 8:00 AM
dcaro moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.