
Split maintain-dbusers.py into two parts, one to run on cloudcontrol nodes and one to run on an NFS server VM
Closed, ResolvedPublic

Description

Maintain-dbusers maintains the rights of users to access several different databases. It also generates access credentials and stores them in users' home directories on NFS (in replica.my.cnf).
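The replica.my.cnf piece is conceptually small: generate a password, render a MySQL option file into the account's home directory on NFS, and lock it down. A minimal sketch of that step, assuming the file layout and password scheme (this is not the actual maintain-dbusers code):

import configparser
import os
import secrets


def write_replica_cnf(home_dir: str, mysql_username: str) -> str:
    """Generate a password and write <home>/replica.my.cnf, returning the password."""
    password = secrets.token_hex(16)  # assumed scheme; the real generator may differ
    cnf = configparser.ConfigParser()
    cnf["client"] = {"user": mysql_username, "password": password}
    path = os.path.join(home_dir, "replica.my.cnf")
    with open(path, "w") as fd:
        cnf.write(fd)
    # Owner read-only; the real tool additionally protects the file (see the
    # chattr -i mention below).
    os.chmod(path, 0o400)
    return password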

Right now maintain-dbusers only runs on labstore1004. Soon it will need to run on several different cloud-vps-hosted NFS servers instead.

Right now the script is installed by profile::wmcs::nfs::maintain_dbusers, which does a couple of fancy things that rely on running on wikiland/production hardware:

  • Aggregate database hosts from puppetdb with query_nodes
  • include passwords::mysql::labsdb and passwords::labsdbaccounts

I can think of a few ways to move forward with this:

  1. Split out the part of this tool that writes replica.my.cnf into an API hosted on the NFS server; run the rest of maintain-dbusers on a different prod/wikiland host (e.g. a cloudcontrol host) and call out to the NFS servers for the small NFS piece (a rough sketch follows this list).
  2. Adapt the puppet setup so it can run on a VM: duplicate db creds into a project-local puppetmaster, and manually maintain a list of db servers in hiera.
  3. Run the existing maintain-dbusers script on a cloudcontrol; mount the paws and tools NFS volumes on that cloudcontrol for replica.my.cnf maintenance; drop the chattr -i feature.
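To make option 1 concrete, a rough sketch of the cloudcontrol side of the split could look like the snippet below. Only the endpoint name matches the plan further down; the URL, auth, and payload fields are illustrative assumptions, not a final API:

import requests


def write_replica_cnf_via_api(api_url, auth, account_id, account_type, mysql_username, password):
    """Ask the small service on the NFS server to write the replica.my.cnf file."""
    response = requests.post(
        f"{api_url}/write-replica-cnf",
        json={  # field names are assumptions for illustration only
            "account_id": account_id,
            "account_type": account_type,  # "tool", "user" or "paws"
            "username": mysql_username,
            "password": password,
        },
        auth=auth,      # the service is expected to sit behind basic auth (htpasswd)
        timeout=10,
    )
    response.raise_for_status()

maintain-dbusers keeps all of the account and database grant logic; only the "write the file onto NFS" step becomes an HTTP call.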

Rollout plan (copied from comment T303663#8500586, with extra steps from T303663#8621555 added)

Rollout plan for T304040: REST api service to manage toolforge replica.my.cnf:

  • Deploy and test the REST API in a testing instance (see the smoke-test sketch after this plan)
    • For account_type in [tool, user, paws]:
      • POST /fetch-replica-path
      • POST /write-replica-cnf
      • POST /read-replica-cnf
  • Deploy and test the REST API on labstore1004.eqiad.wmnet
    • Add ::profile::wmcs::services::toolsdb_replica_cnf to role::wmcs::nfs::primary
    • Run puppet
    • Run manual API tests
  • Rebuild paws-nfs-1 VM with bullseye
    • create a new vm with wmcs.nfs.add_server, using project paws, and not creating a new volume nor a service ip (see this)
    • Move the volume from the old nfs vm to the new
  • Deploy and test the REST API on paws-nfs-2.paws.eqiad1.wikimedia.cloud (the new nfs server)
    • Add ::profile::wmcs::services::toolsdb_replica_cnf via the instance's puppet tab in Horizon
    • Run puppet
    • Run manual API tests
  • Manually (or with the functional tests if possible) test that cloudcontrol1005 can correctly reach the APIs on labstore1004 and paws-nfs-1
  • Test thoroughly on cloudcontrol1005
    • Get the script config changes merged in puppet, and deploy in cloudcontrol1005
    • Set profile::wmcs::services::toolsdb_replica_cnf::* hiera variables to point to labstore1004 and paws-nfs-1 as default and PAWS API backends
    • Manually copy the script to cloudcontrol1005
    • Run in dry-run mode
    • Run in single-account mode
      • tool
      • user
      • paws
    • Work with @Vivian to get https://github.com/toolforge/paws/pull/264 running on a new cluster and test that everything works
  • Announce downtime window for maintain-dbusers. This should only affect new tools and maintainers getting their replica.my.cnf file, so the announcement doesn't need a lot of lead time. We can start preparing folks for the bigger NFS downtime change at the same time if we are confident about timelines too, but that is not 100% necessary.
  • Move paws to the new paws-nfs-1 VM
    • Do a last rsync
    • Sync with @Vivian to deploy the changes/restart the pods
  • Remove maintain-dbusers from labstore1004.eqiad.wmnet:
    • Remove ::profile::wmcs::nfs::maintain_dbusers from role::wmcs::nfs::primary
    • Run puppet
    • Manually clean up the 'maintain-dbusers' systemd service, /etc/dbusers.yaml, and /usr/local/sbin/maintain-dbusers
  • Verify that NFS works on paws
  • Add maintain-dbusers to cloudcontrol1005:
    • Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/809921 to make maintain-dbusers talk to the new REST API
    • Add ::profile::wmcs::nfs::maintain_dbusers to ::role::wmcs::openstack::eqiad1::control
    • Set hiera for profile::wmcs::nfs::primary::cluster_ip to the ip of cloudcontrol1005 <- is this needed?
    • Set the profile::wmcs::services::toolsdb_replica_cnf::htpassword and htpassword_salt secrets, and test to verify that the test authentication details are no longer in use.
  • Verify that maintain-dbusers works on cloudcontrol1005
  • Notify folks that the maintain-dbusers maintenance is over
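The "run manual API tests" steps above boil down to exercising the three endpoints for each account type. A throwaway smoke-test loop along these lines (host, credentials, and payload are placeholders, not the real request schema) is enough for the manual checks:

import requests

API = "https://replica-cnf-test.example.wikimedia.cloud"  # placeholder test instance
AUTH = ("api-user", "api-password")                       # the API is htpasswd-protected

for account_type in ("tool", "user", "paws"):
    payload = {"account_type": account_type, "account_id": "maintain-dbusers-smoke-test"}
    for endpoint in ("/fetch-replica-path", "/write-replica-cnf", "/read-replica-cnf"):
        resp = requests.post(f"{API}{endpoint}", json=payload, auth=AUTH, timeout=10)
        print(account_type, endpoint, resp.status_code, resp.text[:200])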

At this point we will be ready to start working on the bigger steps of building and switching to a new NFS service hosted within the tools project:

  • Build NFS server in tools project with a similar setup as the PAWS project's NFS server
  • Initial NFS content sync
  • Announce Toolforge wide NFS migration downtime window with an estimate of read-only time based on initial sync
  • Wait for NFS migration downtime window
  • Stop maintain-dbusers on cloudcontrol1005
  • Set labstore1004 to read-only
  • Re-sync NFS content
  • Update mounts across Toolforge to use the new NFS server
  • Update hiera to point maintain-dbusers on cloudcontrol1005 at new NFS server instead of labstore1004
  • Start maintain-dbusers on cloudcontrol1005
  • Announce end of NFS migration downtime window
  • Clean up old labstore1004/1005 grants from the m5-master db labsdbaccounts
  • Profit!!!

Details

Repo | Branch | Lines +/-
operations/homer/public | master | +0 -8
operations/puppet | production | +226 -176
operations/puppet | production | +46 -40
operations/puppet | production | +1 -1
operations/puppet | production | +0 -4
operations/puppet | production | +32 -37
operations/puppet | production | +27 -27
operations/puppet | production | +0 -4
operations/puppet | production | +23 -5
operations/puppet | production | +7 -3
operations/homer/public | master | +1 -0
operations/puppet | production | +17 -1
operations/puppet | production | +48 -0
operations/puppet | production | +3 -3
operations/puppet | production | +6 -6
operations/puppet | production | +14 -1
operations/puppet | production | +2 -1
operations/puppet | production | +15 -3
operations/puppet | production | +2 -0
operations/puppet | production | +83 -27
operations/puppet | production | +1 -0

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Thanks @Andrew for helping rebuild the paws-nfs1 host!

Change 892446 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] puppet: update firewall rules for cloudcontrol1005

https://gerrit.wikimedia.org/r/892446

Change 892446 merged by David Caro:

[operations/puppet@production] puppet: update firewall rules for cloudcontrol1005

https://gerrit.wikimedia.org/r/892446

I know this isn't realistic for the grants, but for firewall rules and puppetization I suggest that this be set up on all cloudcontrol nodes, with an active/passive setup. That will make future hardware refreshes a lot easier.

Change 893760 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolsdb_replica_cnf: configure if we want to redirect to https

https://gerrit.wikimedia.org/r/893760

Change 893760 merged by David Caro:

[operations/puppet@production] toolsdb_replica_cnf: configure if we want to redirect to https

https://gerrit.wikimedia.org/r/893760

dcaro updated the task description.
dcaro added subscribers: Vivian, komla.
dcaro updated the task description.

Change 894225 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers

https://gerrit.wikimedia.org/r/894225

Change 894227 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[operations/puppet@production] wmcs:nfs:replica_cnf_api_service: update PAWS_REPLICA_CNF_PATH

https://gerrit.wikimedia.org/r/894227

Change 894225 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: nfs: primary: introduce missing hiera keys for maintain_dbusers

https://gerrit.wikimedia.org/r/894225

Change 894227 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs:nfs:replica_cnf_api_service: update PAWS_REPLICA_CNF_PATH

https://gerrit.wikimedia.org/r/894227

Reporting on the issues I noticed while trying to test the refactored maintain-dbusers.py script on the cloudcontrol1005 server.
There were a number of minor issues around module versions, but those have been fixed.
The two most important issues are:

  • pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '172.16.7.153' (timed out)") (172.16.7.153 is the ip of clouddb1001.clouddb-services.eqiad1.wikimedia.cloud)
  • The usefulness of the is_active_nfs function in the script.

For the pymysql.err.OperationalError, I had this discussion with Bryan:

Hello andrewbogott, while testing "maintain-dbusers harvest ......" on cloudcontrol1005, pymysql throws a "timed out" error, "can't connect to mysql server on 172.16.7.153" (just found out from Bryan that this ip belongs to clouddb1001.clouddb-services.eqiad1.wikimedia.cloud), and when I try to ping the ip 172.16.7.153 it isn't successful (which is ok because of the "not on the same realm" thing). The interesting thing here is that I also can't ping the ip 172.16.7.153 from the labstore1004 server (sure, also "not on the same realm"), even though the labstore1004 server hosts the "working" copy of the old maintain-dbusers.py script. If labstore1004 can't connect to 172.16.7.153, doesn't that mean that something is failing in the "running" copy of the maintain-dbusers.py script? More specifically, in the "harvest_replica_accounts" function.

<bd808> Bryan Davis: We probably have not tried to harvest accounts since ToolsDB moved inside the Cloud VPS realm. The `harvest_replica_accounts` functionality is basically a fallback to recover information that we would typically expect to be in the credential provisioning database if it is somehow lost or corrupted.

The chat above helped to explain things. But if we are experiencing this issue in the old maintain-dbusers.py (which we are), maybe we should investigate it?

For the is_active_nfs function, here is the conversation I had with Bryan about it:

andrewbogott: around? Can you help look at this function "is_active_nfs" https://gerrit.wikimedia.org/r/c/operations/puppet/+/809921/35/modules/profile/files/wmcs/nfs/maintain-dbusers.py#1006 ? Do we expect the cloudcontrol1005 server to become the "active nfs server" when all this is deployed? If not, then maybe we no longer need this function?

<bd808> Bryan Davis: Raymond_Ndibe: no, a cloudcontrol will never be the active NFS server. The point of that check was to have the software actively running on multiple servers and have some way to check to see if a particular copy of it should be active. I would assume that we would want some replacement for running on HA pairs like the cloudcontrol* servers, but it is going to somehow look different. It probably turns into a check against some "active" flag in hiera data (which is really what that check is in disguise).

Looking at this discussion, we might need a replacement for the is_active_nfs function, one tailored to our new setup.

So far I've only tested the dry-run option. @Andrew and @dcaro, it would be nice if we could test the --only-users option together, since we need someone with better knowledge of how the old maintain-dbusers script works to be certain that the new changes meet the requirements.

Change 895818 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] replica_cnf_api: skip tool account that don't have a home

https://gerrit.wikimedia.org/r/895818

Change 895756 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: add nicer logging with dry run prefix

https://gerrit.wikimedia.org/r/895756

Change 895814 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: fix systemd service description

https://gerrit.wikimedia.org/r/895814

Change 895838 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: skip tool accounts that are not ready

https://gerrit.wikimedia.org/r/895838

Change 895818 merged by David Caro:

[operations/puppet@production] replica_cnf_api: skip tool account that don't have a home

https://gerrit.wikimedia.org/r/895818

The chat above helped to explain things. But if we are experiencing this issue in the old maintain-dbusers.py (which we are), maybe we should investigate it?

Yep, it actually connects from labstore1004, but only via mysql:

root@labstore1004:~# mysql -h 172.16.7.153 -u labsdbadmin -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8905170
Server version: 10.1.44-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:labsdbadmin@172.16.7.153 [(none)]>

So that should be fixed, yes. Looking.

I have also found https://gerrit.wikimedia.org/r/c/operations/puppet/+/895838, not a major issue either (it will crash the run, but the next run will move on to the next missing user).

Looking at this discussion, we might need a replacement for the is_active_nfs function, one tailored to our new setup.

Agreed, I think it does not make sense anymore as it is. We can instead enable the service only on the primary host (probably with a new hiera entry like primary_maintaindb_host: cloudcontrol1005), so when we reimage/remove 1005 we just change puppet at the same time.
It can be a parameter to the maintain_dbusers.pp profile.
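As a sketch of what that replacement could look like in the script (the hiera key name is just the suggestion above, and the config plumbing is assumed; nothing like this is merged yet):

import socket

import yaml


def should_run(config_path: str = "/etc/dbusers.yaml") -> bool:
    """Only act when running on the host configured as the maintain-dbusers primary."""
    with open(config_path) as fd:
        config = yaml.safe_load(fd)
    # primary_maintaindb_host would be rendered into the config by puppet from hiera,
    # e.g. primary_maintaindb_host: cloudcontrol1005
    primary = config.get("primary_maintaindb_host", "")
    return socket.getfqdn().split(".")[0] == primary.split(".")[0]

That would replace is_active_nfs, and switching the active host becomes a one-line hiera change.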

About the connectivity issues:

root@labstore1004:~# traceroute --tcp -p 3306 172.16.7.153
traceroute to 172.16.7.153 (172.16.7.153), 30 hops max, 60 byte packets
 1  ae3-1119.cr2-eqiad.wikimedia.org (10.64.37.3)  0.468 ms  0.437 ms  0.427 ms
 2  * * *
 3  cloudgw1001.eqiad1.wikimediacloud.org (185.15.56.245)  0.256 ms  0.266 ms  0.257 ms
 4  cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (185.15.56.238)  0.288 ms  0.295 ms  0.264 ms
 5  172.16.7.153 (172.16.7.153)  0.355 ms  0.358 ms  0.357 ms

--- from cloudcontrol1005
root@cloudcontrol1005:~# traceroute --tcp -p 3306 172.16.7.153
traceroute to 172.16.7.153 (172.16.7.153), 30 hops max, 60 byte packets
 1  ae3-1003.cr2-eqiad.wikimedia.org (208.80.154.67)  0.349 ms  0.352 ms  0.329 ms
 2  * * *
 3  cloudgw1001.eqiad1.wikimediacloud.org (185.15.56.245)  0.173 ms  0.209 ms  0.195 ms
 4  cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (185.15.56.238)  0.241 ms  0.268 ms  0.253 ms
 5  * * *
...
30  * * *

looking

I think it might be more complex than just the firewall: it seems that packets from cloudcontrol1005 are not reaching clouddb1001 at all, packets from labstore1004 reach it and come back, and packets from cloudcephmon1001 arrive but the replies are lost (that last one was for testing the prod net without the wikimedia.org domain).

Tested doing a tcpdump of the source ip on clouddb1001:

root@clouddb1001:~# tcpdump -nvvi any host 10.64.20.67
root@clouddb1001:~# tcpdump -nvvi any host 208.80.154.85
root@clouddb1001:~# tcpdump -nvvi any host 10.64.37.19

And then running the above traceroute on tcp port 3306 while watching the traffic in tcpdump; the traceroute output:

root@cloudcephmon1001:~# traceroute --tcp -p 3306 172.16.7.153
traceroute to 172.16.7.153 (172.16.7.153), 30 hops max, 60 byte packets
 1  * * *
 2  xe-3-0-4-1100.cr2-eqiad.eqiad.wmnet (10.64.147.14)  0.310 ms  0.313 ms  0.299 ms
 3  * * *
 4  cloudgw1001.eqiad1.wikimediacloud.org (185.15.56.245)  0.171 ms  0.170 ms  0.148 ms
 5  cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org (185.15.56.238)  0.160 ms  0.179 ms  0.164 ms
 6  * * *
...
30  * * *

Maybe @cmooney or @ayounsi can help here? (tl;dr: we want cloudcontrol1005/1006/1007 to have the same access to 172.16.7.153 that labstore1004 has)
Or @aborrero might know if it's on the cloudgw side or similar.

I think it might be vlan/routing related.

  • labstore1004 is in the vlan cloud-support -> works
  • cloudcontrol1005 is in the public vlan -> does not work, nothing reaches clouddb1001
  • cloudcephmon1001 is in the cloud-hosts vlan -> does not work, but packets reach clouddb1001, just don't come back

Change 896051 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-cloud: permit toolsdb return traffic to cloudcontrols

https://gerrit.wikimedia.org/r/896051

For the connectivity, a floating IP has been created for clouddb1001 (185.15.56.15); we will use that from now on, which also makes us respect the current network standards.

Change 895838 abandoned by David Caro:

[operations/puppet@production] maintain-dbusers: skip tool accounts that are not ready

Reason:

Merged into I1c586de78c2e81fa09843f963dbf9b591c7b0628

https://gerrit.wikimedia.org/r/895838

Change 896051 abandoned by Majavah:

[operations/homer/public@master] cr-cloud: permit toolsdb return traffic to cloudcontrols

Reason:

https://gerrit.wikimedia.org/r/896051

Change 895756 merged by David Caro:

[operations/puppet@production] maintain-dbusers: add nicer logging with dry run prefix

https://gerrit.wikimedia.org/r/895756

dcaro updated the task description.

Change 898852 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] replica_cnf: skip toolforge users without a home

https://gerrit.wikimedia.org/r/898852

Change 898852 merged by David Caro:

[operations/puppet@production] replica_cnf: skip toolforge users without a home

https://gerrit.wikimedia.org/r/898852

dcaro updated the task description.

Change 899532 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: replace icinga check with prometheus one

https://gerrit.wikimedia.org/r/899532

Change 899532 merged by David Caro:

[operations/puppet@production] maintain_dbusers: remove icinga alert, we'll use the default one

https://gerrit.wikimedia.org/r/899532

Change 899662 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: move out of nfs to services

https://gerrit.wikimedia.org/r/899662

Change 899663 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain_dbusers: Remove unused param and adapt to best practices

https://gerrit.wikimedia.org/r/899663

dcaro updated the task description.

Change 902815 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintain-dbusers: run isort and black and use pep563 types

https://gerrit.wikimedia.org/r/902815

Change 902816 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] maintaint-dbusers: refactor

https://gerrit.wikimedia.org/r/902816

Change 904161 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs::nfs::primary: remove unused mysql_variances hiera

https://gerrit.wikimedia.org/r/904161

Change 899662 merged by David Caro:

[operations/puppet@production] maintain_dbusers: move out of nfs to services

https://gerrit.wikimedia.org/r/899662

Change 899663 merged by David Caro:

[operations/puppet@production] maintain_dbusers: Remove unused param and adapt to best practices

https://gerrit.wikimedia.org/r/899663

Change 904161 merged by David Caro:

[operations/puppet@production] wmcs::nfs::primary: remove unused mysql_variances hiera

https://gerrit.wikimedia.org/r/904161

dcaro updated the task description.

Change 895814 merged by David Caro:

[operations/puppet@production] maintain-dbusers: fix systemd service description

https://gerrit.wikimedia.org/r/895814

Change 902815 merged by David Caro:

[operations/puppet@production] maintain-dbusers: run isort and black and use pep563 types

https://gerrit.wikimedia.org/r/902815

Change 902816 merged by David Caro:

[operations/puppet@production] maintain-dbusers: refactor

https://gerrit.wikimedia.org/r/902816

Change 907132 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] cr-cloud: remove clouddb_return term

https://gerrit.wikimedia.org/r/907132

Change 907132 merged by jenkins-bot:

[operations/homer/public@master] cr-cloud: remove clouddb_return term

https://gerrit.wikimedia.org/r/907132

dcaro changed the task status from Open to In Progress. Apr 26 2023, 8:00 AM
dcaro moved this task from Backlog to In progress on the cloud-services-team (FY2022/2023-Q4) board.