
toolsdb: firewalling changes for new setup (temporal mysql replication)
Closed, Resolved (Public)

Description

We are setting up a new database in CloudVPS to replace labsdb1004/labsdb1005. To accomplish the replacement, we need to set up temporary MySQL replication between the old and new servers. In this case, temporary means until we drop/decom the hardware.

Involved addresses are:

  • labsdb1004.eqiad.wmnet IP addr: 10.64.37.8
  • labsdb1005.eqiad.wmnet IP addr: 10.64.37.9
  • clouddb1001.eqiad.wmflabs IP addr: 172.16.7.153 (and also 185.15.56.54, but we would like to avoid using this addr; it was only set up for testing)
  • clouddb1004.eqiad.wmflabs IP addr: 172.16.7.154

We need to establish the following connections between the labsdb hardware and the VMs, in both directions, using the MySQL TCP port:

VMs --> hardware (3306/tcp)
hardware --> VMs (3306/tcp)

According to my tests, the first case VMs --> hardware (3306/tcp) is already working:

aborrero@clouddb1001:~$ telnet labsdb1005.eqiad.wmnet 3306
Trying 10.64.37.9...
Connected to labsdb1005.eqiad.wmnet.
Escape character is '^]'.
Y
5.5.5-10.0.34-MariaDB�5An'v=m/A��?�98Oc$'9'UZAwmysql_native_password^[[A^CConnection closed by foreign host.

However, the case hardware --> VMs (3306/tcp) is not working:

aborrero@labsdb1005:~$ telnet 172.16.7.153 3306
Trying 172.16.7.153...
telnet: Unable to connect to remote host: Connection timed out
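
The same reachability test can be scripted when telnet is not installed on a host. This is only a minimal sketch of a TCP connect check; the helper name `can_connect` is hypothetical and not an existing tool in our setup:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP handshake to host:port completes within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (filtered), refusals (closed port) and DNS failures.
        return False
```

For example, `can_connect("172.16.7.153", 3306)` run from labsdb1005 would be the equivalent of the telnet attempt above. A timeout and a refused connection both come back as False here, so telnet or tcpdump is still needed to tell them apart.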

I could further confirm this by running tcpdump on our neutron virtual router: the SYN from labsdb1005 arrives and the VM answers with a SYN-ACK, so packets are flowing, but the reply is being dropped somewhere beyond the router:

aborrero@cloudnet1004:~$ sudo tcpdump -i eno50 host 172.16.7.153 and tcp port 3306
12:06:53.595239 IP labsdb1005.eqiad.wmnet.43366 > 172.16.7.153.mysql: Flags [S], seq 3383912041, win 29200, options [mss 1460,sackOK,TS val 88585927 ecr 0,nop,wscale 9], length 0
12:06:53.600242 IP 172.16.7.153.mysql > labsdb1005.eqiad.wmnet.43366: Flags [S.], seq 1656666748, ack 3383912042, win 28960, options [mss 1460,sackOK,TS val 17160861 ecr 88585927,nop,wscale 9], length 0
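
To make the reading of that capture explicit, here is a small sketch that classifies tcpdump lines by their TCP flags. The function names are hypothetical, and the regex assumes the default one-line tcpdump text format shown above:

```python
import re

# Matches "IP <src> > <dst>: Flags [<flags>]" in default tcpdump output.
FLAGS_RE = re.compile(r"IP (\S+) > (\S+): Flags \[([^\]]+)\]")
KINDS = {"S": "SYN", "S.": "SYN-ACK", ".": "ACK"}

SAMPLE = [
    "12:06:53.595239 IP labsdb1005.eqiad.wmnet.43366 > 172.16.7.153.mysql: "
    "Flags [S], seq 3383912041, win 29200, options [mss 1460,sackOK,"
    "TS val 88585927 ecr 0,nop,wscale 9], length 0",
    "12:06:53.600242 IP 172.16.7.153.mysql > labsdb1005.eqiad.wmnet.43366: "
    "Flags [S.], seq 1656666748, ack 3383912042, win 28960, options [mss 1460,"
    "sackOK,TS val 17160861 ecr 88585927,nop,wscale 9], length 0",
]


def handshake_steps(lines):
    """Return (kind, src, dst) for every line that carries TCP flags."""
    steps = []
    for line in lines:
        match = FLAGS_RE.search(line)
        if match:
            src, dst, flags = match.groups()
            steps.append((KINDS.get(flags, flags), src, dst))
    return steps


def handshake_completed(steps):
    """True only if a bare ACK (third step of the handshake) was captured."""
    return any(kind == "ACK" for kind, _, _ in steps)
```

On the two captured lines this yields a SYN followed by a SYN-ACK and no final ACK, which is what you would expect if the VM's reply never reaches labsdb1005.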

I've read the config in cr2-eqiad (https://librenms.wikimedia.org/device/device=2/tab=showconfig/) and I think adding this to filter cloud-in4 would solve our problems in the hardware --> VMs (3306/tcp) case:

term labsdb_inverse {
        from {
                destination-address {
                        /* labsdb1004, labsdb1005 */
                        10.64.37.8/31;
                        /* labsdb1006 */
                        10.64.37.11/32;
                        /* labsdb1007 */
                        10.64.37.12/32;
                        /* dbproxy1010, dbproxy1011 */
                        10.64.37.14/31;
                }
                protocol tcp;
                source-port [ 3306 ];
        }
        then accept;
}

since we already have this for the other case VMs --> hardware (3306/tcp):

/*
** labsdb100[4567] are being decommed - T193264
*/
term labsdb {
        from {
                destination-address {
                        /* labsdb1004, labsdb1005 */
                        10.64.37.8/31;
                        /* labsdb1006 */
                        10.64.37.11/32;
                        /* labsdb1007 */
                        10.64.37.12/32;
                        /* dbproxy1010, dbproxy1011 */
                        10.64.37.14/31;
                }
                protocol tcp;
                destination-port [ 873 3306 5432 22 ];
        }
        then accept;
}

Not sure if the two can be merged into one single term, I'm not familiar with the firewall expression logic.
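
On the merge question, my understanding (worth double-checking with netops) is that all match conditions inside a single `from` must hold at once, while several values for one condition are alternatives. Under that assumption, a toy Python model of the two terms shows why simply putting `destination-port` and `source-port` into the same term would not work. The `Packet` class and function names are made up for illustration:

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# destination-address prefixes copied from the terms above
PREFIXES = [ip_network(p) for p in (
    "10.64.37.8/31", "10.64.37.11/32", "10.64.37.12/32", "10.64.37.14/31",
)]
DST_PORTS = {873, 3306, 5432, 22}


@dataclass(frozen=True)
class Packet:
    dst: str
    sport: int
    dport: int
    proto: str = "tcp"


def dst_matches(pkt):
    return any(ip_address(pkt.dst) in net for net in PREFIXES)


def term_labsdb(pkt):
    # VMs --> hardware: match on destination port
    return dst_matches(pkt) and pkt.proto == "tcp" and pkt.dport in DST_PORTS


def term_labsdb_inverse(pkt):
    # replies from a VM's MySQL server: match on source port
    return dst_matches(pkt) and pkt.proto == "tcp" and pkt.sport == 3306


def naive_merged_term(pkt):
    # Assumption: both port conditions in ONE term are ANDed together.
    return term_labsdb(pkt) and term_labsdb_inverse(pkt)


# A VM client opening a connection to labsdb1005:3306
syn = Packet(dst="10.64.37.9", sport=43366, dport=3306)
# The VM's MySQL server answering labsdb1005 on its ephemeral port
syn_ack = Packet(dst="10.64.37.9", sport=3306, dport=43366)
```

In this model `term_labsdb` accepts only `syn`, `term_labsdb_inverse` accepts only `syn_ack`, and the naive merged term accepts neither, so two separate terms look like the safe option.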

We may need to create additional VMs, but since the filter terms match only on the hardware addresses and not on VM addresses, we should be fine.

NOTE: 3 additional IPs were missed in this: 10.64.37.18-20. We may not be out of the woods with this problem until those are given the same general treatment.
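
A quick way to see which addresses the proposed prefixes cover is Python's standard `ipaddress` module. This is a sketch only; the prefix list is copied from the term above and the helper name `is_covered` is hypothetical:

```python
from ipaddress import ip_address, ip_network

# destination-address prefixes from the proposed term
FILTER_PREFIXES = [ip_network(p) for p in (
    "10.64.37.8/31",   # labsdb1004, labsdb1005
    "10.64.37.11/32",  # labsdb1006
    "10.64.37.12/32",  # labsdb1007
    "10.64.37.14/31",  # dbproxy1010, dbproxy1011
)]


def is_covered(addr):
    """True if addr falls inside any prefix of the filter term."""
    return any(ip_address(addr) in net for net in FILTER_PREFIXES)


# The labsdb hosts plus the three addresses mentioned in the NOTE
for addr in ("10.64.37.8", "10.64.37.9",
             "10.64.37.18", "10.64.37.19", "10.64.37.20"):
    print(addr, "covered" if is_covered(addr) else "NOT covered")
```

The two labsdb addresses fall inside 10.64.37.8/31, while none of 10.64.37.18-20 is matched by any of the listed prefixes.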

Event Timeline

aborrero triaged this task as Unbreak Now! priority. Feb 17 2019, 12:28 PM
aborrero created this task.
aborrero added projects: SRE, netops.
aborrero added a project: DBA.
aborrero updated the task description.
aborrero updated the task description.

Thanks!

[edit firewall family inet filter cloud-in4]
       term labsdb { ... }
+      /* T216353 */
+      term labsdb_return {
+          from {
+              destination-address {
+                  /* labsdb1004, labsdb1005 */
+                  10.64.37.8/31;
+                  /* labsdb1006 */
+                  10.64.37.11/32;
+                  /* labsdb1007 */
+                  10.64.37.12/32;
+                  /* dbproxy1010, dbproxy1011 */
+                  10.64.37.14/31;
+              }
+              protocol tcp;
+              source-port 3306;
+          }
+          then accept;
+      }
       term labmon--http { ... }

Mentioned in SAL (#wikimedia-operations) [2019-02-17T13:14:08Z] <XioNoX> add term labsdb_return to cloud-in4 - T216353

aborrero claimed this task.

It works!

aborrero@labsdb1005:~$ telnet 172.16.7.153 3306
Trying 172.16.7.153...
Connected to 172.16.7.153.
Escape character is '^]'.
Y
5.5.5-10.1.38-MariaDB ?&=D|c\#��?�7-y>eeK'!n)!mysql_native_password^CConnection closed by foreign host.

Thank you, @ayounsi!

We need labstore1004 and labstore1005 even more than these: 10.64.37.18, 10.64.37.19 and 10.64.37.20.

maintain-dbusers runs on the labstores, not on the labsdb servers. This will be good to have, however, because we might need to do replication if a lot of changes have come through.

Without labstore1004 and labstore1005 being able to reach the database, new user accounts on toolsdb will not be created with the current setup.

We've found a way around the criticality of this. We'll still need those components, and we should update the description, but! It can be low priority :) @aborrero found a good way around it.

Bstorm lowered the priority of this task from Unbreak Now! to Low. Feb 17 2019, 7:18 PM
Bstorm updated the task description.
Bstorm raised the priority of this task from Low to High. Feb 18 2019, 6:17 AM

Drat! The priority on the last fixup of adding those three IP addresses is now high again. We found functionality that we really cannot easily replicate with the alternative plan.

We need labstore100[45] and labsdb1004 to be able to talk to port 3306 on clouddb1001.clouddb-services.eqiad.wmflabs. This will allow us to leave the maintain-dbusers script on the labstores (T216373#4960664) and also to use labsdb1004 as a temporary replica of clouddb1001 until we get the second cloudvirt and its giant clouddbXXXX instances online.

Verifying that CR is indeed blocking the connections:

aborrero@labstore1004:~$ telnet 172.16.7.153 3306
Trying 172.16.7.153...
telnet: Unable to connect to remote host: Connection timed out

aborrero@cloudnet1004:~$ sudo tcpdump -i eno50 host labstore1004.eqiad.wmnet and tcp port 3306
11:37:47.358071 IP labstore1004.eqiad.wmnet.59510 > 172.16.7.153.mysql: Flags [S], seq 2815090186, win 42340, options [mss 1460,sackOK,TS val 3732952386 ecr 0,nop,wscale 11], length 0
11:37:47.358704 IP 172.16.7.153.mysql > labstore1004.eqiad.wmnet.59510: Flags [S.], seq 329403381, ack 2815090187, win 28960, options [mss 1460,sackOK,TS val 38324302 ecr 3732952386,nop,wscale 9], length 0
aborrero@labstore1005:~$ telnet 172.16.7.153 3306
Trying 172.16.7.153...
telnet: Unable to connect to remote host: Connection timed out

aborrero@cloudnet1004:~$ sudo tcpdump -i eno50 host labstore1005.eqiad.wmnet and tcp port 3306
11:41:46.580806 IP labstore1005.eqiad.wmnet.48186 > 172.16.7.153.mysql: Flags [S], seq 735851217, win 42340, options [mss 1460,sackOK,TS val 3728352303 ecr 0,nop,wscale 11], length 0
11:41:46.581549 IP 172.16.7.153.mysql > labstore1005.eqiad.wmnet.48186: Flags [S.], seq 2973830138, ack 735851218, win 28960, options [mss 1460,sackOK,TS val 38384108 ecr 3728352303,nop,wscale 9], length 0

I would say we need something like this diff in the cloud-in4 filter:

@@ -1,5 +1,5 @@
 /* T216353 */
-term labsdb_return {
+term clouddb_return {
 	from {
 		destination-address {
 			/* labsdb1004, labsdb1005 */
@@ -10,6 +10,12 @@
 			10.64.37.12/32;
 			/* dbproxy1010, dbproxy1011 */
 			10.64.37.14/31;
+			/* labstore1004 */
+			10.64.37.19/32;
+			/* labstore1005 */
+			10.64.37.20/32;
+			/* nfs-tools.project.svc.eqiad.wmnet */
+			10.64.37.18/32;
 		}
 		protocol tcp;
 		source-port 3306;

i.e., this final configuration:

/* T216353 */
term clouddb_return {
	from {
		destination-address {
			/* labsdb1004, labsdb1005 */
			10.64.37.8/31;
			/* labsdb1006 */
			10.64.37.11/32;
			/* labsdb1007 */
			10.64.37.12/32;
			/* dbproxy1010, dbproxy1011 */
			10.64.37.14/31;
			/* labstore1004 */
			10.64.37.19/32;
			/* labstore1005 */
			10.64.37.20/32;
			/* nfs-tools.project.svc.eqiad.wmnet */
			10.64.37.18/32;
		}
		protocol tcp;
		source-port 3306;
	}
	then accept;
}
[edit firewall family inet filter cloud-in4]
       term labsdb { ... }
+      /* T216353 */
+      term clouddb_return {
+          from {
+              destination-address {
+                  /* labsdb1004, labsdb1005 */
+                  10.64.37.8/31;
+                  /* labsdb1006 */
+                  10.64.37.11/32;
+                  /* labsdb1007 */
+                  10.64.37.12/32;
+                  /* dbproxy1010, dbproxy1011 */
+                  10.64.37.14/31;
+                  /* labstore1005 */
+                  10.64.37.20/32;
+                  /* nfs-tools, labstore1004 */
+                  10.64.37.18/31;
+              }
+              protocol tcp;
+              source-port 3306;
+          }
+          then accept;
+      }
       term labmon--http { ... }
[edit firewall family inet filter cloud-in4]
-      /* T216353 */
-      term labsdb_return {
-          from {
-              destination-address {
-                  /* labsdb1004, labsdb1005 */
-                  10.64.37.8/31;
-                  /* labsdb1006 */
-                  10.64.37.11/32;
-                  /* labsdb1007 */
-                  10.64.37.12/32;
-                  /* dbproxy1010, dbproxy1011 */
-                  10.64.37.14/31;
-              }
-              protocol tcp;
-              source-port 3306;
-          }
-          then accept;
-      }
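
Note that the applied term lists 10.64.37.18/31 rather than separate /32 entries for 10.64.37.18 and 10.64.37.19. As a sanity check that the aggregated list is equivalent to the one requested, here is a minimal sketch using `ipaddress.collapse_addresses` from the Python standard library. The helper name `collapse` is hypothetical:

```python
from ipaddress import collapse_addresses, ip_network

# Prefixes as requested in the proposed configuration
REQUESTED = [
    "10.64.37.8/31",   # labsdb1004, labsdb1005
    "10.64.37.11/32",  # labsdb1006
    "10.64.37.12/32",  # labsdb1007
    "10.64.37.14/31",  # dbproxy1010, dbproxy1011
    "10.64.37.19/32",  # labstore1004
    "10.64.37.20/32",  # labstore1005
    "10.64.37.18/32",  # nfs-tools.project.svc.eqiad.wmnet
]


def collapse(prefixes):
    """Merge adjacent or overlapping prefixes and return them as sorted strings."""
    return [str(net) for net in collapse_addresses(ip_network(p) for p in prefixes)]


for prefix in collapse(REQUESTED):
    print(prefix)
```

Only 10.64.37.18/32 and 10.64.37.19/32 share a /31, so they are the only pair that merges; 10.64.37.20 stays a /32 because 10.64.37.21 is not in the list.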

Mentioned in SAL (#wikimedia-operations) [2019-02-18T12:28:31Z] <XioNoX> update clouddb_return term from cloud-in4 on cr1/2-eqiad - T216353

It works!

aborrero@labstore1004:~ $ telnet 172.16.7.153 3306
Trying 172.16.7.153...
Connected to 172.16.7.153.
Escape character is '^]'.
Y
5.5.5-10.1.38-MariaDB+$[{}"+R>��?�}}|tQx)_(D_4mysql_native_password^CConnection closed by foreign host.
aborrero@labstore1005:~ $ telnet 172.16.7.153 3306
Trying 172.16.7.153...
Connected to 172.16.7.153.
Escape character is '^]'.
Y
5.5.5-10.1.38-MariaDB,=Re$"zfn��?�v(_+R=ym5af\mysql_native_password^CConnection closed by foreign host.