
Perform testing for TLS effect on connection rate
Closed, ResolvedPublic

Description

To figure out what range of MediaWiki / mariadb traffic will use TLS and whether proxying is a must-have, we need to test the connection rate with a test mariadb host. It could be a new host or possibly an existing codfw one getting no traffic (aside from replication).

The client host could probably just be terbium/wasat. Test cases are:
a) Within eqiad
b) eqiad <=> codfw

Event Timeline

Hey,

It would be nice to do a test with MariaDB 10.0 and 10.1 if possible, to see if there are any regressions.
For that matter, on codfw you can use pretty much any slave for 10.0 (they all have pretty much the same HW), so for the sake of picking one from s1:
db2048

If you want to do the same test with MariaDB 10.1, you could use db2062. db2062 has lately been used to reclone some hosts, so you might want to give us a heads-up before using it, to make sure it doesn't have mysql down for one of those maintenances.


I noticed wasat does not have /etc/mysql/ssl, even though those files are among the default client TLS parameters. I guess it has $ssl set to off where it is used in mariadb-config.

I keep getting times like:

Same-DC (db2070.codfw.wmnet):
string(56) "0.10926739454269 sec/conn (non-SSL) [db2070.codfw.wmnet]"
string(58) "0.036373572349548 sec/query (non-SSL) [db2070.codfw.wmnet]"
string(51) "0.2467001414299 sec/conn (SSL) [db2070.codfw.wmnet]"
string(54) "0.036427383422852 sec/query (SSL) [db2070.codfw.wmnet]"
Cross-DC (db1055.eqiad.wmnet):
string(56) "0.10925793647766 sec/conn (non-SSL) [db1055.eqiad.wmnet]"
string(58) "0.036399509906769 sec/query (non-SSL) [db1055.eqiad.wmnet]"
string(52) "0.24790946722031 sec/conn (SSL) [db1055.eqiad.wmnet]"
string(54) "0.036438231468201 sec/query (SSL) [db1055.eqiad.wmnet]"

The same-DC times seem oddly the same as cross-DC, which doesn't seem right.

Do you have a set of instructions you run, so I can reproduce it and check with tcpdump that TLS is effectively enabled, plus see the state of connections, etc.? Where do you run it (host, path), and which code do you execute?

Hi @aaron, I would still like to reproduce your results. Meanwhile, I thought of a reason why that could be: with semisync, we deliberately slow down writes so that we wait an extra round trip to one of the replicas, to make sure data is on at least 2 servers before committing. This could contribute to a slowdown to start with- not because TLS is that fast, but because the write path is already slowed for reliability reasons.

I would need your test to confirm that suspicion- the effect should still be smaller than the cross-DC one.
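The semisync effect described above can be sketched with a simple latency model: a semisync commit cannot finish before at least one replica has ACKed, so its floor is roughly the base commit time plus one replica round trip. A minimal sketch (the ~36 ms cross-DC round trip is taken from the query times above; the other numbers are illustrative, not measurements):

```python
def commit_latency(base_commit_ms, replica_rtt_ms, semisync=True):
    """Model of commit latency: a semisync commit pays at least one
    extra round trip to the nearest ACKing replica; an async commit
    does not wait for any replica."""
    if semisync:
        return base_commit_ms + replica_rtt_ms
    return base_commit_ms

# Illustrative base commit time of 1 ms, with a same-DC replica
# (~0.2 ms RTT) vs. a cross-DC replica (~36 ms RTT, codfw <-> eqiad).
print(commit_latency(1.0, 0.2))                    # same-DC semisync: 1.2
print(commit_latency(1.0, 36.0))                   # cross-DC semisync: 37.0
print(commit_latency(1.0, 36.0, semisync=False))   # async: 1.0
```

With a cross-DC ACKer, the replica round trip dominates the commit, which is why a pre-existing semisync penalty could mask a modest TLS overhead.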

I fixed a stupid hostname var bug. Now I get numbers that make sense:

Same-DC (db2070.codfw.wmnet):
string(57) "0.001196186542511 sec/conn (non-SSL) [db2070.codfw.wmnet]"
string(60) "0.00027136325836182 sec/query (non-SSL) [db2070.codfw.wmnet]"
string(53) "0.059528641700745 sec/conn (SSL) [db2070.codfw.wmnet]"
string(56) "0.00028834581375122 sec/query (SSL) [db2070.codfw.wmnet]"
Cross-DC (db1055.eqiad.wmnet):
string(56) "0.10918385744095 sec/conn (non-SSL) [db1055.eqiad.wmnet]"
string(57) "0.03636349439621 sec/query (non-SSL) [db1055.eqiad.wmnet]"
string(52) "0.25189030647278 sec/conn (SSL) [db1055.eqiad.wmnet]"
string(54) "0.036419949531555 sec/query (SSL) [db1055.eqiad.wmnet]"

The code is run from wasat, in my home dir (simpleConnTest.php):

<?php
// Quick benchmark: average connect and query times over 100 iterations,
// with and without TLS, against a given host.

function getConn( $host, $pass, $ssl = false ) {
        $mysqli = new mysqli();
        $mysqli->init();
        if ( $ssl ) {
                // Arguments: client key, client cert, CA cert, CA path, cipher list.
                $mysqli->ssl_set( '/etc/mysql/ssl/server.key', '/etc/mysql/ssl/cert.pem', '/etc/ssl/certs/Puppet_Internal_CA.pem', null, 'TLSv1.2' );
        }

        $mysqli->real_connect( $host, 'wikiadmin', $pass, 'information_schema', 3306 ) or die( 'Could not connect' );

        return $mysqli;
}

$pass = readline( "MYSQL password:\n" );

function run( $host, $pass ) {
        $etConnSSL = $etConn = 0;
        $etQuerySSL = $etQuery = 0;
        for ( $i = 0; $i < 100; ++$i ) {
                // TLS connection + query
                $t1 = microtime( true );
                $mysqli = getConn( $host, $pass, true );
                $etConnSSL += microtime( true ) - $t1;

                $t1 = microtime( true );
                $res = $mysqli->query( 'SELECT @@ssl_cipher' );
                $etQuerySSL += microtime( true ) - $t1;

                $mysqli->close();

                // Plain connection + query
                $t1 = microtime( true );
                $mysqli = getConn( $host, $pass, false );
                $etConn += microtime( true ) - $t1;

                $t1 = microtime( true );
                $res = $mysqli->query( 'SELECT @@ssl_cipher' );
                $etQuery += microtime( true ) - $t1;

                $mysqli->close();
        }

        // Report per-iteration averages
        var_dump( $etConn / 100 . " sec/conn (non-SSL) [$host]" );
        var_dump( $etQuery / 100 . " sec/query (non-SSL) [$host]" );

        var_dump( $etConnSSL / 100 . " sec/conn (SSL) [$host]" );
        var_dump( $etQuerySSL / 100 . " sec/query (SSL) [$host]" );
}

$host = 'db2070.codfw.wmnet';
echo "Same-DC ($host):\n";
run( $host, $pass );
$host = 'db1055.eqiad.wmnet';
echo "Cross-DC ($host):\n";
run( $host, $pass );

So actually, that is not really so bad: query times are similar (only some small overhead), while connection times double or more for the cross-DC option. So the only thing left to resolve is having persistent connections ready in advance?- which is what I tried and failed to do with ProxySQL due to its limitations. Maybe there is a better connection pooling solution?
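The pooling idea above amounts to paying the (TLS) connect cost once per pooled connection instead of once per request. A minimal sketch, using a hypothetical `factory` callable standing in for whatever actually opens a connection (e.g. a mysqli TLS connect):

```python
import queue

class ConnectionPool:
    """Minimal connection pool: connections are created up front, so the
    expensive handshake happens once per slot, then each one is reused."""
    def __init__(self, factory, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())  # handshake cost paid here, in advance

    def acquire(self):
        return self._pool.get()        # blocks if all connections are in use

    def release(self, conn):
        self._pool.put(conn)

# Demo with a counting factory: 3 pooled connections serve 10 requests
# while the "handshake" runs only 3 times.
made = []
pool = ConnectionPool(lambda: made.append(1) or len(made), size=3)
for _ in range(10):
    conn = pool.acquire()
    pool.release(conn)
print(len(made))  # 3
```

A real pooler also needs health checks and reconnection on failure, which is essentially what ProxySQL provides in front of MariaDB.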

One thing I just realized is that there could be some connection overhead on db1055- I will (or you can) try db1111, which should be idle, for a better comparison to the idle codfw hosts. Or we can try busy servers in both cases. For selects, we should also read from real tables- e.g. heartbeat. SELECTing a variable has the least possible query overhead (results come from memory), making the results unrealistic for the typical query (other than the useless SETs we run).

Is the goal here just to quantify the impact? Or is there a target connect time/query time that we're trying to achieve?

These are my results with your script, just changing the query to run on a real table (heartbeat) and with more similar-hardware servers:

Same-DC (db2033.codfw.wmnet):
string(57) "0.001132071018219 sec/conn (non-SSL) [db2033.codfw.wmnet]"
string(60) "0.00024072647094727 sec/query (non-SSL) [db2033.codfw.wmnet]"
string(53) "0.057012629508972 sec/conn (SSL) [db2033.codfw.wmnet]"
string(56) "0.00025907039642334 sec/query (SSL) [db2033.codfw.wmnet]"
Cross-DC (db1031.eqiad.wmnet):
string(55) "0.1113884806633 sec/conn (non-SSL) [db1031.eqiad.wmnet]"
string(58) "0.036313643455505 sec/query (non-SSL) [db1031.eqiad.wmnet]"
string(52) "0.22943157196045 sec/conn (SSL) [db1031.eqiad.wmnet]"
string(54) "0.036422135829926 sec/query (SSL) [db1031.eqiad.wmnet]"

Query time seems to be 2-6% slower; connection time with TLS is 2x slower (remote) to 50x slower (local DC). We still need to check the throughput impact with more threads, as I think we are more permissive with write latency. In any case, it seems to me that the problem is not so much the rollout of TLS, but the remote connections being ~100x slower than local ones (110ms to 230ms just to connect is a huge penalty). I know that mostly goes away with a proxy, but proxysql does not support TLSv1.2. Should we think about alternative connection pooling solutions?
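The connect penalty above matters less the more queries a connection serves: the effective per-query cost is connect time divided by queries per connection, plus the query time itself. Plugging in the measured cross-DC TLS numbers from this comment (~0.229 s/conn, ~0.0364 s/query):

```python
def amortized_cost(conn_s, query_s, queries_per_conn):
    """Effective per-query cost when one connection is reused for N queries."""
    return conn_s / queries_per_conn + query_s

# Measured cross-DC TLS figures from the benchmark above.
conn_s, query_s = 0.229, 0.0364
for n in (1, 10, 100):
    print(n, round(amortized_cost(conn_s, query_s, n), 4))
# 1   -> 0.2654  (connect dominates)
# 10  -> 0.0593
# 100 -> 0.0387  (connect cost nearly amortized away)
```

With one query per connection the handshake dominates; with ~100 queries per connection the cost approaches the raw query time, which is the case a pooler or proxy approximates.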

If I use proxysql pointing to db1031, we get better connection results than querying a host on the same DC directly, even though the backend is remote:

Cross-DC ():
string(40) "0.0002328896522522 sec/conn (non-SSL) []"
string(40) "0.036425504684448 sec/query (non-SSL) []"

The host is empty because, for some reason, connecting via a socket on HHVM requires a NULL host.

I will give ProxySQL a second look to see if enabling TLS 1.2 is a viable option. Using it would be useful not only for cross-DC connections but also for regular ones (compare its numbers to the Same-DC ones above).

Change 404154 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] proxysql: Changes added (simplifications) to the proxysql class

https://gerrit.wikimedia.org/r/404154

I have uploaded the proxysql 2.0 (still in development) version for Debian stretch; I will upload the one for jessie on Monday and will check whether that fixes TLS 1.2 support.

Change 431720 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] proxysql: Changes added (simplifications) to the proxysql class

https://gerrit.wikimedia.org/r/431720

Change 404154 abandoned by Jcrespo:
proxysql: Changes added (simplifications) to the proxysql class

Reason:
Obsoleted by https://gerrit.wikimedia.org/r/431720

https://gerrit.wikimedia.org/r/404154

Change 431720 merged by Jcrespo:
[operations/puppet@production] proxysql: Changes added (simplifications) to the proxysql class

https://gerrit.wikimedia.org/r/431720

I am going to consider this resolved- the testing was done. It is not enough on its own, but we cannot keep this open forever. We will move more specific actionables to a separate ticket.