
investigate HHVM mysqlExtension::ConnectTimeout
Closed, ResolvedPublicPRODUCTION ERROR

Description

Over the past few months we've seen an increase in aborted connections to databases:

Error connecting to 10.64.48.15: Can't connect to MySQL server on '10.64.48.15' (4)

They appear in surges from multiple app servers to multiple db slaves, with no apparent direct link to shard, wiki, or mariadb version. Note that these are aborts occurring during the initial TCP connection phase, not disconnects during query execution.

Mediawiki core database classes try to set the DB connection timeout to 3 seconds: DatabaseMysqli via MYSQLI_OPT_CONNECT_TIMEOUT, and DatabaseMysql via PHP's mysql.connect_timeout ini setting. It isn't clear whether HHVM respects either of these.

HHVM's php_mysql_do_connect_on_link() defaults to mysqlExtension::ConnectTimeout, which is 1 second. That is a little fragile during traffic spikes or near-outage conditions.

Update: The underlying issue seems to have been resolved by raising the hhvm.mysql.connect_timeout parameter, but we may want to compile it in as the default.
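The deployed fix corresponds to an ini override along these lines (a sketch of the relevant entry only; the actual file is managed via Puppet/hiera, as discussed below, and the value is in milliseconds):

```ini
; /etc/hhvm/fcgi.ini — relevant entry only (illustrative)
hhvm.mysql.connect_timeout = 3000
```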

Event Timeline

Springle raised the priority of this task from to Medium.
Springle updated the task description. (Show Details)
Springle added projects: acl*sre-team, HHVM.
Springle added subscribers: Springle, Joe.

This happened to me at least twice today when saving an edit, on multiple wikis (e.g. ruwiki, dewiki). In one case the edit was saved despite the error; in the other it was not.

Change 211155 had a related patch set uploaded (by BryanDavis):
Set HHVM mysql connection timeout to 3s

https://gerrit.wikimedia.org/r/211155

Springle raised the priority of this task from Medium to High.May 18 2015, 4:25 AM
Springle set Security to None.

Change 211155 merged by Ori.livneh:
Set HHVM mysql connection timeout to 3s on canary servers

https://gerrit.wikimedia.org/r/211155

Change 214295 had a related patch set uploaded (by BryanDavis):
Set HHVM mysql connection timeout to 3s on app and api servers

https://gerrit.wikimedia.org/r/214295

Change 214392 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: raise the mysql timeout to 3 seconds

https://gerrit.wikimedia.org/r/214392

Change 214392 merged by Giuseppe Lavagetto:
mediawiki: raise the mysql timeout to 3 seconds

https://gerrit.wikimedia.org/r/214392

Change 214295 merged by Ori.livneh:
Set HHVM mysql connection timeout to 3s on app and api servers

https://gerrit.wikimedia.org/r/214295

Maybe we have some cross-VLAN communication issues:

$ ggml -n 1000000 -o "{ip} {db_server}" -d 1h -m type:mediawiki -m channel:wfLogDBError 'message:"Error connecting to"' | awk '{gsub(/\.[0-9]+$/, "", $1); gsub(/\.[0-9]+$/, "", $2); print $1, $2}'|sort|uniq -c|sort -n

   1 10.192.1 10.64.16
   1 10.192.33 10.64.16
   1 10.64.1 10.64.48
   1 10.64.16 10.64.0
   1 10.64.17 10.64.32
   1 10.64.49 10.64.32
   1 208.80.154 10.64.16
   4 10.64.17 10.64.48
   5 10.64.33 10.64.16
   7 10.64.33 10.64.48
  14 127.0.0 10.64.32
  20 10.64.0 10.64.0
  20 10.64.32 10.64.0
  61 127.0.0 10.64.48
  91 10.64.16 10.64.32
 122 10.64.16 10.64.16
 151 10.64.16 10.64.48
 374 10.64.0 10.64.32
 450 10.64.32 10.64.32
1101 10.64.0 10.64.48
1148 10.64.0 10.64.16
1225 10.64.32 10.64.48
1318 10.64.32 10.64.16
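For reference, the awk stage in the pipeline above simply strips the last octet from both addresses so that errors can be grouped and counted by subnet-prefix pair. A minimal, self-contained sketch of that transform (the sample input lines are made up, not real log output):

```shell
# Strip the last octet from each of the two IP fields, then
# count occurrences of each (client prefix, db prefix) pair.
printf '%s\n' \
  '10.64.32.101 10.64.16.24' \
  '10.64.32.55 10.64.16.31' \
  '10.64.0.12 10.64.48.20' \
| awk '{gsub(/\.[0-9]+$/, "", $1); gsub(/\.[0-9]+$/, "", $2); print $1, $2}' \
| sort | uniq -c | sort -n
```

This yields one counted line per prefix pair, with the noisiest pairs sorted last, which is what makes the cross-row hotspots in the table above stand out.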

Changing the HHVM default connection timeout doesn't seem to have had a measurable effect on the error rate.

@faidon said on IRC that he remembered seeing similar issues in the past that were traced to a row with only 2 uplinks. This leads to some speculation that certain cross-rack/row combinations may be having bandwidth or latency problems.

@jcrespo did some investigation and found a strong correlation between QPS to the database servers and error rate (higher QPS, more errors; lower QPS, fewer errors). This in and of itself is not very informative, however. A ratio of activity to errors would be more likely to help with debugging.

Currently the average is almost 2 per second, and it has been fairly stable for a while. That is a lot of errors. Should we report an HHVM bug about it not respecting the connection timeout setting?

@mmodell Yes, this is causing real problems: for example, we were unable to detect T101182 for a while because of the noise this was creating (which had a very similar pattern).

But I do not see any corresponding HHVM configuration change on our side from the last patch:

root@mw1026:/etc$ grep -R "mysql.con" *
Binary file alternatives/php matches
apparmor.d/usr.sbin.mysqld:  /etc/mysql/conf.d/ r,
apparmor.d/usr.sbin.mysqld:  /etc/mysql/conf.d/* r,
grep: blkid.tab: No such file or directory
init/mysql.conf:    ERR_LOGGER="logger -p daemon.err -t /etc/init/mysql.conf -i"
mysql/my.cnf:!includedir /etc/mysql/conf.d/
grep: nologin: No such file or directory
php5/apache2/php.ini:mysql.connect_timeout = 1
php5/cli/php.ini:mysql.connect_timeout = 1
jynus@mw1026:~$ cat test.php
<?php 
echo ini_get('hhvm.mysql.slow_query_threshold')."\n";
echo ini_get('hhvm.mysql.connect_timeout');
?>
jynus@mw1026:~$ hhvm test.php
10000
1000

Only non-hiera configuration is present, and ini_get still reports the compiled-in 1000 ms default.

While a proper ini config fix would be both welcome and necessary... I have discussed with @Joe the possibility of simply patching HHVM's hardcoded mysqlExtension::ConnectTimeout to 3000 ms in our next build.

Entirely his call though, since he's doing all the work.

We have added what seemed to be the proper hiera configuration:

hhvm::extra::fcgi_settings:
  hhvm:
    mysql:
      connect_timeout: 3000
hhvm::extra::cli_settings:
  hhvm:
    mysql:
      connect_timeout: 3000

but I am not seeing that reflected in the config on the HHVM servers:

$ ssh mw1026.eqiad.wmnet
$ grep mysql /etc/hhvm/fcgi.ini
hhvm.mysql.slow_query_threshold = 10000
hhvm.mysql.typed_results = false

Pretty obviously the config patches I made are not correct for our Puppet process.

Change 215931 had a related patch set uploaded (by Giuseppe Lavagetto):
hhvm: actually set the timeout on normal appserver, restore on canaries

https://gerrit.wikimedia.org/r/215931

Change 215931 merged by Giuseppe Lavagetto:
hhvm: actually set the timeout on normal appserver, restore on canaries

https://gerrit.wikimedia.org/r/215931

The setting is now applied everywhere.
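A minimal sketch of the kind of check used to confirm this on an appserver; the sample file below stands in for /etc/hhvm/fcgi.ini (its contents mirror the entries quoted earlier in this task, plus the fix):

```shell
# Write a sample ini file mirroring the deployed fcgi.ini, then
# confirm the connect_timeout override is present.
cat > /tmp/fcgi-sample.ini <<'EOF'
hhvm.mysql.slow_query_threshold = 10000
hhvm.mysql.typed_results = false
hhvm.mysql.connect_timeout = 3000
EOF
grep '^hhvm.mysql.connect_timeout' /tmp/fcgi-sample.ini
```

On a real appserver the same grep would be run against /etc/hhvm/fcgi.ini directly, as in the earlier mw1026 transcript.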

According to logstash the error rate has gone from ~90/minute before the patch to ~2/minute after.

So I assumed the original change didn't work because of something wrong with that ini setting. In fact, it turned out that the patch for the canaries (which I babysat) had the correct entries but was later removed, and that the change for the normal appservers had simply been wrong all along.

I don't think we need to hardcode our preferred value there anymore.

I still need to do a patch so that this value is the default for us from now on.

jcrespo lowered the priority of this task from High to Medium.
jcrespo updated the task description. (Show Details)
jcrespo removed a project: Patch-For-Review.

Thank you everyone, that largely removed the spam we were seeing in logstash! Kudos to everyone involved!

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:12 PM