
investigate HHVM mysqlExtension::ConnectTimeout
Closed, ResolvedPublicPRODUCTION ERROR

Description

Over the past few months we've seen an increase in aborted connections to databases:

Error connecting to 10.64.48.15: Can't connect to MySQL server on '10.64.48.15' (4)

They appear in surges from multiple app servers to multiple db slaves, with no apparent direct link to shard, wiki, or mariadb version. Note that these are aborts occurring during the initial TCP connection phase, not disconnects during query execution.

Mediawiki core database classes try to set the DB connection timeout to 3 seconds: DatabaseMysqli via MYSQLI_OPT_CONNECT_TIMEOUT, and DatabaseMysql via PHP's mysql.connect_timeout ini setting. It isn't clear whether HHVM respects either of these.

HHVM's php_mysql_do_connect_on_link() defaults to mysqlExtension::ConnectTimeout, which is 1 second. That is a little fragile during traffic spikes or near-outage conditions.

Update: The underlying issue seems to have been resolved by raising the hhvm.mysql.connect_timeout parameter, but we may want to compile it in as the default.
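The deployed fix corresponds to an ini override along these lines (a sketch of the relevant entry only; the actual file is managed via Puppet/hiera, as discussed below, and the value is in milliseconds):

```ini
; /etc/hhvm/fcgi.ini — relevant entry only (illustrative)
hhvm.mysql.connect_timeout = 3000
```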

Event Timeline

Springle raised the priority of this task from to Medium.
Springle updated the task description. (Show Details)
Springle added projects: acl*sre-team, HHVM.
Springle added subscribers: Springle, Joe.

This happened to me at least twice today when saving an edit, on multiple wikis (e.g. ruwiki, dewiki). In one case the edit was saved despite the error; in the other it was not.

Change 211155 had a related patch set uploaded (by BryanDavis):
Set HHVM mysql connection timeout to 3s

https://gerrit.wikimedia.org/r/211155

Springle raised the priority of this task from Medium to High.May 18 2015, 4:25 AM
Springle set Security to None.

Change 211155 merged by Ori.livneh:
Set HHVM mysql connection timeout to 3s on canary servers

https://gerrit.wikimedia.org/r/211155

Change 214295 had a related patch set uploaded (by BryanDavis):
Set HHVM mysql connection timeout to 3s on app and api servers

https://gerrit.wikimedia.org/r/214295

Change 214392 had a related patch set uploaded (by Giuseppe Lavagetto):
mediawiki: raise the mysql timeout to 3 seconds

https://gerrit.wikimedia.org/r/214392

Change 214392 merged by Giuseppe Lavagetto:
mediawiki: raise the mysql timeout to 3 seconds

https://gerrit.wikimedia.org/r/214392

Change 214295 merged by Ori.livneh:
Set HHVM mysql connection timeout to 3s on app and api servers

https://gerrit.wikimedia.org/r/214295

Maybe we have some cross-VLAN communication issues:

$ ggml -n 1000000 -o "{ip} {db_server}" -d 1h -m type:mediawiki -m channel:wfLogDBError 'message:"Error connecting to"' | awk '{gsub(/\.[0-9]+$/, "", $1); gsub(/\.[0-9]+$/, "", $2); print $1, $2}'|sort|uniq -c|sort -n

   1 10.192.1 10.64.16
   1 10.192.33 10.64.16
   1 10.64.1 10.64.48
   1 10.64.16 10.64.0
   1 10.64.17 10.64.32
   1 10.64.49 10.64.32
   1 208.80.154 10.64.16
   4 10.64.17 10.64.48
   5 10.64.33 10.64.16
   7 10.64.33 10.64.48
  14 127.0.0 10.64.32
  20 10.64.0 10.64.0
  20 10.64.32 10.64.0
  61 127.0.0 10.64.48
  91 10.64.16 10.64.32
 122 10.64.16 10.64.16
 151 10.64.16 10.64.48
 374 10.64.0 10.64.32
 450 10.64.32 10.64.32
1101 10.64.0 10.64.48
1148 10.64.0 10.64.16
1225 10.64.32 10.64.48
1318 10.64.32 10.64.16
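For reference, the awk stage in the pipeline above simply strips the last octet from both addresses so that errors can be grouped and counted by subnet-prefix pair. A minimal, self-contained sketch of that transform (the sample input lines are made up, not real log output):

```shell
# Strip the last octet from each of the two IP fields, then
# count occurrences of each (client prefix, db prefix) pair.
printf '%s\n' \
  '10.64.32.101 10.64.16.24' \
  '10.64.32.55 10.64.16.31' \
  '10.64.0.12 10.64.48.20' \
| awk '{gsub(/\.[0-9]+$/, "", $1); gsub(/\.[0-9]+$/, "", $2); print $1, $2}' \
| sort | uniq -c | sort -n
```

This yields one counted line per prefix pair, with the noisiest pairs sorted last, which is what makes the cross-row hotspots in the table above stand out.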

Changing the HHVM default connection timeout doesn't seem to have had a measurable effect on the error rate.

@faidon said on IRC that he remembered seeing similar issues in the past that were traced to a row with only 2 uplinks. This leads to some speculation that certain cross-rack/row combinations may be having bandwidth or latency problems.

@jcrespo did some investigation and found a strong correlation between QPS to the database servers and error rate (higher QPS, more errors; lower QPS, fewer errors). This in and of itself is not very informative, however. A ratio of activity to errors would be more likely to help with debugging.

Currently the average is almost 2 per second, and it has been fairly stable for a while. That is a lot of errors. Should we report an HHVM bug about it not respecting the connection timeout setting?

@mmodell Yes, this is causing real problems: for example, we were unable to detect T101182 for a while because of the noise this was creating (which had a very similar pattern).

But I do not see any corresponding HHVM configuration change on our side from the last patch:

root@mw1026:/etc$ grep -R "mysql.con" *
Binary file alternatives/php matches
apparmor.d/usr.sbin.mysqld:  /etc/mysql/conf.d/ r,
apparmor.d/usr.sbin.mysqld:  /etc/mysql/conf.d/* r,
grep: blkid.tab: No such file or directory
init/mysql.conf:    ERR_LOGGER="logger -p daemon.err -t /etc/init/mysql.conf -i"
mysql/my.cnf:!includedir /etc/mysql/conf.d/
grep: nologin: No such file or directory
php5/apache2/php.ini:mysql.connect_timeout = 1
php5/cli/php.ini:mysql.connect_timeout = 1
jynus@mw1026:~$ cat test.php
<?php 
echo ini_get('hhvm.mysql.slow_query_threshold')."\n";
echo ini_get('hhvm.mysql.connect_timeout');
?>
jynus@mw1026:~$ hhvm test.php
10000
1000

Only non-hiera configuration is present, and ini_get still reports the compiled-in 1000 ms default.

While a proper ini config fix would be both welcome and necessary... I have discussed with @Joe the possibility of simply patching HHVM's hardcoded mysqlExtension::ConnectTimeout to 3000 ms in our next build.

Entirely his call though, since he's doing all the work.

We have added what seemed to be the proper hiera configuration:

hhvm::extra::fcgi_settings:
  hhvm:
    mysql:
      connect_timeout: 3000
hhvm::extra::cli_settings:
  hhvm:
    mysql:
      connect_timeout: 3000

but I am not seeing that reflected in the config on the HHVM servers:

$ ssh mw1026.eqiad.wmnet
$ grep mysql /etc/hhvm/fcgi.ini
hhvm.mysql.slow_query_threshold = 10000
hhvm.mysql.typed_results = false

Pretty obviously the config patches I made are not correct for our Puppet process.

Change 215931 had a related patch set uploaded (by Giuseppe Lavagetto):
hhvm: actually set the timeout on normal appserver, restore on canaries

https://gerrit.wikimedia.org/r/215931

Change 215931 merged by Giuseppe Lavagetto:
hhvm: actually set the timeout on normal appserver, restore on canaries

https://gerrit.wikimedia.org/r/215931

The setting is now applied everywhere.
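A minimal sketch of the kind of check used to confirm this on an appserver; the sample file below stands in for /etc/hhvm/fcgi.ini (its contents mirror the entries quoted earlier in this task, plus the fix):

```shell
# Write a sample ini file mirroring the deployed fcgi.ini, then
# confirm the connect_timeout override is present.
cat > /tmp/fcgi-sample.ini <<'EOF'
hhvm.mysql.slow_query_threshold = 10000
hhvm.mysql.typed_results = false
hhvm.mysql.connect_timeout = 3000
EOF
grep '^hhvm.mysql.connect_timeout' /tmp/fcgi-sample.ini
```

On a real appserver the same grep would be run against /etc/hhvm/fcgi.ini directly, as in the earlier mw1026 transcript.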

According to logstash the error rate has gone from ~90/minute before the patch to ~2/minute after.

So I assumed the original change didn't work because of something wrong with that ini setting. In fact, it turned out that the patch for the canaries (which I babysat) had the correct entries but was later removed, and that the change for the normal appservers had simply been wrong all along.

I don't think we need to hardcode our preferred value there anymore.

I still need to do a patch so that this value is the default for us from now on.

jcrespo lowered the priority of this task from High to Medium.
jcrespo updated the task description. (Show Details)
jcrespo removed a project: Patch-For-Review.

Thank you everyone, that largely removed the spam we were seeing in logstash! Kudos to everyone involved!

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:12 PM