
codfw frontends cannot connect to mysql at db2029
Closed, ResolvedPublic

Description

This is an example log error; all connections seem to fail:

{
  "_index": "logstash-2015.07.02",
  "_type": "mediawiki",
  "_id": "N9x_4zjSQ_GFOq_TMq7B0Q",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.192.16.17: Can't connect to MySQL server on '10.192.16.17' (4)",
    "@version": 1,
    "@timestamp": "2015-07-02T14:13:59.391Z",
    "type": "mediawiki",
    "host": "mw2206",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/w/api.php",
    "ip": "10.192.33.6",
    "http_method": "GET",
    "server": "en.wikipedia.org",
    "referrer": null,
    "uid": "ff1f951",
    "process_id": 25817,
    "wiki": "enwiki",
    "db_server": "10.192.16.17",
    "db_name": "metawiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": "Can't connect to MySQL server on '10.192.16.17' (4)",
    "normalized_message": "Error connecting to 10.192.16.17: Can't connect to MySQL server on '10.192.16.17' (4)"
  },
  "sort": [
    1435846439391
  ]
}

Curl from the MediaWiki hosts to the MySQL port works, and the grants seem to include the host for the user wikiuser. We will need more investigation to understand why this is failing. skip_name_resolve is enabled too.
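
For reference, a minimal sketch of how those two assumptions could be double-checked from the db2029 console; the wikiuser host pattern below is a placeholder, not the actual production grant:

-- List the hosts wikiuser is actually granted from
SELECT user, host FROM mysql.user WHERE user = 'wikiuser';
-- Show the grants for one host pattern (hypothetical pattern shown)
SHOW GRANTS FOR 'wikiuser'@'10.192.%';
-- With skip_name_resolve enabled, grants must match by IP, not hostname
SHOW GLOBAL VARIABLES LIKE 'skip_name_resolve';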

Event Timeline

jcrespo raised the priority of this task from to Medium.
jcrespo updated the task description. (Show Details)
jcrespo added projects: acl*sre-team, DBA.
jcrespo subscribed.
jcrespo renamed this task from codfw frontends cannot connect to db2029 to codfw frontends cannot connect to mysql at db2029. Jul 2 2015, 2:20 PM
jcrespo updated the task description. (Show Details)
jcrespo set Security to None.
jcrespo added a subscriber: Springle.

(4) == EINTR on connect. Presumably related to the max_connections you observed, which in turn is possibly something to do with:

  • hhvm timeout (but presumably T98489 was deployed to CODFW, so this seems unlikely?)
  • db2029 still has the old values for thread_pool_size=8 (now 32) and thread_pool_stall_limit=500 (now 100), which would make it easier to starve and/or stall; a status-counter check is sketched after this list. It needs a restart to pick up the production config changes. Probably other CODFW boxes do too.

Also, thread_pool_size=8 is unusually low regardless. The variable defaults to the number of CPU cores IIRC, which is usually a minimum of 16 even on old hardware, so we should check whether db2029 has hyperthreading disabled.

wmf db2029 3306 root (none)> show global variables like 'thread%';
+---------------------------+-----------------+
| Variable_name             | Value           |
+---------------------------+-----------------+
| thread_cache_size         | 300             |
| thread_concurrency        | 10              |
| thread_handling           | pool-of-threads |
| thread_pool_idle_timeout  | 60              |
| thread_pool_max_threads   | 500             |
| thread_pool_oversubscribe | 3               |
| thread_pool_size          | 8               |
| thread_pool_stall_limit   | 500             |
| thread_stack              | 196608          |
+---------------------------+-----------------+
9 rows in set (0.30 sec)
wmf db1072 3306 root (none)> show global variables like 'thread%';
+---------------------------+-----------------+
| Variable_name             | Value           |
+---------------------------+-----------------+
| thread_cache_size         | 300             |
| thread_concurrency        | 10              |
| thread_handling           | pool-of-threads |
| thread_pool_idle_timeout  | 60              |
| thread_pool_max_threads   | 500             |
| thread_pool_oversubscribe | 3               |
| thread_pool_size          | 32              |
| thread_pool_stall_limit   | 100             |
| thread_stack              | 196608          |
+---------------------------+-----------------+
9 rows in set (0.27 sec)
  • Some other network issue.
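
As an aside, a minimal way to check whether the pool is actually starving (assuming MariaDB's pool-of-threads status counters, which should be present given thread_handling above) would be:

-- Compare busy vs idle worker threads; a fully busy pool alongside
-- piled-up "unauthenticated user" entries would point at pool starvation
-- rather than a grants or network problem
SHOW GLOBAL STATUS LIKE 'Threadpool%';
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';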

Network connectivity is OK (I cannot rule out it being too slow or some other problem): I can curl to the mysql port and I can see the connections initiating in netstat.

There was effectively a max_connections issue, but only after the avalanche maxed out the connections.
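
For the record, a sketch of the standard counters that would confirm that (nothing here is specific to db2029's configuration):

-- If Max_used_connections has reached max_connections, the avalanche
-- hit the ceiling at some point since the last restart
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';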

I think *all connections fail*, or all from a subset of servers: the errors start immediately, the second it is repooled, and the processlist shows the connections as "not authenticated" yet. I do not think it is a case of "configuration makes it too slow", especially given that the load is not that excessive (they are only health checks), but obviously I will test the above suggestion first (which has to be changed anyway).

Note that ALL servers on codfw started receiving a small amount of traffic recently, but only this one showed this symptom.

A bunch of "unauthenticated user" entries in the processlist still makes me suspect the thread pool, since that symptom has been seen on prod slaves with thread_pool_size=16 (but not the immediate all-connections-fail, which is indeed odd).
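
If it helps, a quick processlist breakdown (plain information_schema query, nothing cluster-specific) that makes that pattern visible:

-- Group connections by user, command and state; a large bucket of
-- 'unauthenticated user' rows suggests handshakes are stalling in the
-- thread pool rather than being rejected by grants
SELECT user, command, state, COUNT(*) AS conns
FROM information_schema.processlist
GROUP BY user, command, state
ORDER BY conns DESC;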

$0.02

I've reset the configuration to the puppet defaults, upgraded and restarted the server, and now it seems to work as it should.

The upgrade and reapplication of grants fixed the issue.
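
The exact statements are not recorded in this task, but the reapplication would look roughly like the following; the privilege list, database and host pattern are placeholders, not the real production grants:

-- Hypothetical example only: actual privileges, database and host range
-- are managed elsewhere and are not part of this task
GRANT SELECT, INSERT, UPDATE, DELETE ON `enwiki`.* TO 'wikiuser'@'10.192.%';
SHOW GRANTS FOR 'wikiuser'@'10.192.%';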