
codfw frontends cannot connect to mysql at db2029
Closed, ResolvedPublic

Description

This is an example log error; all connections seem to fail:

{
  "_index": "logstash-2015.07.02",
  "_type": "mediawiki",
  "_id": "N9x_4zjSQ_GFOq_TMq7B0Q",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.192.16.17: Can't connect to MySQL server on '10.192.16.17' (4)",
    "@version": 1,
    "@timestamp": "2015-07-02T14:13:59.391Z",
    "type": "mediawiki",
    "host": "mw2206",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/w/api.php",
    "ip": "10.192.33.6",
    "http_method": "GET",
    "server": "en.wikipedia.org",
    "referrer": null,
    "uid": "ff1f951",
    "process_id": 25817,
    "wiki": "enwiki",
    "db_server": "10.192.16.17",
    "db_name": "metawiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": "Can't connect to MySQL server on '10.192.16.17' (4)",
    "normalized_message": "Error connecting to 10.192.16.17: Can't connect to MySQL server on '10.192.16.17' (4)"
  },
  "sort": [
    1435846439391
  ]
}

Curl from the MediaWiki hosts to the MySQL port works, and the grants seem to include the host for the user wikiuser. We will need more investigation to understand why this is failing. skip_name_resolve is enabled too.
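
For reference, a minimal sketch of how those two assumptions could be double-checked from the db2029 console; the wikiuser host pattern below is a placeholder, not the actual production grant:

-- List the hosts wikiuser is actually granted from
SELECT user, host FROM mysql.user WHERE user = 'wikiuser';
-- Show the grants for one host pattern (hypothetical pattern shown)
SHOW GRANTS FOR 'wikiuser'@'10.192.%';
-- With skip_name_resolve enabled, grants must match by IP, not hostname
SHOW GLOBAL VARIABLES LIKE 'skip_name_resolve';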

Event Timeline

jcrespo raised the priority of this task from to Medium.
jcrespo updated the task description. (Show Details)
jcrespo added projects: acl*sre-team, DBA.
jcrespo subscribed.
jcrespo renamed this task from codfw frontends cannot connect to db2029 to codfw frontends cannot connect to mysql at db2029. Jul 2 2015, 2:20 PM
jcrespo updated the task description. (Show Details)
jcrespo set Security to None.
jcrespo added a subscriber: Springle.

(4) == EINTR on connect. Presumably related to the max_connections you observed, which in turn is possibly something to do with:

  • hhvm timeout (but presumably T98489 was deployed to CODFW, so this seems unlikely?)
  • db2029 still has the old values for thread_pool_size=8 (now 32) and thread_pool_stall_limit=500 (now 100), which would make it easier to starve and/or stall; a status-counter check is sketched after this list. It needs a restart to pick up the production config changes. Probably other CODFW boxes do too.

Also, thread_pool_size=8 is unusually low regardless. The variable defaults to the number of CPU cores IIRC, which is usually a minimum of 16 even on old hardware, so we should check whether db2029 has hyperthreading disabled.

wmf db2029 3306 root (none)> show global variables like 'thread%';
+---------------------------+-----------------+
| Variable_name             | Value           |
+---------------------------+-----------------+
| thread_cache_size         | 300             |
| thread_concurrency        | 10              |
| thread_handling           | pool-of-threads |
| thread_pool_idle_timeout  | 60              |
| thread_pool_max_threads   | 500             |
| thread_pool_oversubscribe | 3               |
| thread_pool_size          | 8               |
| thread_pool_stall_limit   | 500             |
| thread_stack              | 196608          |
+---------------------------+-----------------+
9 rows in set (0.30 sec)
wmf db1072 3306 root (none)> show global variables like 'thread%';
+---------------------------+-----------------+
| Variable_name             | Value           |
+---------------------------+-----------------+
| thread_cache_size         | 300             |
| thread_concurrency        | 10              |
| thread_handling           | pool-of-threads |
| thread_pool_idle_timeout  | 60              |
| thread_pool_max_threads   | 500             |
| thread_pool_oversubscribe | 3               |
| thread_pool_size          | 32              |
| thread_pool_stall_limit   | 100             |
| thread_stack              | 196608          |
+---------------------------+-----------------+
9 rows in set (0.27 sec)
  • Some other network issue.
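
As an aside, a minimal way to check whether the pool is actually starving (assuming MariaDB's pool-of-threads status counters, which should be present given thread_handling above) would be:

-- Compare busy vs idle worker threads; a fully busy pool alongside
-- piled-up "unauthenticated user" entries would point at pool starvation
-- rather than a grants or network problem
SHOW GLOBAL STATUS LIKE 'Threadpool%';
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';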

Network connectivity is OK (I cannot rule out it being too slow or some other problem): I can curl to the mysql port and I can see the connections initiating in netstat.

There was effectively a max_connections issue, but only after the avalanche maxed out the connections.
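
For the record, a sketch of the standard counters that would confirm that (nothing here is specific to db2029's configuration):

-- If Max_used_connections has reached max_connections, the avalanche
-- hit the ceiling at some point since the last restart
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';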

I think *all connections fail*, or all from a subset of servers: the errors start immediately, the second it is repooled, and the processlist shows the connections as "not authenticated" yet. I do not think it is a case of "configuration makes it too slow", especially given that the load is not that excessive (they are only health checks), but obviously I will test the above suggestion first (which has to be changed anyway).

Note that ALL servers on codfw started receiving a small amount of traffic recently, but only this one showed this symptom.

A bunch of "unauthenticated user" entries in the processlist still makes me suspect the thread pool, since that symptom has been seen on prod slaves with thread_pool_size=16 (but not the immediate all-connections-fail, which is indeed odd).
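
If it helps, a quick processlist breakdown (plain information_schema query, nothing cluster-specific) that makes that pattern visible:

-- Group connections by user, command and state; a large bucket of
-- 'unauthenticated user' rows suggests handshakes are stalling in the
-- thread pool rather than being rejected by grants
SELECT user, command, state, COUNT(*) AS conns
FROM information_schema.processlist
GROUP BY user, command, state
ORDER BY conns DESC;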

$0.02

I've reset the configuration to the puppet defaults, upgraded and restarted the server, and now it seems to work as it should.

The upgrade and reapplication of grants fixed the issue.
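
The exact statements are not recorded in this task, but the reapplication would look roughly like the following; the privilege list, database and host pattern are placeholders, not the real production grants:

-- Hypothetical example only: actual privileges, database and host range
-- are managed elsewhere and are not part of this task
GRANT SELECT, INSERT, UPDATE, DELETE ON `enwiki`.* TO 'wikiuser'@'10.192.%';
SHOW GRANTS FOR 'wikiuser'@'10.192.%';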