Page MenuHomePhabricator

Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035
Closed, ResolvedPublic

Description

After restarting db1035 (10.64.32.24), connections to that host, one of the 2 main read-only servers of s3 started to fail.

This was supposedly a database/operations problem only- MySQL is cold on restart, and it has to open the table definitions and calculate the innodb statistics for every table (and there are a lot of them on s3).

However, 3 days after the reboot, connection errors still happen and they have also been observed on other hosts. This has led me to believe that there could be (not sure) something more software-related about queue execution, and that the reboot did only amplified an existing problem. The reason for this is because:

  • Only jobs fail, not regular user connections or api connections
  • The spikes come in groups of 500-1000 every hour
{
  "_index": "logstash-2015.07.27",
  "_type": "mediawiki",
  "_id": "AU7Q44-0f2hURINuOV0X",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.64.48.22: Can't connect to MySQL server on '10.64.48.22' (4)",
    "@version": 1,
    "@timestamp": "2015-07-27T19:00:51.123Z",
    "type": "mediawiki",
    "host": "mw1012",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es",
      "normalized_message_untrimmed"
    ],
    "channel": "wfLogDBError",
    "url": "/rpc/RunJobs.php?wiki=nowiki&type=cirrusSearchIncomingLinkCount&maxtime=30&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "uid": "34e49b2",
    "process_id": 31571,
    "wiki": "nowiki",
    "db_server": "10.64.48.22",
    "db_name": "nowiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": "Can't connect to MySQL server on '10.64.48.22' (4)",
    "normalized_message": "Error connecting to 10.64.48.22: Can't connect to MySQL server on '10.64.48.22' (4)"
  },
  "sort": [
    1438023651123
  ]
}

Event Timeline

jcrespo created this task.Jul 27 2015, 7:24 PM
jcrespo claimed this task.
jcrespo raised the priority of this task from to Normal.
jcrespo updated the task description. (Show Details)
jcrespo added projects: acl*sre-team, DBA.
jcrespo added a subscriber: jcrespo.
Restricted Application added subscribers: Matanya, Aklapper. · View Herald TranscriptJul 27 2015, 7:24 PM
jcrespo updated the task description. (Show Details)Jul 27 2015, 7:25 PM
jcrespo set Security to None.
jcrespo added a subscriber: Springle.

All errors seem caused by /rpc/RunJobs.php, but with no a specific parameter in particular: https://logstash.wikimedia.org/#dashboard/temp/AU7Q-z73a9yoTVgv1gpk

The large spikes are no longer, but the issue persist almost over all shards, with much lower numbers: 17-2 per hours and server.

There is some timeout that forces things getting worse for a recently-restarted server.

jcrespo moved this task from Triage to In progress on the DBA board.Jul 28 2015, 1:20 PM
jcrespo added a subscriber: aaron.Jul 28 2015, 2:32 PM

I can confirm that this is no longer an s3 only problem. It is affecting s1 and s4, too.

jcrespo moved this task from In progress to Backlog on the DBA board.Aug 27 2015, 1:40 PM

@akosiaris @faidon, only CCing you to this ticket to actually discard the physical rack cause of the issues, as my first detection (this ticket) was on a complete separate server. So this is probably just software/query pattern related, and only happens to happen more on that row because it happens to have more traffic than the others.

T111992 And this confirms my suspicious about the traffic.

jcrespo added a comment.EditedOct 25 2015, 12:08 PM

db1035 continues having issues:

Compare db1044 (same shard, hardware and weight):

to db1035:

jcrespo renamed this task from Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" to Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035.Oct 29 2015, 10:38 AM

This has not happened since I upgraded and defragemented on db1035, but it is too soon to say something. Monitor periodically: https://logstash.wikimedia.org/#dashboard/temp/AVEP7sdtptxhN1XalRaC

jcrespo lowered the priority of this task from Normal to Low.Nov 24 2015, 2:49 PM

Nah, it is still happening, although in a smaller sizes (spike of 338 requests on db1035 at 15:05 today).

The issue persists.

Krinkle renamed this task from Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035 to Job runners throwing "Can't connect to MySQL server" - mainly on db1035.Dec 15 2015, 10:38 PM
Krinkle added a subscriber: Krinkle.

Also happening about once every few minutes for parsercache servers

Last 3 hours in requests to /rpc/RunJobs.php on mediawiki-errors dashboard in logstash:

12x Error connecting to 10.64.16.157: Can't connect to MySQL server on '10.64.16.157' (4)	 - pc1002
12x Error connecting to 10.64.16.156: Can't connect to MySQL server on '10.64.16.156' (4)	 - pc1001
11x Error connecting to 10.64.16.158: Can't connect to MySQL server on '10.64.16.158' (4)	 - pc1003

@Krinke, I do not think it is related, it should be filed as a different bug. There is an increase of 3x the traffic on pc servers since 2015-12-15 ~17h.

jcrespo renamed this task from Job runners throwing "Can't connect to MySQL server" - mainly on db1035 to Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - mainly on db1035.Dec 16 2015, 9:11 AM
jcrespo closed this task as Resolved.Dec 16 2015, 10:25 AM

This issue is solved, there was something operational wrong and that was solved after upgrading/fixing the wikidata-core bug, there is now an ongoing, different issue due to too many connections.