There is a high number of connection errors to 10.64.16.144 (db1049, the s5 master), probably caused by a high number of connections. For example:
{
  "_index": "logstash-2016.09.19",
  "_type": "mediawiki",
  "_id": "AVdD7nZU2B4w3SKhWGu0",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.64.16.144: Can't connect to MySQL server on '10.64.16.144' (4)",
    "@version": 1,
    "@timestamp": "2016-09-19T19:31:23.000Z",
    "type": "mediawiki",
    "host": "mw1299",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "wfLogDBError",
    "normalized_message": "Error connecting to {db_server}: {error}",
    "url": "/rpc/RunJobs.php?wiki=dewiki&type=wikibase-addUsagesForPage&maxtime=60&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "wiki": "dewiki",
    "mwversion": "1.28.0-wmf.18",
    "reqId": "a7a0b44622b72ad31d160f12",
    "db_server": "10.64.16.144",
    "db_name": "dewiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": "Can't connect to MySQL server on '10.64.16.144' (4)"
  },
  "fields": {
    "@timestamp": [
      1474313483000
    ]
  },
  "highlight": {
    "channel.raw": [
      "@kibana-highlighted-field@wfLogDBError@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1474313483000
  ]
}
This job could be the cause, or it could be just a consequence, given that the job is very common.
Here is a sample of errors: https://logstash.wikimedia.org/goto/11cd5759017b61d371035de09a41531c
The number was already high before, but at 17:00 UTC today there was a spike of 2000 errors in 5 minutes, followed by a continuous tail of ~100 errors per 5 minutes. This could be just a spike in activity that will disappear, or it could be something substantial (a change in code patterns).
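To reproduce the per-5-minute counts from an exported set of log hits, something like the following could be used. This is only a sketch: it assumes the hits are available as a list of dicts shaped like the sample entry above, and the helper names `bucket_start` and `errors_per_bucket` are hypothetical, not part of any existing tooling.

```python
from collections import Counter
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60  # 5-minute windows, matching the counts quoted above


def bucket_start(timestamp):
    """Round an "@timestamp" like 2016-09-19T19:31:23.000Z down to its 5-minute window."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    epoch = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)


def errors_per_bucket(hits, db_server):
    """Count wfLogDBError entries for one db_server per 5-minute window.

    `hits` is assumed to be a list of dicts with a "_source" key,
    shaped like the sample log entry (hypothetical input format).
    """
    counts = Counter()
    for hit in hits:
        source = hit["_source"]
        if source.get("channel") == "wfLogDBError" and source.get("db_server") == db_server:
            counts[bucket_start(source["@timestamp"])] += 1
    return dict(sorted(counts.items()))
```

Comparing the window that starts at 17:00 UTC with the following windows would show whether the tail is decaying or staying flat.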
75% of current database errors are connection errors to this server, which is not normal. Looking at https://grafana-admin.wikimedia.org/dashboard/db/mysql?from=1473709162759&to=1474313962759&var-dc=eqiad%20prometheus%2Fops&var-server=db1049 I can see a pattern change between 06:00 and 08:00 UTC today and a sharp increase at 17:00 UTC, but I see nothing strange on the infrastructure side.
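The share of database errors attributable to one server can be checked in a similar way. Again a sketch under assumptions: the input is a list of hits shaped like the sample entry, "database errors" is taken to mean everything in the `wfLogDBError` channel, and `connection_error_share` is a hypothetical helper name.

```python
def connection_error_share(hits, db_server):
    """Return the fraction of wfLogDBError entries that are connection errors to db_server.

    Assumes connection errors are the entries whose normalized_message
    starts with "Error connecting to", as in the sample entry.
    """
    db_errors = [
        hit["_source"] for hit in hits
        if hit["_source"].get("channel") == "wfLogDBError"
    ]
    if not db_errors:
        return 0.0
    matching = sum(
        1 for source in db_errors
        if source.get("db_server") == db_server
        and source.get("normalized_message", "").startswith("Error connecting to")
    )
    return matching / len(db_errors)
```

Running this over a window from before 17:00 UTC and one from after would make it easier to tell whether the share for db1049 changed or whether all database errors rose together.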