There is high number of connection errors to 10.64.16.144 (db1049, or s5-master) caused probably by high number of connections such as:
{ "_index": "logstash-2016.09.19", "_type": "mediawiki", "_id": "AVdD7nZU2B4w3SKhWGu0", "_score": null, "_source": { "message": "Error connecting to 10.64.16.144: Can't connect to MySQL server on '10.64.16.144' (4)", "@version": 1, "@timestamp": "2016-09-19T19:31:23.000Z", "type": "mediawiki", "host": "mw1299", "level": "ERROR", "tags": [ "syslog", "es", "es" ], "channel": "wfLogDBError", "normalized_message": "Error connecting to {db_server}: {error}", "url": "/rpc/RunJobs.php?wiki=dewiki&type=wikibase-addUsagesForPage&maxtime=60&maxmem=300M", "ip": "127.0.0.1", "http_method": "POST", "server": "127.0.0.1", "referrer": null, "wiki": "dewiki", "mwversion": "1.28.0-wmf.18", "reqId": "a7a0b44622b72ad31d160f12", "db_server": "10.64.16.144", "db_name": "dewiki", "db_user": "wikiuser", "method": "DatabaseMysqlBase::open", "error": "Can't connect to MySQL server on '10.64.16.144' (4)" }, "fields": { "@timestamp": [ 1474313483000 ] }, "highlight": { "channel.raw": [ "@kibana-highlighted-field@wfLogDBError@/kibana-highlighted-field@" ] }, "sort": [ 1474313483000 ] }
This could be the cause or just a consequence because this job is very common.
Here is a sample of errors: https://logstash.wikimedia.org/goto/11cd5759017b61d371035de09a41531c
The number was high before, but at 17:00 UTC today there was a spike of 2000 errors in 5 minutes, following by a continuous tail of ~100 errors/5 minutes. This could be just a spike on activity that will disappear, or could be something substantial (code pattern change).
75% of current database errors are connection errors to this server (not normal). By looking at https://grafana-admin.wikimedia.org/dashboard/db/mysql?from=1473709162759&to=1474313962759&var-dc=eqiad%20prometheus%2Fops&var-server=db1049 I can see there is a pattern change at 6-8 UTC today, and increasing highly at 17 UTC, but see nothing strange at infrastructure side.