db1114 is having more connection errors than all other servers. https://logstash.wikimedia.org/goto/fe57fdbf7cdd60e3e2114ef27fa255b9
{F16917996}
At first I thought it was a grants problem (e.g. grants missing or using old_passwords), so I reloaded the grants. The errors keep happening.
Then I tried increasing the connection pool size, to see if there was too much delay in getting a connection. That doesn't seem to have any effect either.
Needs research, but I don't want to depool it because most connections succeed and execute queries; only a percentage (although a high one) fails to connect.
**UPDATE of what has been seen/debugged**
As there are lots of comments, this is a summary of the facts that we have seen or debugged so far.
- The traffic spikes happen every 10 minutes, almost to the second. These are examples of logged errors:
```
05:20:10
05:30:10
05:40:12
```
This can be seen at: https://logstash.wikimedia.org/goto/7e5f62bb94f1c05ef1d1e29e443ed5fd
- While those spikes happen, the server drops packets. This has been confirmed by looking at the `ifconfig` output after every burst.
- While these errors happen, tcpdump doesn't show any traffic from terbium or tendril, only traffic coming from the mw hosts. The volume is almost double during those seconds, but the traffic itself is apparently "normal".
- These errors only happen on db1114 while it serves API traffic. If the server is removed from API traffic and only serves main traffic, this doesn't happen.
- This server is the only one suffering from this. Neither db1066 nor db1080 (the other enwiki API slaves) suffers from it, not even when db1114 is depooled and they take over its traffic.
- db1114 is running stretch and mariadb 10.1. db1066 and db1080 run jessie and mariadb 10.0
- db1114 has the same schema definition in all its tables as db1066 and db1080.
- No HW errors found on the idrac
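To check the 10-minute cadence against more log samples, something like the following could be used. It is only a sketch: the function names, the 600-second period and the 5-second tolerance are assumptions for illustration, and it assumes all timestamps are from the same day in `HH:MM:SS` format.

```python
from datetime import datetime


def spike_intervals(timestamps):
    """Return the gaps, in seconds, between consecutive HH:MM:SS timestamps."""
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


def is_periodic(timestamps, period=600, tolerance=5):
    """True if every gap is within `tolerance` seconds of `period`."""
    return all(abs(gap - period) <= tolerance for gap in spike_intervals(timestamps))


# The three error timestamps quoted in this task.
samples = ["05:20:10", "05:30:10", "05:40:12"]
print(spike_intervals(samples))  # [600, 602]
print(is_periodic(samples))      # True
```

A regular gap this close to 600 seconds would be consistent with something scheduled (a cron job or periodic client task) rather than organic traffic, although that is not confirmed here.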
Next tests:
[] use NIC #2 instead of #1 and monitor its behaviour
[] reimage the host to jessie
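For the NIC test, the drop counter could be sampled continuously instead of checking `ifconfig` by hand after each burst. This is a rough sketch, assuming a Linux host that exposes per-interface counters under `/sys/class/net/<iface>/statistics/rx_dropped`; the function names are made up for illustration, and the value may not match the `ifconfig` "dropped" figure exactly.

```python
import time
from pathlib import Path


def read_rx_dropped(iface, root="/sys/class/net"):
    """Read the kernel's cumulative RX drop counter for one interface."""
    counter = Path(root) / iface / "statistics" / "rx_dropped"
    return int(counter.read_text().strip())


def drop_deltas(iface, samples, interval=1.0, root="/sys/class/net"):
    """Sample the counter `samples` times; return the increase between samples."""
    readings = []
    for i in range(samples):
        if i:
            time.sleep(interval)  # wait between samples, not before the first
        readings.append(read_rx_dropped(iface, root))
    return [b - a for a, b in zip(readings, readings[1:])]
```

For example, `drop_deltas("eth0", samples=120, interval=1.0)` started shortly before a spike would show in which seconds the drops occur. The interface name `eth0` is an assumption; use whatever name the host gives NIC #1 or #2.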