Hi!
As part of https://phabricator.wikimedia.org/T123675 I re-imaged rdb1005.eqiad with Debian Jessie following this procedure: https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/JobQueue
After updating the Job Runner puppet config I noticed a lot of mediawiki-errors in logstash, all of them triggered by a subset of jobrunners. The main error seems to be:
Could not connect to server rdb100X.eqiad.wmnet:63XX
for various rdb hosts and ports (so not only rdb1005). Syslog on the affected jobrunners shows tons of these:
Mar 18 12:48:17 mw1166 kernel: [ 4173.267651] nf_conntrack: table full, dropping packet
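The kernel logs this message when the connection tracking table reaches its configured maximum. A minimal sketch for checking how full the table is on a host, assuming the nf_conntrack module is loaded and exposes its counters under /proc/sys/net/netfilter (the function name and the proc_dir parameter are only illustrative):

```python
from pathlib import Path

# Standard location of the conntrack sysctls when nf_conntrack is loaded.
PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(proc_dir=PROC_DIR):
    """Return (tracked entries, table limit, fill ratio) for this host."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    limit = int((proc_dir / "nf_conntrack_max").read_text())
    return count, limit, count / limit


if __name__ == "__main__":
    count, limit, ratio = conntrack_usage()
    print(f"conntrack entries: {count}/{limit} ({ratio:.0%})")
```

A ratio close to 1 on the affected jobrunners would be consistent with the "table full, dropping packet" messages.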
After a chat with Moritz we realized that the number of connections on some jobrunners is REALLY high and close to the limits:
elukey@neodymium:~$ sudo -i salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'netstat -tunap | wc -l'
mw1004.eqiad.wmnet: 65819
mw1011.eqiad.wmnet: 58098
mw1009.eqiad.wmnet: 65905
[...]
mw1169.eqiad.wmnet: 212755
mw1167.eqiad.wmnet: 202466
mw1164.eqiad.wmnet: 213466
mw1163.eqiad.wmnet: 189114
More specifically, it seems that only mw116* hosts are affected.
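To make the split between hosts easier to see, the salt output can be filtered with a small script. This is only a sketch: it assumes one "host: count" pair per line, and the 150000 cut-off is an arbitrary illustrative value, not an actual kernel or conntrack limit. The sample data is the excerpt pasted above.

```python
import re

# Hypothetical cut-off chosen for illustration; not a real system limit.
THRESHOLD = 150_000

# Matches lines such as "mw1004.eqiad.wmnet: 65819".
LINE_RE = re.compile(r"^\s*(\S+):\s*(\d+)\s*$")

SAMPLE = """\
mw1004.eqiad.wmnet: 65819
mw1011.eqiad.wmnet: 58098
mw1009.eqiad.wmnet: 65905
[...]
mw1169.eqiad.wmnet: 212755
mw1167.eqiad.wmnet: 202466
mw1164.eqiad.wmnet: 213466
mw1163.eqiad.wmnet: 189114
"""


def parse_counts(text):
    """Map each host to its reported count, skipping lines that do not match."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def busy_hosts(counts, threshold=THRESHOLD):
    """Return the hosts whose count exceeds the threshold, sorted by name."""
    return sorted(host for host, n in counts.items() if n > threshold)


if __name__ == "__main__":
    for host in busy_hosts(parse_counts(SAMPLE)):
        print(host)
```

With the excerpt above, only the four mw116* hosts end up above the cut-off.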