
nf_conntrack: table full errors on Eqiad Job Runners
Closed, Resolved (Public)

Description

Hi!

As part of https://phabricator.wikimedia.org/T123675 I re-imaged rdb1005.eqiad with Debian Jessie following this procedure: https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/JobQueue

After updating the Job Runner puppet config I noticed a lot of mediawiki-errors from logstash, all of them triggered by a subset of jobrunners. The main error seems to be:

Could not connect to server rdb100X.eqiad.wmnet:63XX

for various rdb hosts and ports (so not only rdb1005). Syslog on the affected jobrunners shows tons of these:

Mar 18 12:48:17 mw1166 kernel: [ 4173.267651] nf_conntrack: table full, dropping packet
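
The kernel logs this message when the connection tracking table has reached its limit. A minimal sketch for seeing how close a host is to that limit follows; the function name `conntrack_usage_pct` is hypothetical and not part of any tool used in this task.

```shell
# Print conntrack table usage as an integer percentage.
# $1 = current number of tracked connections, $2 = table limit.
# (Hypothetical helper, for illustration only.)
conntrack_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}
```

Assuming the nf_conntrack module is loaded, it could be fed from procfs: `conntrack_usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"`.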

After a chat with Moritz we realized that the number of connections on some jobrunners is REALLY high and close to the limits:

elukey@neodymium:~$ sudo -i salt -C 'G@cluster:jobrunner and G@site:eqiad' cmd.run 'netstat -tunap | wc -l'

mw1004.eqiad.wmnet:
    65819
mw1011.eqiad.wmnet:
    58098
mw1009.eqiad.wmnet:
    65905
[...]
mw1169.eqiad.wmnet:
    212755
mw1167.eqiad.wmnet:
    202466
mw1164.eqiad.wmnet:
    213466
mw1163.eqiad.wmnet:
    189114

More specifically, it seems that only mw116* hosts are affected.

Event Timeline

elukey@mw1163:~$ sudo netstat -tunap | awk '{print $6}' | sort | uniq -c
     14 -
      1 1073/python
      1 1152/hhvm
      5 1205/rsyslogd
      8 1964/ntpd
      1 25621/php
      1 36353/php
      4 867/rpcbind
      1 established)
    315 ESTABLISHED
      1 FIN_WAIT2
      1 Foreign
     13 LISTEN
     14 SYN_SENT
 213719 TIME_WAIT
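
The `awk '{print $6}'` tally above also counts header words and the PID/program column of UDP rows, which have no state field. A sketch of a cleaner per-state count, assuming `netstat -tan` style input on stdin (the function name `count_tcp_states` is made up for this example):

```shell
# Count TCP sockets per state from `netstat -tan` style output on stdin.
# Only rows whose first field starts with "tcp" are kept, so the two
# header lines and any UDP rows are skipped.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { states[$6]++ }
         END { for (s in states) print states[s], s }' | sort -rn
}
```

Possible usage: `sudo netstat -tan | count_tcp_states`.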
elukey@mw1161:~$ sudo netstat -tuap | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c
      8 2620:0:861:101:10::6379
      8 2620:0:861:103:10::6379
    688 db1015.eqiad.wmne:mysql
    179 db1018.eqiad.wmne:mysql
      5 db1019.eqiad.wmne:mysql
    285 db1022.eqiad.wmne:mysql
    147 db1023.eqiad.wmne:mysql
    456 db1024.eqiad.wmne:mysql
      7 db1026.eqiad.wmne:mysql
      1 db1027.eqiad.wmne:mysql
      1 db1028.eqiad.wmne:mysql
     11 db1029.eqiad.wmne:mysql
     10 db1031.eqiad.wmne:mysql
    110 db1033.eqiad.wmne:mysql
      3 db1034.eqiad.wmne:mysql
   2901 db1035.eqiad.wmne:mysql
      6 db1036.eqiad.wmne:mysql
      6 db1037.eqiad.wmne:mysql
    138 db1038.eqiad.wmne:mysql
    796 db1039.eqiad.wmne:mysql
    336 db1040.eqiad.wmne:mysql
    621 db1041.eqiad.wmne:mysql
   2938 db1044.eqiad.wmne:mysql
     88 db1045.eqiad.wmne:mysql
    631 db1050.eqiad.wmne:mysql
    196 db1051.eqiad.wmne:mysql
    156 db1052.eqiad.wmne:mysql
    193 db1055.eqiad.wmne:mysql
    198 db1056.eqiad.wmne:mysql
    259 db1057.eqiad.wmne:mysql
    625 db1058.eqiad.wmne:mysql
    191 db1059.eqiad.wmne:mysql
    749 db1060.eqiad.wmne:mysql
    828 db1061.eqiad.wmne:mysql
   1400 db1062.eqiad.wmne:mysql
   1328 db1063.eqiad.wmne:mysql
    456 db1064.eqiad.wmne:mysql
    214 db1065.eqiad.wmne:mysql
    213 db1066.eqiad.wmne:mysql
   1586 db1067.eqiad.wmne:mysql
    408 db1068.eqiad.wmne:mysql
    751 db1070.eqiad.wmne:mysql
    746 db1071.eqiad.wmne:mysql
    383 db1072.eqiad.wmne:mysql
    398 db1073.eqiad.wmne:mysql
      6 es1011.eqiad.wmne:mysql
      5 es1013.eqiad.wmne:mysql
      6 es1014.eqiad.wmne:mysql
      5 es1017.eqiad.wmne:mysql
      1 es1019.eqiad.wmne:mysql
    155 kafka1012.eqiad.wm:9092
    129 kafka1013.eqiad.wm:9092
    130 kafka1014.eqiad.wm:9092
    139 kafka1018.eqiad.wm:9092
    151 kafka1020.eqiad.wm:9092
    115 kafka1022.eqiad.wm:9092
      1 localhost:35480
      1 localhost:38590
      1 localhost:45139
      1 localhost:45670
      1 localhost:45676
      1 localhost:48779
      1 localhost:48783
      1 localhost:52671
      1 localhost:53247
      1 localhost:59883
      1 localhost:59890
  15935 localhost:9000
      2 localhost:9001
  15930 localhost:9005
      2 ms-fe.svc.eqiad.wm:http
      4 neodymium.eqiad.wm:4506
      1 neon.wikimedia.or:18313
      1 neon.wikimedia.or:56708
    137 pc1004.eqiad.wmne:mysql
    142 pc1005.eqiad.wmne:mysql
    135 pc1006.eqiad.wmne:mysql
  15063 rdb1001.eqiad.wmne:6379
  15069 rdb1001.eqiad.wmne:6380
  15083 rdb1001.eqiad.wmne:6381
  14615 rdb1003.eqiad.wmne:6379
  14602 rdb1003.eqiad.wmne:6380
  14502 rdb1003.eqiad.wmne:6381
  14417 rdb1005.eqiad.wmne:6379
  14404 rdb1005.eqiad.wmne:6380
  14455 rdb1005.eqiad.wmne:6381
  14382 rdb1007.eqiad.wmne:6379
  13559 rdb1007.eqiad.wmne:6380
  14120 rdb1007.eqiad.wmne:6381
    253 restbase.svc.eqiad:7231
    406 search.svc.codfw.w:9200
    418 search.svc.eqiad.w:9200

elukey@mw1161:~$ sudo netstat -tuap | grep TIME_WAIT | wc -l
154136
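
Because netstat truncates long host names in the listing above, grouping by remote port on numeric output (`-n`) can make the distribution easier to read. A sketch under that assumption; `time_wait_by_port` is a hypothetical name:

```shell
# Count TIME_WAIT sockets per remote port from `netstat -tan` output on stdin.
# The port is taken as the last colon-separated piece of the foreign address,
# which works for both IPv4 (10.0.0.2:6379) and IPv6 (::1:9000) forms.
time_wait_by_port() {
    awk '$1 ~ /^tcp/ && $6 == "TIME_WAIT" {
             n = split($5, parts, ":")
             ports[parts[n]]++
         }
         END { for (p in ports) print ports[p], p }' | sort -rn
}
```

Possible usage: `sudo netstat -tan | time_wait_by_port | head`.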

Change 278286 had a related patch set uploaded (by Elukey):
Enable persistent connections between Job Queues and Job Runners.

https://gerrit.wikimedia.org/r/278286

Change 278286 merged by jenkins-bot:
Enable persistent connections between Job Queues and Job Runners.

https://gerrit.wikimedia.org/r/278286

Yes, I'd say it's pretty clear we're seeing an issue in how redis persistent connections are handled by HHVM 3.12.

I've merged a puppet change to bump the connection table on the job runners to 512k (it's only effective with the next reboot, but I bumped the value manually when re-enabling puppet on mw1161-mw1169).

Mentioned in SAL [2016-03-18T16:30:47Z] <moritzm> bumped connection tracking table size on mw1161-mw1169 to 524288 to cope with currently elevated connections on those (T130364)
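
For reference, 512k entries is 512 * 1024 = 524288. The thread does not say which knob was changed; the sketch below assumes it was the `net.netfilter.nf_conntrack_max` sysctl and only prints the corresponding setting line rather than applying it.

```shell
# 512k entries, matching the value in the log entry: 512 * 1024 = 524288.
new_max=$((512 * 1024))

# Print a sysctl-style line. Assumption: the limit is controlled through
# net.netfilter.nf_conntrack_max; the actual puppet change may differ.
echo "net.netfilter.nf_conntrack_max = ${new_max}"
```

Under that assumption, the value could be applied at runtime with `sysctl -w net.netfilter.nf_conntrack_max=524288` (as root), or made persistent by placing the printed line in a file under `/etc/sysctl.d/`.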

Change 278319 had a related patch set uploaded (by Ori.livneh):
Reduce the number of jobrunner procs on mw11* hosts

https://gerrit.wikimedia.org/r/278319

Change 278319 merged by Ori.livneh:
Reduce the number of jobrunner procs on mw11* hosts

https://gerrit.wikimedia.org/r/278319

Change 278326 had a related patch set uploaded (by Ori.livneh):
Adjust port range of ferm rule for redis on app servers

https://gerrit.wikimedia.org/r/278326

Change 278327 had a related patch set uploaded (by Ori.livneh):
Enable reuse of sockets in TIME_WAIT state on all app servers

https://gerrit.wikimedia.org/r/278327

Change 278326 merged by Ori.livneh:
Adjust port range of ferm rule for redis on app servers

https://gerrit.wikimedia.org/r/278326

Change 278327 merged by Ori.livneh:
Enable reuse of sockets in TIME_WAIT state on all app servers

https://gerrit.wikimedia.org/r/278327

ori claimed this task.