HAProxy is crashing intermittently on the text cluster. When a crash happens, the following line appears in the journal log for HAProxy:
Mar 22 14:44:48 cp1077 haproxy[1570256]: [ALERT] (1570256) : A bogus STREAM [0x7f46f92a3480] is spinning at 142995 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f46f92a3480,14284a src=REDACTED fe=tls be=tls dst=backend_server_7 txn=0x7f46f8bd1bd0,40000 txn.req=MSG_ERROR,d txn.rsp=MSG_DONE,d rqf=c4a064 rqa=0 rpf=c0048000 rpa=0 scf=0x7f46f862f0b0,EST,0 scb=0x7f46f8a4cf90,EST,1 af=(nil),0 sab=(nil),0 cof=0x7f46e0f4bc50,801c0300:H2(0x7f46f8a01220)/SSL(0x7f46f8510d90)/tcpv4(20487) cob=0x7f47045453e0,80300:H1(0x7f46e0ec2320)/RAW((nil))/unix_stream(30778) filters={}]
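The alert above packs the spin rate and a dump of the stream state into a single line, which makes it awkward to compare across hosts. A minimal Python sketch of pulling out the fields follows; the `parse_alert` helper is hypothetical, the sample string is a shortened copy of the line above, and the sketch only splits the `key=value` tokens without interpreting what the flags mean.

```python
import re

# Shortened copy of the alert from this report; only a few dump fields are kept.
ALERT = (
    "[ALERT] (1570256) : A bogus STREAM [0x7f46f92a3480] is spinning at "
    "142995 calls per second and refuses to die, aborting now! "
    "[strm=0x7f46f92a3480,14284a src=REDACTED fe=tls be=tls "
    "dst=backend_server_7 txn.req=MSG_ERROR,d txn.rsp=MSG_DONE,d filters={}]"
)


def parse_alert(line):
    """Hypothetical helper: return the spin rate plus the key=value dump fields."""
    rate = re.search(r"spinning at (\d+) calls per second", line)
    dump = re.search(r"\[(strm=.*)\]", line)
    if rate is None or dump is None:
        raise ValueError("not a 'bogus STREAM' alert line")
    # The dump is whitespace-separated key=value tokens; values are kept as raw strings.
    fields = dict(tok.split("=", 1) for tok in dump.group(1).split() if "=" in tok)
    return {"calls_per_second": int(rate.group(1)), **fields}


info = parse_alert(ALERT)
print(info["calls_per_second"], info["fe"], info["be"], info["txn.req"], info["txn.rsp"])
# prints: 142995 tls tls MSG_ERROR,d MSG_DONE,d
```

Running this over the alerts collected from several hosts would show whether `txn.req=MSG_ERROR` with `txn.rsp=MSG_DONE` is common to every crash or specific to this one.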
Every text node appears to be affected, while the upload cluster has not logged a single crash:
vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text' 'fgrep haproxy /var/log/kern.log |wc -l'
48 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4037-4044].ulsfo.wmnet
OK to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit: 48
===== NODE GROUP =====
(1) cp5019.eqsin.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
10
===== NODE GROUP =====
(5) cp[6009-6010,6016].drmrs.wmnet,cp[3050,3058].esams.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
6
===== NODE GROUP =====
(2) cp[4038,4040].ulsfo.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
1
===== NODE GROUP =====
(6) cp[2033,2035,2037,2039].codfw.wmnet,cp[4041,4044].ulsfo.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
3
===== NODE GROUP =====
(5) cp[2029,2041].codfw.wmnet,cp[4037,4039,4043].ulsfo.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
2
===== NODE GROUP =====
(3) cp1089.eqiad.wmnet,cp5024.eqsin.wmnet,cp3052.esams.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
8
===== NODE GROUP =====
(6) cp[2027,2031].codfw.wmnet,cp1083.eqiad.wmnet,cp[5020-5021].eqsin.wmnet,cp4042.ulsfo.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
4
===== NODE GROUP =====
(5) cp1075.eqiad.wmnet,cp[5017,5022].eqsin.wmnet,cp[3056,3062].esams.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
9
===== NODE GROUP =====
(7) cp[6013-6015].drmrs.wmnet,cp[1077,1081,1085].eqiad.wmnet,cp3054.esams.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
5
===== NODE GROUP =====
(8) cp[6011-6012].drmrs.wmnet,cp[1079,1087].eqiad.wmnet,cp[5018,5023].eqsin.wmnet,cp[3060,3064].esams.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
7
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [00:03<00:00, 14.16hosts/s]
FAIL | | 0% (0/48) [00:03<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'fgrep haproxy /v.../kern.log |wc -l'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-upload' 'fgrep haproxy /var/log/kern.log |wc -l'
48 hosts will be targeted:
cp[2028,2030,2032,2034,2036,2038,2040,2042].codfw.wmnet,cp[6001-6008].drmrs.wmnet,cp[1076,1078,1080,1082,1084,1086,1088,1090].eqiad.wmnet,cp[5025-5032].eqsin.wmnet,cp[3051,3053,3055,3057,3059,3061,3063,3065].esams.wmnet,cp[4045-4052].ulsfo.wmnet
OK to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit: 48
===== NODE GROUP =====
(48) cp[2028,2030,2032,2034,2036,2038,2040,2042].codfw.wmnet,cp[6001-6008].drmrs.wmnet,cp[1076,1078,1080,1082,1084,1086,1088,1090].eqiad.wmnet,cp[5025-5032].eqsin.wmnet,cp[3051,3053,3055,3057,3059,3061,3063,3065].esams.wmnet,cp[4045-4052].ulsfo.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----
0
================
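To put numbers on the text versus upload comparison, the per-group results from the cumin output above can be tallied. The sketch below hard-codes the (line count, hosts in group) pairs transcribed from that output; the `summarize` helper is hypothetical. Note that the command counts `kern.log` lines mentioning haproxy, which is assumed here to track crashes but is not necessarily one line per crash.

```python
# (matching kern.log lines per host, number of hosts in that cumin node group),
# transcribed by hand from the two cumin runs in this report.
TEXT_GROUPS = [(10, 1), (6, 5), (1, 2), (3, 6), (2, 5),
               (8, 3), (4, 6), (9, 5), (5, 7), (7, 8)]
UPLOAD_GROUPS = [(0, 48)]


def summarize(groups):
    """Hypothetical helper: return (total hosts, hosts with >= 1 line, total lines)."""
    hosts = sum(n for _, n in groups)
    affected = sum(n for count, n in groups if count > 0)
    lines = sum(count * n for count, n in groups)
    return hosts, affected, lines


print("text:  ", summarize(TEXT_GROUPS))    # text:   (48, 48, 254)
print("upload:", summarize(UPLOAD_GROUPS))  # upload: (48, 0, 0)
```

All 48 text hosts have at least one matching line (254 in total, between 1 and 10 per host), against zero on all 48 upload hosts.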