HAProxy 2.6.10 crashing in the text cluster
Closed, Resolved, Public


HAProxy is crashing at random in the text cluster. When a crash happens, the following line appears in the journal log for HAProxy:

Mar 22 14:44:48 cp1077 haproxy[1570256]: [ALERT]    (1570256) : A bogus STREAM [0x7f46f92a3480] is spinning at 142995 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f46f92a3480,14284a src=REDACTED fe=tls be=tls dst=backend_server_7 txn=0x7f46f8bd1bd0,40000 txn.req=MSG_ERROR,d txn.rsp=MSG_DONE,d rqf=c4a064 rqa=0 rpf=c0048000 rpa=0 scf=0x7f46f862f0b0,EST,0 scb=0x7f46f8a4cf90,EST,1 af=(nil),0 sab=(nil),0 cof=0x7f46e0f4bc50,801c0300:H2(0x7f46f8a01220)/SSL(0x7f46f8510d90)/tcpv4(20487) cob=0x7f47045453e0,80300:H1(0x7f46e0ec2320)/RAW((nil))/unix_stream(30778) filters={}]
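
When comparing crashes across hosts, it can help to pull a few fields out of this alert line. The following is a minimal sketch, not HAProxy tooling: the function name `parse_bogus_stream_alert` and the selection of fields are assumptions made for illustration, and the patterns have only been checked against the line quoted above.

```python
import re

# Matches the spin rate reported in the "bogus STREAM" alert message.
RATE_RE = re.compile(r"is spinning at (\d+) calls per second")
# Matches a handful of key=value pairs from the trailing [...] block.
FIELD_RE = re.compile(r"\b(fe|be|dst|txn\.req|txn\.rsp)=([^\s,\]]+)")


def parse_bogus_stream_alert(line):
    """Return the spin rate and selected stream fields, or None if no match."""
    rate = RATE_RE.search(line)
    if rate is None:
        return None
    fields = {"calls_per_second": int(rate.group(1))}
    for key, value in FIELD_RE.findall(line):
        fields[key] = value
    return fields
```

Applied to the line above, this would report the frontend and backend names (`fe`, `be`), the destination server (`dst`) and the request/response message states, which is usually enough to see whether crashes share a pattern.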

Apparently every text node is affected, while the upload cluster has not seen a single crash:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text' 'fgrep haproxy /var/log/kern.log |wc -l'
48 hosts will be targeted:
OK to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit: 48
===== NODE GROUP =====                                                                                                                                                            
(1) cp5019.eqsin.wmnet                                                                                                                                                            
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(5) cp[6009-6010,6016].drmrs.wmnet,cp[3050,3058].esams.wmnet                                                                                                                      
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(2) cp[4038,4040].ulsfo.wmnet                                                                                                                                                     
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(6) cp[2033,2035,2037,2039].codfw.wmnet,cp[4041,4044].ulsfo.wmnet                                                                                                                 
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(5) cp[2029,2041].codfw.wmnet,cp[4037,4039,4043].ulsfo.wmnet                                                                                                                      
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(3) cp1089.eqiad.wmnet,cp5024.eqsin.wmnet,cp3052.esams.wmnet                                                                                                                      
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(6) cp[2027,2031].codfw.wmnet,cp1083.eqiad.wmnet,cp[5020-5021].eqsin.wmnet,cp4042.ulsfo.wmnet                                                                                     
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(5) cp1075.eqiad.wmnet,cp[5017,5022].eqsin.wmnet,cp[3056,3062].esams.wmnet                                                                                                        
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(7) cp[6013-6015].drmrs.wmnet,cp[1077,1081,1085].eqiad.wmnet,cp3054.esams.wmnet                                                                                                   
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
===== NODE GROUP =====                                                                                                                                                            
(8) cp[6011-6012].drmrs.wmnet,cp[1079,1087].eqiad.wmnet,cp[5018,5023].eqsin.wmnet,cp[3060,3064].esams.wmnet                                                                       
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [00:03<00:00, 14.16hosts/s]
FAIL |                                                                                                                                           |   0% (0/48) [00:03<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'fgrep haproxy /v.../kern.log |wc -l'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-upload' 'fgrep haproxy /var/log/kern.log |wc -l'
48 hosts will be targeted:
OK to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit: 48
===== NODE GROUP =====                                                                                                                                                            
(48) cp[2028,2030,2032,2034,2036,2038,2040,2042].codfw.wmnet,cp[6001-6008].drmrs.wmnet,cp[1076,1078,1080,1082,1084,1086,1088,1090].eqiad.wmnet,cp[5025-5032].eqsin.wmnet,cp[3051,3
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       

Event Timeline

Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2023-03-22T16:35:40Z] <vgutierrez> rolling downgrade to HAProxy 2.6.9 in text@esams - T332796

Gathering data on esams after downgrading:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text_esams' 'apt-cache policy haproxy|grep Installed'
8 hosts will be targeted:
OK to proceed on 8 hosts? Enter the number of affected hosts to confirm or "q" to quit: 8
===== NODE GROUP =====                                                                                                                                                            
(8) cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet                                                                                                                       
----- OUTPUT of 'apt-cache policy...y|grep Installed' -----                                                                                                                       
  Installed: 2.6.9-1~bpo10+1
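
The same check can be scripted when the output needs to be compared against the affected release. This is a minimal sketch under stated assumptions: the helper names are hypothetical, and treating only 2.6.10 as affected follows the task title rather than any upstream statement quoted here.

```python
import re

# Matches the "Installed:" line printed by `apt-cache policy <package>`.
INSTALLED_RE = re.compile(r"^\s*Installed:\s*(\S+)", re.MULTILINE)


def installed_upstream_version(policy_output):
    """Return the upstream version (e.g. '2.6.9'), or None if not installed."""
    match = INSTALLED_RE.search(policy_output)
    if match is None or match.group(1) == "(none)":
        return None
    full = match.group(1)
    # Drop an optional epoch ("1:") and the Debian revision ("-1~bpo10+1").
    without_epoch = full.split(":", 1)[-1]
    return without_epoch.rsplit("-", 1)[0]


def is_affected(policy_output, bad_version="2.6.10"):
    """True if the installed upstream version equals the assumed bad one."""
    return installed_upstream_version(policy_output) == bad_version
```

For the output above, `installed_upstream_version` yields `2.6.9`, so `is_affected` is false for the esams text hosts.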

Mentioned in SAL (#wikimedia-operations) [2023-03-23T08:04:26Z] <vgutierrez> rolling rollback to HAProxy 2.6.9 in cache text cluster - T332796

Mentioned in SAL (#wikimedia-operations) [2023-03-23T08:11:26Z] <vgutierrez> testing HAProxy 2.6.11 in cp4044 - T332796

cp4044 was missing the haproxy 2.6.9 package in /var/cache/apt/archives, and 2.6.11 has since been released, so I'm testing 2.6.11 there right now.

Change 902303 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::haproxy,prometheus: Track unexpected restarts

Change 902310 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] prometheus: Track systemd service restarts on >= bullseye

Change 902310 merged by Vgutierrez:

[operations/puppet@production] prometheus: Track systemd service restarts on >= bullseye

Change 902303 abandoned by Vgutierrez:

[operations/puppet@production] cache::haproxy,prometheus: Track unexpected restarts


Followed the approach suggested by Filippo in I29c3b6c0ce13ce09f12f2bff3603f3d1d0df2dd1

Change 902312 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/alerts@master] traffic: Add HAProxyRestarted alert

Change 902312 merged by Vgutierrez:

[operations/alerts@master] traffic: Add HAProxyRestarted alert

Change 902319 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/alerts@master] traffic: Fix HAProxyRestarted dashboard URL

Mentioned in SAL (#wikimedia-operations) [2023-03-23T11:47:17Z] <vgutierrez> rolling rollback to HAProxy 2.6.9 in cache upload cluster - T332796

Vgutierrez lowered the priority of this task from High to Medium. (Mar 23 2023, 11:57 AM)

A change introduced in HAProxy 2.6.10 has been identified as the potential culprit by upstream. Text is already running 2.6.9, and I'm currently downgrading upload for consistency's sake, even though the bug isn't triggered there.

Change 902319 merged by Vgutierrez:

[operations/alerts@master] traffic: Fix HAProxyRestarted dashboard URL

Alerting is now in place to catch this kind of issue on future HAProxy updates, and every cache host is back on 2.6.9:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'apt-cache policy haproxy|grep Installed'
96 hosts will be targeted:
OK to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit: 96
===== NODE GROUP =====                                                                                                                                                            
(96) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet                  
----- OUTPUT of 'apt-cache policy...y|grep Installed' -----                                                                                                                       
  Installed: 2.6.9-1~bpo10+1                                                                                                                                                      
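
The restart alerting mentioned above comes down to noticing when a service's restart counter increases between two samples. Below is a minimal sketch of that comparison, assuming the counter is read from systemd's `NRestarts` unit property as printed by `systemctl show -p NRestarts <unit>`. The function names are hypothetical, and this is not the HAProxyRestarted rule merged in operations/alerts, which is not quoted in this task.

```python
def parse_nrestarts(show_output):
    """Parse 'NRestarts=<n>' as printed by `systemctl show -p NRestarts`."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "NRestarts":
            return int(value.strip())
    raise ValueError("NRestarts not found in systemctl output")


def restarted_between(previous, current):
    """True if the restart counter increased between two samples."""
    return current > previous
```

A caller would sample the counter periodically, keep the previous value, and raise an alert whenever `restarted_between` returns true.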

Mentioned in SAL (#wikimedia-operations) [2023-03-23T15:12:08Z] <vgutierrez> testing haproxy_2.6.11-1~bpo11+wmf2_amd64.deb in text@ulsfo - T332796

HAProxy 2.6.12 has been released, including the patch that we've been testing in text@ulsfo.