HAProxy 2.6.10 crashing in the text cluster
Closed, Resolved, Public

Assigned To

Authored By

	Vgutierrez
	Mar 22 2023, 4:14 PM

Description

HAProxy is randomly crashing in the text cluster. The following line appears in the HAProxy journal log when a crash happens:

Mar 22 14:44:48 cp1077 haproxy[1570256]: [ALERT]    (1570256) : A bogus STREAM [0x7f46f92a3480] is spinning at 142995 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x7f46f92a3480,14284a src=REDACTED fe=tls be=tls dst=backend_server_7 txn=0x7f46f8bd1bd0,40000 txn.req=MSG_ERROR,d txn.rsp=MSG_DONE,d rqf=c4a064 rqa=0 rpf=c0048000 rpa=0 scf=0x7f46f862f0b0,EST,0 scb=0x7f46f8a4cf90,EST,1 af=(nil),0 sab=(nil),0 cof=0x7f46e0f4bc50,801c0300:H2(0x7f46f8a01220)/SSL(0x7f46f8510d90)/tcpv4(20487) cob=0x7f47045453e0,80300:H1(0x7f46e0ec2320)/RAW((nil))/unix_stream(30778) filters={}]
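When comparing several of these ALERT lines, it can help to split the bracketed stream dump into fields. The following is a minimal sketch, assuming the dump is a space-separated list of `key=value` tokens as in the line above; the function names and the abbreviated `SAMPLE` string are illustrative and are not part of HAProxy:

```python
import re

# Abbreviated copy of the ALERT line above (some fields dropped for brevity).
SAMPLE = (
    "A bogus STREAM [0x7f46f92a3480] is spinning at 142995 calls per second "
    "and refuses to die, aborting now! Please report this error to developers "
    "[strm=0x7f46f92a3480,14284a src=REDACTED fe=tls be=tls "
    "dst=backend_server_7 txn.req=MSG_ERROR,d txn.rsp=MSG_DONE,d filters={}]"
)


def calls_per_second(line: str) -> int:
    """Return the spin rate reported in the ALERT message."""
    match = re.search(r"spinning at (\d+) calls per second", line)
    if match is None:
        raise ValueError("not a spinning-stream ALERT line")
    return int(match.group(1))


def parse_stream_dump(line: str) -> dict[str, str]:
    """Split the trailing [strm=... ] dump into a key -> value mapping.

    Assumption: tokens are separated by whitespace and values contain no spaces.
    """
    start = line.find("[strm=")
    end = line.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no stream dump found")
    fields = {}
    for token in line[start + 1:end].split():
        key, sep, value = token.partition("=")
        if sep:  # skip anything that is not key=value
            fields[key] = value
    return fields


fields = parse_stream_dump(SAMPLE)
print(calls_per_second(SAMPLE), fields["fe"], fields["be"], fields["txn.req"])
```

Grouping crashes by fields such as `fe`, `be`, or `txn.req` is one way to check whether they share a common pattern.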

Every text node appears to be affected, while upload has not had a single crash:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text' 'fgrep haproxy /var/log/kern.log |wc -l'
48 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4037-4044].ulsfo.wmnet
OK to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit: 48
===== NODE GROUP =====                                                                                                                                                            
(1) cp5019.eqsin.wmnet                                                                                                                                                            
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
10                                                                                                                                                                                
===== NODE GROUP =====                                                                                                                                                            
(5) cp[6009-6010,6016].drmrs.wmnet,cp[3050,3058].esams.wmnet                                                                                                                      
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
6                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(2) cp[4038,4040].ulsfo.wmnet                                                                                                                                                     
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
1                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(6) cp[2033,2035,2037,2039].codfw.wmnet,cp[4041,4044].ulsfo.wmnet                                                                                                                 
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
3                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(5) cp[2029,2041].codfw.wmnet,cp[4037,4039,4043].ulsfo.wmnet                                                                                                                      
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
2                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(3) cp1089.eqiad.wmnet,cp5024.eqsin.wmnet,cp3052.esams.wmnet                                                                                                                      
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
8                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(6) cp[2027,2031].codfw.wmnet,cp1083.eqiad.wmnet,cp[5020-5021].eqsin.wmnet,cp4042.ulsfo.wmnet                                                                                     
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
4                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(5) cp1075.eqiad.wmnet,cp[5017,5022].eqsin.wmnet,cp[3056,3062].esams.wmnet                                                                                                        
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
9                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(7) cp[6013-6015].drmrs.wmnet,cp[1077,1081,1085].eqiad.wmnet,cp3054.esams.wmnet                                                                                                   
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
5                                                                                                                                                                                 
===== NODE GROUP =====                                                                                                                                                            
(8) cp[6011-6012].drmrs.wmnet,cp[1079,1087].eqiad.wmnet,cp[5018,5023].eqsin.wmnet,cp[3060,3064].esams.wmnet                                                                       
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
7                                                                                                                                                                                 
================                                                                                                                                                                  
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [00:03<00:00, 14.16hosts/s]
FAIL |                                                                                                                                           |   0% (0/48) [00:03<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'fgrep haproxy /v.../kern.log |wc -l'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-upload' 'fgrep haproxy /var/log/kern.log |wc -l'
48 hosts will be targeted:
cp[2028,2030,2032,2034,2036,2038,2040,2042].codfw.wmnet,cp[6001-6008].drmrs.wmnet,cp[1076,1078,1080,1082,1084,1086,1088,1090].eqiad.wmnet,cp[5025-5032].eqsin.wmnet,cp[3051,3053,3055,3057,3059,3061,3063,3065].esams.wmnet,cp[4045-4052].ulsfo.wmnet
OK to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit: 48
===== NODE GROUP =====                                                                                                                                                            
(48) cp[2028,2030,2032,2034,2036,2038,2040,2042].codfw.wmnet,cp[6001-6008].drmrs.wmnet,cp[1076,1078,1080,1082,1084,1086,1088,1090].eqiad.wmnet,cp[5025-5032].eqsin.wmnet,cp[3051,3053,3055,3057,3059,3061,3063,3065].esams.wmnet,cp[4045-4052].ulsfo.wmnet
----- OUTPUT of 'fgrep haproxy /v.../kern.log |wc -l' -----                                                                                                                       
0                                                                                                                                                                                 
================
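The per-group counts above can be rolled up into cluster totals. This is a rough sketch: the `(hosts, lines)` pairs are transcribed by hand from the cumin node groups, and it assumes every matching `kern.log` line corresponds to one HAProxy crash, which the output alone does not prove:

```python
# (hosts in the node group, matching kern.log lines per host),
# transcribed from the cumin output above.
TEXT_GROUPS = [
    (1, 10), (5, 6), (2, 1), (6, 3), (5, 2),
    (3, 8), (6, 4), (5, 9), (7, 5), (8, 7),
]
UPLOAD_GROUPS = [(48, 0)]


def summarize(groups):
    """Return (total hosts, hosts with at least one match, total matching lines)."""
    hosts = sum(n for n, _ in groups)
    affected = sum(n for n, count in groups if count > 0)
    lines = sum(n * count for n, count in groups)
    return hosts, affected, lines


for name, groups in (("text", TEXT_GROUPS), ("upload", UPLOAD_GROUPS)):
    hosts, affected, lines = summarize(groups)
    print(f"{name}: {affected}/{hosts} hosts affected, {lines} matching lines")
```

Under that assumption, all 48 text hosts are affected with 254 matching lines in total, against none on upload.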

Details

Subject	Repo	Branch	Lines +/-
traffic: Fix HAProxyRestarted dashboard URL	operations/alerts	master	+2 -2
traffic: Add HAProxyRestarted alert	operations/alerts	master	+39 -0
cache::haproxy,prometheus: Track unexpected restarts	operations/puppet	production	+62 -0
prometheus: Track systemd service restarts on >= bullseye	operations/puppet	production	+9 -0


Related Objects

Mentioned In: T334448: HAProxy 2.6.12 segfaults

Event Timeline

Vgutierrez created this task. Mar 22 2023, 4:14 PM

Restricted Application added a subscriber: Aklapper. Mar 22 2023, 4:14 PM

Vgutierrez triaged this task as High priority. Mar 22 2023, 4:15 PM

Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.

BCornwall subscribed. Mar 22 2023, 4:20 PM

Reported upstream in https://github.com/haproxy/haproxy/issues/2088

Maintenance_bot added a project: SRE. Mar 22 2023, 4:29 PM

Mentioned in SAL (#wikimedia-operations) [2023-03-22T16:35:40Z] <vgutierrez> rolling downgrade to HAProxy 2.6.9 in text@esams - T332796

ssingh subscribed. Mar 22 2023, 4:48 PM

Gathering data on esams after downgrading:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp-text_esams' 'apt-cache policy haproxy|grep Installed'
8 hosts will be targeted:
cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet
OK to proceed on 8 hosts? Enter the number of affected hosts to confirm or "q" to quit: 8
===== NODE GROUP =====                                                                                                                                                            
(8) cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet                                                                                                                       
----- OUTPUT of 'apt-cache policy...y|grep Installed' -----                                                                                                                       
  Installed: 2.6.9-1~bpo10+1

Mentioned in SAL (#wikimedia-operations) [2023-03-23T08:04:26Z] <vgutierrez> rolling rollback to HAProxy 2.6.9 in cache text cluster - T332796

Mentioned in SAL (#wikimedia-operations) [2023-03-23T08:11:26Z] <vgutierrez> testing HAProxy 2.6.11 in cp4044 - T332796

cp4044 was missing the HAProxy 2.6.9 package in /var/cache/apt/archives, and 2.6.11 has been released, so I'm testing 2.6.11 there right now.

Change 902303 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cache::haproxy,prometheus: Track unexpected restarts

https://gerrit.wikimedia.org/r/902303

gerritbot added a project: Patch-For-Review. Mar 23 2023, 8:46 AM

Change 902310 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] prometheus: Track systemd service restarts on >= bullseye

https://gerrit.wikimedia.org/r/902310

Change 902310 merged by Vgutierrez:

[operations/puppet@production] prometheus: Track systemd service restarts on >= bullseye

https://gerrit.wikimedia.org/r/902310

Change 902303 abandoned by Vgutierrez:

[operations/puppet@production] cache::haproxy,prometheus: Track unexpected restarts

Reason:

Followed the approach suggested by Filippo in I29c3b6c0ce13ce09f12f2bff3603f3d1d0df2dd1

https://gerrit.wikimedia.org/r/902303

Maintenance_bot removed a project: Patch-For-Review. Mar 23 2023, 10:10 AM

Change 902312 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/alerts@master] traffic: Add HAProxyRestarted alert

https://gerrit.wikimedia.org/r/902312

gerritbot added a project: Patch-For-Review. Mar 23 2023, 10:25 AM

Change 902312 merged by Vgutierrez:

[operations/alerts@master] traffic: Add HAProxyRestarted alert

https://gerrit.wikimedia.org/r/902312
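The merged alert rule itself is not reproduced in this task. As a hypothetical illustration of the idea behind alerting on restarts, the sketch below counts how many restarts occurred across a series of samples of a monotonically increasing restart counter. The function name and the sample data are assumptions, and a drop in the counter is treated as a reset (for example after a reboot):

```python
def restarts_in_window(samples: list[int]) -> int:
    """Count restarts across consecutive samples of a restart counter.

    Hypothetical helper, not the actual alert logic: a decrease between two
    samples is treated as a counter reset, so the new value is counted from zero.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter was reset; everything since is new
    return total


# Made-up samples: the counter goes from 3 to 6 over the window.
print(restarts_in_window([3, 3, 4, 4, 6]))
```

An alert built on this idea would fire when the result is greater than zero for a service that is not expected to restart on its own.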

Change 902319 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/alerts@master] traffic: Fix HAProxyRestarted dashboard URL

https://gerrit.wikimedia.org/r/902319

Mentioned in SAL (#wikimedia-operations) [2023-03-23T11:47:17Z] <vgutierrez> rolling rollback to HAProxy 2.6.9 in cache upload cluster - T332796

Upstream has identified https://github.com/haproxy/haproxy/commit/407210a34d781f8249504557c371c170cb34f93e, introduced in HAProxy 2.6.10, as the potential culprit. Text is already running 2.6.9, and I'm currently downgrading upload for consistency's sake, even though the bug isn't triggered there.

Change 902319 merged by Vgutierrez:

[operations/alerts@master] traffic: Fix HAProxyRestarted dashboard URL

https://gerrit.wikimedia.org/r/902319

Maintenance_bot removed a project: Patch-For-Review. Mar 23 2023, 12:10 PM

Alerting is now in place to detect this kind of issue on future HAProxy updates, and every cp host is back on 2.6.9:

vgutierrez@cumin1001:~$ sudo -i cumin 'A:cp' 'apt-cache policy haproxy|grep Installed'
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit: 96
===== NODE GROUP =====                                                                                                                                                            
(96) cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet                  
----- OUTPUT of 'apt-cache policy...y|grep Installed' -----                                                                                                                       
  Installed: 2.6.9-1~bpo10+1                                                                                                                                                      
================
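A check like the one above can also be scripted against the `apt-cache policy` output. The sketch below is illustrative only: the helper names are made up, and the affected range (2.6.10 up to, but not including, 2.6.12) is an assumption drawn from this task, where 2.6.10 introduced the suspect commit and 2.6.12 ships the patch. It ignores locally patched builds such as the `+wmf2` package.

```python
import re

# Assumed affected range, inferred from this task rather than from upstream docs.
FIRST_BAD = (2, 6, 10)
FIRST_FIXED = (2, 6, 12)


def installed_upstream_version(policy_output: str) -> tuple[int, ...]:
    """Extract the upstream version from the 'Installed:' line as a tuple of ints."""
    match = re.search(r"^\s*Installed:\s*(\S+)", policy_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Installed:' line found")
    version = match.group(1)
    # Drop an optional epoch ("1:") and the Debian revision after the last hyphen.
    upstream = version.split(":", 1)[-1].rsplit("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def is_affected(policy_output: str) -> bool:
    """True if the installed version falls inside the assumed affected range."""
    return FIRST_BAD <= installed_upstream_version(policy_output) < FIRST_FIXED


print(is_affected("  Installed: 2.6.9-1~bpo10+1"))
```

Comparing integer tuples avoids the string-comparison trap where "2.6.10" would sort before "2.6.9".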

Mentioned in SAL (#wikimedia-operations) [2023-03-23T15:12:08Z] <vgutierrez> testing haproxy_2.6.11-1~bpo11+wmf2_amd64.deb in text@ulsfo - T332796

HAProxy 2.6.12 has been released (https://www.mail-archive.com/haproxy@formilux.org/msg43371.html), including the patch that we've been testing in text@ulsfo.

Vgutierrez mentioned this in T334448: HAProxy 2.6.12 segfaults. Apr 11 2023, 2:32 PM
