HAProxy failing to start on deployment-cache-text08 and deployment-cache-upload08 because of missing `traffic_class.lua` library
Closed, Resolved | Public | BUG REPORT

Description

While processing a Beta Cluster unblock request (T415100), I ran my usual `sudo -i puppet agent -tv; sudo -i service haproxy reload` commands on the cache nodes. Both nodes now report that HAProxy is crashing rather than restarting.

  • Additional Puppet runs show no changes.
  • haproxy -f /etc/haproxy/haproxy.cfg -c reports "Configuration file is valid"
  • Rebooting deployment-cache-text08 had no effect

Event Timeline

bd808 triaged this task as High priority. Tue, Jan 20, 7:24 PM
Jan 20 16:51:40 deployment-cache-text08 systemd[1]: haproxy.service: Control process exited, code=exited, status=1/FAILURE
Jan 20 16:51:40 deployment-cache-text08 systemd[1]: Reload failed for HAProxy Load Balancer.

The first failure shows up at 16:51, but the only related change is https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/16e4348934a238041087699d588b8f3d09abb9c7, which was pushed at 17:09, so it is unlikely to be the cause. The journal is not very helpful either.

sukhe@deployment-cache-text08:~$ sudo haproxy -f /etc/haproxy/conf.d/tls.cfg 
[NOTICE]   (17847) : haproxy version is 2.8.18-1~bpo11+1
[NOTICE]   (17847) : path to executable is /usr/sbin/haproxy
[ALERT]    (17847) : config : parsing [/etc/haproxy/conf.d/tls.cfg:164] : error detected in proxy 'tls' while parsing 'http-request set-var(req.provenance,ifnotexists,ifnotempty)' rule : unknown fetch method 'lua.fetch_isp'.
[ALERT]    (17847) : config : Error(s) found in configuration file : /etc/haproxy/conf.d/tls.cfg

@SLyngshede-WMF: See if you can find time to look into this when you come online, or I will tomorrow. Thanks!

Running the config check with all of the config files gives a different error:

bd808@deployment-cache-text08.deployment-prep.eqiad1:~$ sudo haproxy -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/conf.d -c
[NOTICE]   (24908) : haproxy version is 2.8.18-1~bpo11+1
[NOTICE]   (24908) : path to executable is /usr/sbin/haproxy
[ALERT]    (24908) : config : parsing [/etc/haproxy/conf.d/tls.cfg:218]: 'http-request' expects 'wait-for-handshake', 'set-log-level', 'set-nice', 'use-service', 'sc-add-gpc(*)', 'sc-inc-gpc(*)', 'sc-inc-gpc0(*)', 'sc-inc-gpc1(*)', 'sc-set-gpt(*)', 'sc-set-gpt0(*)', 'send-spoe-group', 'do-resolve(*)', 'cache-use', 'add-acl(*)', 'add-header', 'allow', 'auth', 'capture', 'del-acl(*)', 'del-header', 'del-map(*)', 'deny', 'disable-l7-retry', 'early-hint', 'normalize-uri', 'redirect', 'reject', 'replace-header', 'replace-path', 'replace-pathq', 'replace-uri', 'replace-value', 'return', 'set-header', 'set-map(*)', 'set-method', 'set-path', 'set-pathq', 'set-query', 'set-uri', 'strict-mode', 'tarpit', 'track-sc(*)', 'set-timeout', 'wait-for-body', 'set-var-fmt(*)', 'set-var(*)', 'unset-var(*)', 'set-dst', 'set-dst-port', 'set-mark', 'set-src', 'set-src-port', 'set-tos', 'silent-drop', 'set-priority-class', 'set-priority-offset', 'set-bandwidth-limit', 'lua.is_datacenter', 'lua.res_proxy', 'lua.set_contact_info', but got 'lua.check_traffic_class'.
[ALERT]    (24908) : config : Error(s) found in configuration file : /etc/haproxy/conf.d/tls.cfg
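
The "expects ... but got" alert above is long enough that the useful part is easy to miss: the `lua.*` entries in the expected list appear to be the Lua actions HAProxy actually loaded, and the "but got" keyword is the one it could not find. A minimal sketch for pulling those two pieces out of such an alert, assuming the message keeps the shape shown above (the function name `parse_http_request_alert` is made up for illustration):

```python
import re
from typing import List, Optional, Tuple


def parse_http_request_alert(alert: str) -> Tuple[List[str], Optional[str]]:
    """Split an "expects ..., but got '...'" alert into its lua.* parts.

    Returns (expected keywords starting with "lua.", rejected keyword).
    Returns ([], None) when the text does not have that shape.
    """
    match = re.search(r"expects (?P<expected>.*), but got '(?P<got>[^']+)'", alert)
    if match is None:
        return [], None
    # Every expected keyword is single-quoted in the alert text.
    expected = re.findall(r"'([^']+)'", match.group("expected"))
    lua_keywords = [name for name in expected if name.startswith("lua.")]
    return lua_keywords, match.group("got")


if __name__ == "__main__":
    # Shortened copy of the alert from this task; the full keyword list is elided here.
    sample = (
        "[ALERT]    (24908) : config : parsing [/etc/haproxy/conf.d/tls.cfg:218]: "
        "'http-request' expects 'allow', 'deny', 'lua.is_datacenter', "
        "'lua.res_proxy', 'lua.set_contact_info', but got 'lua.check_traffic_class'."
    )
    loaded, rejected = parse_http_request_alert(sample)
    print("lua keywords in expected list:", loaded)
    print("rejected keyword:", rejected)
```

Run against the alert in this task, the expected list contains `lua.is_datacenter`, `lua.res_proxy` and `lua.set_contact_info` but not `lua.check_traffic_class`, which is consistent with the library that defines it not being loaded on these hosts.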

I wonder if that lua.check_traffic_class method is coming from a private location in production?

bd808@mbp03:~/projects/wmf/operations/puppet$ git grep check_traffic_class
modules/profile/templates/cache/haproxy/tls_terminator.cfg.erb:    http-request lua.check_traffic_class
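
The grep finds only the call site and no definition, which is what hints at a private source. A rough sketch of the same check done mechanically: collect every `lua.<name>` referenced in a rendered config and subtract the names registered through HAProxy's `core.register_action`, `core.register_fetches`, `core.register_converters` or `core.register_service` in the Lua sources that are available. The helper name `undefined_lua_names` and the sample strings are hypothetical, and the regexes are deliberately simple (for example, `#` inside a quoted string would be treated as a comment).

```python
import re
from typing import Iterable, List

# References in the HAProxy config, e.g. "http-request lua.check_traffic_class".
LUA_REF = re.compile(r"\blua\.([A-Za-z_][A-Za-z0-9_]*)")
# Registrations in Lua source, e.g. core.register_action("name", ...).
LUA_REGISTER = re.compile(
    r"core\.register_(?:action|fetches|converters|service)\(\s*[\"']([^\"']+)[\"']"
)


def undefined_lua_names(config_text: str, lua_sources: Iterable[str]) -> List[str]:
    """Return lua.* names used in the config that no given Lua source registers."""
    referenced = set()
    for line in config_text.splitlines():
        # Drop trailing comments so commented-out rules are ignored.
        code = line.split("#", 1)[0]
        referenced.update(LUA_REF.findall(code))
    registered = set()
    for source in lua_sources:
        registered.update(LUA_REGISTER.findall(source))
    return sorted(referenced - registered)


if __name__ == "__main__":
    # Hypothetical config excerpt and Lua snippets, for illustration only.
    config = """
        http-request set-var(req.provenance) lua.fetch_isp
        http-request lua.set_contact_info
        # Temp fix for bot-password
        http-request lua.check_traffic_class
    """
    lua_files = [
        'core.register_fetches("fetch_isp", function(txn) return "x" end)',
        "core.register_action('set_contact_info', {'http-req'}, function(txn) end)",
    ]
    print(undefined_lua_names(config, lua_files))
```

Anything this reports is a name the config depends on that the public Lua files do not provide, so the rendered config can only work where some other source supplies it.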

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1229141 is looking pretty suspicious.

bd808 renamed this task from HAProxy failing to start on deployment-cache-text08 and deployment-cache-upload08 to HAProxy failing to start on deployment-cache-text08 and deployment-cache-upload08 because of missing `traffic_class.lua` library. Tue, Jan 20, 8:29 PM

Change #1229186 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] haproxy: guard call to private function with feature flag

https://gerrit.wikimedia.org/r/1229186

bd808@deployment-cache-text08.deployment-prep.eqiad1:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(f43e818c6d) git-sync-upstream - haproxy: guard call to private function with feature flag'
Notice: /Stage[main]/Profile::Cache::Haproxy/Haproxy::Site[tls]/File[/etc/haproxy/conf.d/tls.cfg]/content:
--- /etc/haproxy/conf.d/tls.cfg 2026-01-20 16:51:38.138281007 +0000
+++ /tmp/puppet-file20260120-30406-xkqdin       2026-01-20 20:38:14.131855754 +0000
@@ -214,8 +214,7 @@
     # bots that honour our UA and so have contact info go in class D, unless set to another value before. This means that requests coming from abusive networks will
     # still get an F score.
     http-request set-var(req.trusted_request,ifnotexists) str(D) if { var(req.ua_class) -m str "robot" }
-    # Temp fix for bot-password
-    http-request lua.check_traffic_class
+

     # Fall back to E if no trust score has been determined yet and set X-Trusted-Request to its
     # final authoritative value.

Notice: /Stage[main]/Profile::Cache::Haproxy/Haproxy::Site[tls]/File[/etc/haproxy/conf.d/tls.cfg]/content: content changed '{sha256}6965c933d38a0595a36c9d36a2b772c66612897acbc484a947a162366141e346' to '{sha256}72bf35baeabd570f9d2880982670185a3b4f417bc7bfb2a3def012330c795096'
Info: /Stage[main]/Profile::Cache::Haproxy/Haproxy::Site[tls]/File[/etc/haproxy/conf.d/tls.cfg]: Scheduling refresh of Service[haproxy]
Notice: /Stage[main]/Haproxy/Systemd::Service[haproxy]/Service[haproxy]/ensure: ensure changed 'stopped' to 'running' (corrective)
Info: /Stage[main]/Haproxy/Systemd::Service[haproxy]/Service[haproxy]: Unscheduling refresh on Service[haproxy]
Notice: Applied catalog in 15.86 seconds
bd808 changed the task status from Open to In Progress. Tue, Jan 20, 8:44 PM
bd808 claimed this task.
bd808 moved this task from Backlog to Puppet errors on the Beta-Cluster-Infrastructure board.
bd808 added a subscriber: Joe.

Cherry-picking the patch has things running again. Hopefully @SLyngshede-WMF or @Joe can make time to review and merge my patch upstream in ops/puppet.git. Thanks for the nudge in the right direction, @ssingh.

Change #1229186 merged by Majavah:

[operations/puppet@production] haproxy: guard call to private function with feature flag

https://gerrit.wikimedia.org/r/1229186