Our current puppetization touches remap.config and reloads the TS service to reload modified lua scripts, this wasn't enough to apply the changes introduced by https://gerrit.wikimedia.org/r/#/q/I6c7c473cfcd3ce3fc3b7c0552018badf5fb1c9aa making necessary a service restart to make the changes effective.
Description
Details
Related Objects
Event Timeline
The ability to reload global lua scripts has been added in 9.x: https://github.com/apache/trafficserver/commit/e6147753cd65c3edd32b365e09b4d65edcffdd01
This explains why reloading a global tls.lua script does not work. We might want to consider backporting that patch.
Reloading per-remap lua scripts works just fine on 8.x, provided that remap.config has changed (ie: touch remap.config ; traffic_ctl config reload). We need to update our puppetization accordingly.
nice catch, I have to backport the SSL EC cache PR... so I guess I'll include this one as well
hmmm it looks like the commit is on 8.x but the feature is disabled by default, see https://github.com/apache/trafficserver/pull/3940/commits/cc7b7dd7540fb33d93aa969defb416461c0202e7
Change 543022 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable reloading global lua script
Change 543022 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable reloading global lua script
Mentioned in SAL (#wikimedia-operations) [2019-10-22T06:32:00Z] <vgutierrez> rolling restart of ats-tls - T233274 T234803
Change 545522 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: do not pass enable-reload to tslua
The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug on ATS. after a config reload the lua script loses the argtb
Change 545522 merged by Ema:
[operations/puppet@production] ATS: do not pass enable-reload to tslua
Mentioned in SAL (#wikimedia-operations) [2019-10-23T10:46:04Z] <ema> cp-ats: rolling ATS backend restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545522/ T233274
Mentioned in SAL (#wikimedia-operations) [2019-10-23T12:31:22Z] <vgutierrez> restarting ats-tls on cache text nodes - T233274
Change 552201 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: enable reload for global Lua script
Change 552201 merged by Ema:
[operations/puppet@production] ATS: enable reload for global Lua script
Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:03:03Z] <ema> cp1075: ats-backend-restart to enable lua reload T233274
Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:11:09Z] <ema> cp1075: ats-tls-restart to enable lua reload T233274
Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:20:33Z] <ema> cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274
I've debugged locally what we seen yesterday on production with the following lua script:
WEBSOCKET_SUPPORT = nil function __init__(argtb) dofile("/etc/trafficserver/lua/config.lua") ts.debug("On init WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT)) end function __reload__() dofile("/etc/trafficserver/lua/config.lua") ts.debug("On reload WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT)) end function get_websocket_support() return WEBSOCKET_SUPPORT end function do_global_send_request() ts.debug("WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))
On ATS start we get the following debug log:
{0x2afa382f8080} DIAG: (ts_lua) On init WEBSOCKET_SUPPORT = true
On each request the following debug line gets logged:
{0x2afa44304700} DIAG: (ts_lua) WEBSOCKET_SUPPORT = true
The weird stuff starts after issuing traffic_ctl config reload:
{0x2afa46d90700} DIAG: (ts_lua) On reload WEBSOCKET_SUPPORT = true
So far so good, but now on each request the following debug line is logged:
{0x2afa44304700} DIAG: (ts_lua) WEBSOCKET_SUPPORT = nil
@Vgutierrez - I really think, reading the Lua plugin code, that __reload__ in 8.0.x might not do what you'd sanely expect (although it is undocumented). I think the __reload__ hook is actually more like a destructor hook for any custom destruct actions you want to happen before the reload into a fresh Lua context. And reload also definitely doesn't hit __init__ either, which leaves the whole model seeming a little broken if you've got a global initialized to nil in its declaration and then initialized to a real value in __init__. Clearly we're missing something here...
Change 552955 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: enable reload for global Lua script
Try adding ts.debug("WEBSOCKET_SUPPORT is set to " .. WEBSOCKET_SUPPORT) after this line and you'll see that the variable is indeed set to nil after __reload__ is called.
HOSTNAME = nil ts.error("HOSTNAME set to nil") function read_config() local confffile = ts.get_config_dir() .. "/default.lua.conf" ts.error("Loading HOSTNAME from " .. confffile) dofile(confffile) ts.error("HOSTNAME is " .. HOSTNAME) assert(HOSTNAME, "Cannot read " .. confffile) end function __init__(argtb) read_config() end function __reload__() read_config() end function do_global_send_response() ts.client_response.header['X-Cache-Int'] = 'gelato: '.. HOSTNAME end
On startup I get:
[Nov 26 06:43:17.542] {0x2b47b449e180} ERROR: HOSTNAME set to nil [Nov 26 06:43:17.542] {0x2b47b449e180} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf [Nov 26 06:43:17.542] {0x2b47b449e180} ERROR: HOSTNAME is banana
On reload:
[Nov 26 06:43:46.292] {0x2b47c2d68700} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf [Nov 26 06:43:46.292] {0x2b47c2d68700} ERROR: HOSTNAME is banana [Nov 26 06:43:46.292] {0x2b47c2d68700} ERROR: HOSTNAME set to nil
Tried this on a test instance instead:
function read_config() local confffile = ts.get_config_dir() .. "/default.lua.conf" ts.error("Loading HOSTNAME from " .. confffile) dofile(confffile) assert(HOSTNAME, "Cannot read " .. confffile) ts.error("HOSTNAME is " .. HOSTNAME) return HOSTNAME end local HOSTNAME = read_config() ts.error("HOSTNAME now set to " .. HOSTNAME) function do_global_send_response() ts.client_response.header['X-Cache-Int'] = 'gelato: '.. HOSTNAME end
On startup:
[Nov 26 06:58:19.807] {0x2b74fa91c180} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf [Nov 26 06:58:19.807] {0x2b74fa91c180} ERROR: HOSTNAME is banana potato [Nov 26 06:58:19.807] {0x2b74fa91c180} ERROR: HOSTNAME now set to banana potato
$ curl -v http://localhost:8080 2>&1 | grep X-Cache-Int < X-Cache-Int: gelato: banana potato
Changed default.lua.conf, issued a reload:
[Nov 26 07:00:34.184] {0x2b7525264700} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf [Nov 26 07:00:34.184] {0x2b7525264700} ERROR: HOSTNAME is banana hehe [Nov 26 07:00:34.184] {0x2b7525264700} ERROR: HOSTNAME now set to banana hehe
$ curl -v http://localhost:8080 2>&1 | grep X-Cache-Int < X-Cache-Int: gelato: banana hehe
Here's my proposed patch to add this behavior to production: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552955/
Change 552955 merged by Ema:
[operations/puppet@production] ATS: enable reload for global Lua script
Mentioned in SAL (#wikimedia-operations) [2019-11-27T11:06:57Z] <ema> cp1075: depool to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552955/ and test tslua reloads T233274
Mentioned in SAL (#wikimedia-operations) [2019-11-27T13:28:16Z] <ema> cp1075: ats-{tls,backend} restarted to apply tslua reload changes T233274
Mentioned in SAL (#wikimedia-operations) [2019-11-27T13:42:25Z] <ema> cp1075: repool with tslua reloads enabled T233274
Change 553344 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: use ts.debug instead of error in global Lua scripts
Change 553344 merged by Ema:
[operations/puppet@production] ATS: use ts.debug instead of error in global Lua scripts
Change 553345 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: ensure Lua conf Icinga checks produce output
Change 553345 merged by Ema:
[operations/puppet@production] ATS: ensure Lua conf Icinga checks produce output
Mentioned in SAL (#wikimedia-operations) [2019-11-27T15:04:00Z] <ema> cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274