Page MenuHomePhabricator

ATS lua script reload doesn't work as expected
Closed, ResolvedPublic

Description

Our current puppetization touches remap.config and reloads the TS service to reload modified lua scripts, this wasn't enough to apply the changes introduced by https://gerrit.wikimedia.org/r/#/q/I6c7c473cfcd3ce3fc3b7c0552018badf5fb1c9aa making necessary a service restart to make the changes effective.

Event Timeline

Vgutierrez triaged this task as Medium priority.Sep 19 2019, 5:34 AM
Vgutierrez moved this task from Backlog to Caching on the Traffic board.

The ability to reload global lua scripts has been added in 9.x: https://github.com/apache/trafficserver/commit/e6147753cd65c3edd32b365e09b4d65edcffdd01

This explains why reloading a global tls.lua script does not work. We might want to consider backporting that patch.

Reloading per-remap lua scripts works just fine on 8.x, provided that remap.config has changed (ie: touch remap.config ; traffic_ctl config reload). We need to update our puppetization accordingly.

nice catch, I have to backport the SSL EC cache PR... so I guess I'll include this one as well

Change 543022 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] ATS: Enable reloading global lua script

https://gerrit.wikimedia.org/r/543022

Change 543022 merged by Vgutierrez:
[operations/puppet@production] ATS: Enable reloading global lua script

https://gerrit.wikimedia.org/r/543022

Mentioned in SAL (#wikimedia-operations) [2019-10-22T06:32:00Z] <vgutierrez> rolling restart of ats-tls - T233274 T234803

Vgutierrez claimed this task.
Vgutierrez removed a project: Patch-For-Review.

Change 545522 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: do not pass enable-reload to tslua

https://gerrit.wikimedia.org/r/545522

The solution proposed in https://gerrit.wikimedia.org/r/543022 doesn't work as expected due to a bug on ATS. after a config reload the lua script loses the argtb

Change 545522 merged by Ema:
[operations/puppet@production] ATS: do not pass enable-reload to tslua

https://gerrit.wikimedia.org/r/545522

Mentioned in SAL (#wikimedia-operations) [2019-10-23T12:31:22Z] <vgutierrez> restarting ats-tls on cache text nodes - T233274

Change 552201 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: enable reload for global Lua script

https://gerrit.wikimedia.org/r/552201

Change 552201 merged by Ema:
[operations/puppet@production] ATS: enable reload for global Lua script

https://gerrit.wikimedia.org/r/552201

Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:03:03Z] <ema> cp1075: ats-backend-restart to enable lua reload T233274

Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:11:09Z] <ema> cp1075: ats-tls-restart to enable lua reload T233274

Mentioned in SAL (#wikimedia-operations) [2019-11-25T15:20:33Z] <ema> cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274

I've debugged locally what we seen yesterday on production with the following lua script:

WEBSOCKET_SUPPORT = nil
function __init__(argtb)
    dofile("/etc/trafficserver/lua/config.lua")
    ts.debug("On init WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))
end
function __reload__()
    dofile("/etc/trafficserver/lua/config.lua")
    ts.debug("On reload WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))
end
function get_websocket_support()
    return WEBSOCKET_SUPPORT
end
function do_global_send_request()
    ts.debug("WEBSOCKET_SUPPORT = " .. tostring(WEBSOCKET_SUPPORT))

On ATS start we get the following debug log:

{0x2afa382f8080} DIAG: (ts_lua) On init WEBSOCKET_SUPPORT = true

On each request the following debug line gets logged:

{0x2afa44304700} DIAG: (ts_lua) WEBSOCKET_SUPPORT = true

The weird stuff starts after issuing traffic_ctl config reload:

{0x2afa46d90700} DIAG: (ts_lua) On reload WEBSOCKET_SUPPORT = true

So far so good, but now on each request the following debug line is logged:

{0x2afa44304700} DIAG: (ts_lua) WEBSOCKET_SUPPORT = nil

@Vgutierrez - I really think, reading the Lua plugin code, that __reload__ in 8.0.x might not do what you'd sanely expect (although it is undocumented). I think the __reload__ hook is actually more like a destructor hook for any custom destruct actions you want to happen before the reload into a fresh Lua context. And reload also definitely doesn't hit __init__ either, which leaves the whole model seeming a little broken if you've got a global initialized to nil in its declaration and then initialized to a real value in __init__. Clearly we're missing something here...

Change 552955 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: enable reload for global Lua script

https://gerrit.wikimedia.org/r/552955

I've debugged locally what we seen yesterday on production with the following lua script:

WEBSOCKET_SUPPORT = nil

Try adding ts.debug("WEBSOCKET_SUPPORT is set to " .. WEBSOCKET_SUPPORT) after this line and you'll see that the variable is indeed set to nil after __reload__ is called.

HOSTNAME = nil
ts.error("HOSTNAME set to nil")

function read_config()
        local confffile = ts.get_config_dir() .. "/default.lua.conf"
        ts.error("Loading HOSTNAME from " .. confffile)
        dofile(confffile)
        ts.error("HOSTNAME is " .. HOSTNAME)
        assert(HOSTNAME, "Cannot read " .. confffile)
end

function __init__(argtb)
        read_config()
end

function __reload__()
        read_config()
end

function do_global_send_response()
        ts.client_response.header['X-Cache-Int'] = 'gelato: '.. HOSTNAME
end

On startup I get:

[Nov 26 06:43:17.542] {0x2b47b449e180} ERROR: HOSTNAME set to nil
[Nov 26 06:43:17.542] {0x2b47b449e180} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf
[Nov 26 06:43:17.542] {0x2b47b449e180} ERROR: HOSTNAME is banana

On reload:

[Nov 26 06:43:46.292] {0x2b47c2d68700} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf
[Nov 26 06:43:46.292] {0x2b47c2d68700} ERROR: HOSTNAME is banana
[Nov 26 06:43:46.292] {0x2b47c2d68700} ERROR: HOSTNAME set to nil

Tried this on a test instance instead:

function read_config()
        local confffile = ts.get_config_dir() .. "/default.lua.conf"

        ts.error("Loading HOSTNAME from " .. confffile)
        dofile(confffile)

        assert(HOSTNAME, "Cannot read " .. confffile)
        ts.error("HOSTNAME is " .. HOSTNAME)

        return HOSTNAME
end

local HOSTNAME = read_config()
ts.error("HOSTNAME now set to " .. HOSTNAME)

function do_global_send_response()
        ts.client_response.header['X-Cache-Int'] = 'gelato: '.. HOSTNAME
end

On startup:

[Nov 26 06:58:19.807] {0x2b74fa91c180} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf
[Nov 26 06:58:19.807] {0x2b74fa91c180} ERROR: HOSTNAME is banana potato
[Nov 26 06:58:19.807] {0x2b74fa91c180} ERROR: HOSTNAME now set to banana potato
$  curl -v http://localhost:8080 2>&1 | grep X-Cache-Int
< X-Cache-Int: gelato: banana potato

Changed default.lua.conf, issued a reload:

[Nov 26 07:00:34.184] {0x2b7525264700} ERROR: Loading HOSTNAME from /etc/trafficserver/default.lua.conf
[Nov 26 07:00:34.184] {0x2b7525264700} ERROR: HOSTNAME is banana hehe
[Nov 26 07:00:34.184] {0x2b7525264700} ERROR: HOSTNAME now set to banana hehe
$ curl -v http://localhost:8080 2>&1 | grep X-Cache-Int
< X-Cache-Int: gelato: banana hehe

Here's my proposed patch to add this behavior to production: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/552955/

Change 552955 merged by Ema:
[operations/puppet@production] ATS: enable reload for global Lua script

https://gerrit.wikimedia.org/r/552955

Mentioned in SAL (#wikimedia-operations) [2019-11-27T13:28:16Z] <ema> cp1075: ats-{tls,backend} restarted to apply tslua reload changes T233274

Mentioned in SAL (#wikimedia-operations) [2019-11-27T13:42:25Z] <ema> cp1075: repool with tslua reloads enabled T233274

Change 553344 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: use ts.debug instead of error in global Lua scripts

https://gerrit.wikimedia.org/r/553344

Change 553344 merged by Ema:
[operations/puppet@production] ATS: use ts.debug instead of error in global Lua scripts

https://gerrit.wikimedia.org/r/553344

Change 553345 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: ensure Lua conf Icinga checks produce output

https://gerrit.wikimedia.org/r/553345

Change 553345 merged by Ema:
[operations/puppet@production] ATS: ensure Lua conf Icinga checks produce output

https://gerrit.wikimedia.org/r/553345

Mentioned in SAL (#wikimedia-operations) [2019-11-27T15:04:00Z] <ema> cp-ats: rolling ats-{tls,backend} restart to enable lua reload T233274

Reloading global Lua scripts now works. Closing.