
Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy
Closed, Resolved, Public

Description

We have added lua support to tlsproxy and left it disabled by default. Based on nginx-lua, we can now expose nginx metrics to prometheus, such as request latency and response status.

The performance impact of those features needs to be evaluated before enabling them in production.
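
For reference, the general shape of an nginx-lua-prometheus setup is sketched below. This is purely illustrative and not our actual tlsproxy configuration: the metric names, shared dictionary size and exporter port are made up for the example.

# Illustrative sketch only, not the tlsproxy configuration
lua_shared_dict prometheus_metrics 10M;

init_worker_by_lua_block {
    prometheus = require("prometheus").init("prometheus_metrics")
    -- hypothetical metric names
    http_requests = prometheus:counter("nginx_http_requests_total",
        "Number of HTTP requests", {"status"})
    http_latency = prometheus:histogram("nginx_http_request_duration_seconds",
        "HTTP request latency", {"host"})
}

log_by_lua_block {
    -- record response status and request latency for every request
    http_requests:inc(1, {ngx.var.status})
    http_latency:observe(tonumber(ngx.var.request_time), {ngx.var.host})
}

server {
    listen 9145;  # hypothetical exporter port
    location /metrics {
        content_by_lua_block { prometheus:collect() }
    }
}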

Event Timeline

ema triaged this task as Medium priority. Mar 22 2017, 11:41 AM

Methodology

We have benchmarked nginx performance on pinkunicorn (cp1008). An additional nginx location block has been added to ensure that the benchmark actually tested the performance of nginx itself rather than other parts of the stack:

# /etc/nginx/sites-enabled/unified
location = /lua_bench {
    return 200;
}

hey has been used as an HTTP load generator with the following options:

# -cpus: number of CPUs to use, -c: concurrent requests, -n: total number of requests
hey -cpus 8 -c 200 -n 50000 https://localhost/lua_bench

Four different nginx configurations have been tested:

no-lua: nginx lua module disabled
lua-no-prometheus: nginx lua module enabled, prometheus metrics disabled
prometheus-no-latency: nginx lua module and prometheus metrics enabled, status code only
prometheus-full: nginx lua module and prometheus metrics enabled, status code and latency

For each configuration we have performed 200 runs.
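
As a rough sketch (not the exact commands used), the runs can be driven with a small shell loop, assuming hey's summary output includes a Requests/sec line; the configuration name and output file below are made up:

# sketch: collect requests/sec over 200 runs of a single configuration
config=no-lua
for i in $(seq 1 200); do
    hey -cpus 8 -c 200 -n 50000 https://localhost/lua_bench \
        | awk '/Requests\/sec/ { print $2 }' >> "rps-${config}.txt"
done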

Results

Number of requests per second served for each configuration.

| Configuration | Min | Median | Max | Standard Deviation |
| --- | --- | --- | --- | --- |
| no-lua | 21986.22 | 29382.30 | 32182.19 | 1091.46 |
| lua-no-prometheus | 21305.12 | 30003.09 | 33827.59 | 2186.90 |
| prometheus-no-latency | 19461.72 | 25853.65 | 27188.88 | 1106.30 |
| prometheus-full | 15642.42 | 17847.48 | 18563.96 | 468.22 |

Conclusions

The nginx-lua module can be enabled immediately, as it introduces no measurable slowdown in nginx. Prometheus metrics support, as it stands, adds a significant performance penalty and should not be enabled. However, it is encouraging that reducing the scope to response status only (thus excluding latency measurements) also reduces the penalty. We should tune the lua script and see whether the slowdown can be brought down to acceptable levels.

nginx-lua-prometheus uses a dictionary in shared memory that gets updated by the nginx worker processes.

I've tried two additional approaches to see whether contention could be a factor in the performance decrease: (i) reduce the lua code to a minimum while still using shared memory; (ii) keep stats on a per-worker basis and only update the shared data structure twice a second through ngx.timer.at.

i) simple approach with shared dictionary

lua_shared_dict global_count 5M;

log_by_lua_block {
    -- Increment the shared counter for this response status; create it on first use
    local global_count = ngx.shared.global_count
    local newval, err = global_count:incr(ngx.var.status, 1)
    if not newval and err == "not found" then
        global_count:set(ngx.var.status, 1)
    end
}
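
Note that newer lua-nginx-module releases also accept an initial value as a third argument to incr, which would make the separate set fallback unnecessary; whether that is usable depends on the nginx-lua version we ship.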

ii) local updates

lua_shared_dict global_count 5M;

init_worker_by_lua_block {
    -- Bootstrap the periodic flush of per-worker counters into the shared dict
    local worker_status_codes = require("prometheus")
    worker_status_codes.update_global_count()
}

log_by_lua_block {
    -- Increment per-worker stats.
    local worker_status_codes = require("prometheus")
    worker_status_codes.update_local_count(ngx.var.status, 1)
}

-- /etc/nginx/lua/prometheus.lua
-- See https://github.com/openresty/lua-nginx-module/#data-sharing-within-an-nginx-worker
local _M = {}

local data = {}

function _M.update_local_count(status, value)
    if data[status] then
        data[status] = data[status] + value
    else
        data[status] = value
    end
end

function _M.update_global_count()
    local global_count = ngx.shared.global_count

    -- data contains per-worker status codes
    for key, value in pairs(data) do
        -- incr and set are operations on global_count which is shared among nginx worker processes
        local newval, err = global_count:incr(key, value)
        if not newval and err == "not found" then
            global_count:set(key, value)
        end
    end

    -- reset local counter
    data = {}

    -- schedule next global counter update in 0.5s
    ngx.timer.at(0.5, _M.update_global_count)
end

return _M
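
For completeness, the aggregated per-status counters in the shared dictionary could then be exposed for prometheus to scrape with a small content handler; the location path and metric name below are hypothetical, just to sketch the idea:

# sketch only: path and metric name are hypothetical
location /lua_status_metrics {
    content_by_lua_block {
        local global_count = ngx.shared.global_count
        -- get_keys(0) returns all keys currently stored in the dictionary
        for _, status in ipairs(global_count:get_keys(0)) do
            ngx.say('nginx_responses_total{status="', status, '"} ',
                    global_count:get(status))
        end
    }
}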

Performance-wise, there doesn't seem to be any significant advantage in following the more complicated approach:

| Configuration | Min (req/s) | Median (req/s) | Max (req/s) | Standard Deviation |
| --- | --- | --- | --- | --- |
| shared-dict | 21009.09 | 28735.77 | 31436.54 | 1389.97 |
| local-updates | 21965.24 | 28329.30 | 31815.98 | 1139.40 |

Change 345123 had a related patch set uploaded (by Ema):
[operations/puppet@production] tlsproxy: simplify prometheus metrics gathering

https://gerrit.wikimedia.org/r/345123

Change 345123 merged by Ema:
[operations/puppet@production] tlsproxy: simplify prometheus metrics gathering

https://gerrit.wikimedia.org/r/345123

@ema, it seems like the task as described has been completed (awesome work and great presentation btw!). Is there anything left to be done or shall we resolve this task?

ema claimed this task.

> @ema, it seems like the task as described has been completed (awesome work and great presentation btw!). Is there anything left to be done or shall we resolve this task?

Thanks! Yeah, closing given that the performance impact has been evaluated. We still need to enable the feature in prod, but that's unrelated to this task.