Page MenuHomePhabricator

linkrecommendation flap their readiness probes too often
Closed, ResolvedPublic

Description

While working on T277741 I noticed that linkrecommendation pods where switching a lot from fully ready to partially ready (2/3 containers only ready). Looking at events and pods I see the following

$kubectl describe pod/linkrecommendation-production-54968bf99-r5sqp

Warning Unhealthy 83s (x120 over 30m) kubelet, kubernetes1008.eqiad.wmnet Readiness probe failed: Get http://10.64.68.207:8000/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

So for some reason /healthz failes every now and then. Still not clear why, filing this task to investigate more later on.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
akosiaris added subscribers: kostajh, Tgr.

Adding @kostajh and @Tgr for their information.

Thanks for the heads up. No idea what's causing that, but I don't think it's specific to our application code. FWIW here is a stack trace of a call to /healthz:

{"written_at": "2021-03-23T11:46:50.547Z", "written_ts": 1616500010547968000, "type": "request", "correlation_id": "74961364-8bcd-11eb-a872-1e00d2265365", "remote_user": "-", "referer": "-", "x_forwarded_for": "-", "protocol": "HTTP/1.1", "method": "GET", "remote_ip": "127.0.0.1", "request_size_b": -1, "remote_host": "127.0.0.1", "remote_port": "60509", "request_received_at": "2021-03-23T11:46:50.547Z", "response_time_ms": 0, "response_status": 200, "response_size_b": 0, "response_content_type": "text/html; charset=utf-8", "response_sent_at": "2021-03-23T11:46:50.547Z", "url": "/healthz"}
--------------------------------------------------------------------------------
PATH: '/healthz'
         729 function calls (728 primitive calls) in 0.001 seconds

   Ordered by: cumulative time
   List reduced from 242 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/middleware/profiler.py:96(runapp)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:2417(wsgi_app)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:1938(full_dispatch_request)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:1955(finalize_request)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:2246(process_response)
        1    0.000    0.000    0.001    0.001 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:52(after_request)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1432(info)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1561(_log)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1587(handle)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1641(callHandlers)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:937(handle)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1069(emit)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:914(format)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/__init__.py:242(format)
        1    0.000    0.000    0.000    0.000 /Users/kostajh/src/research/mwaddlink/src/LogstashAwareJSONRequestLogFormatter.py:11(_format_log_object)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/__init__.py:261(_format_log_object)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:2219(preprocess_request)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:47(before_request)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/util.py:123(get_correlation_id)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:355(push)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/__init__.py:204(__init__)
        3    0.000    0.000    0.000    0.000 {built-in method utcnow}
       16    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:344(__getattr__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:2345(request_context)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/uuid.py:735(uuid1)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:285(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:345(match_request)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1754(match)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:2029(make_response)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:2147(create_url_adapter)
       18    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:300(_get_current_object)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:690(__call__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1551(bind_to_environ)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1546(makeRecord)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:173(__init__)
       42    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:284(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:445(auto_pop)
       24    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:162(top)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/__init__.py:212(update_response_status)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:398(pop)
        4    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:80(get_http_header)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/__init__.py:22(<lambda>)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1647(update)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/json/__init__.py:183(dumps)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:673(get_wsgi_response)
        1    0.000    0.000    0.000    0.000 {built-in method _uuid.generate_time_safe}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/util.py:200(_get_correlation_id_in_request_header)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/json/encoder.py:182(encode)
       13    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/globals.py:35(_lookup_req_object)
        3    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/_internal.py:232(__get__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/app.py:1914(dispatch_request)
        6    0.000    0.000    0.000    0.000 {method 'sort' of 'list' objects}
        8    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/utils.py:85(__get__)
       28    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:70(__getattr__)
        1    0.000    0.000    0.000    0.000 {method 'write' of '_io.TextIOWrapper' objects}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/middleware/profiler.py:92(catching_start_response)
        6    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/datastructures.py:1151(__contains__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/json/encoder.py:204(iterencode)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/gunicorn/http/wsgi.py:223(start_response)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/uuid.py:570(_load_system_functions)
    27/26    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:855(match)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/datastructures.py:1298(__setitem__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/gunicorn/http/wsgi.py:247(process_headers)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1487(bind)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:364(<lambda>)
        3    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/util.py:81(iso_time_format)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:123(get_response_size)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:223(push)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:341(set_data)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:551(get_wsgi_headers)
        5    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/globals.py:42(_lookup_app_object)
        1    0.000    0.000    0.000    0.000 {built-in method posix.uname}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:364(calculate_content_length)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:106(get_content_length)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_request.py:610(<lambda>)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/__init__.py:246(_format_log_object)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/datastructures.py:1212(set)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:149(pop)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/wrappers/base_response.py:644(get_app_iter)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1510(findCaller)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1674(__init__)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:112(get_remote_ip)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/local.py:141(push)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/flask/ctx.py:231(pop)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/uuid.py:132(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:74(get_remote_user)
        5    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1658(<lambda>)
        4    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/datastructures.py:1455(__getitem__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1685(isEnabledFor)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:126(get_content_type)
        3    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1628(_get_wsgi_string)
       56    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:94(get_correlation_id_in_request_context)
        2    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/datastructures.py:952(__init__)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/json_logging/framework/flask/__init__.py:109(get_method)
        5    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/site-packages/werkzeug/routing.py:1660(<lambda>)
        1    0.000    0.000    0.000    0.000 /opt/homebrew/Caskroom/miniforge/base/envs/mwaddlink/lib/python3.8/logging/__init__.py:1058(flush)

Gunicorn says its statsd integration using UDP is not supposed to block on slow consumers, and after a quick look in that direction, I don't think that would be interfering here either.

Change 674316 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[research/mwaddlink@main] Preload the app in gunicorn

https://gerrit.wikimedia.org/r/674316

This seems to be related to the increase of the number of workers. It was a 5x and of course it increased the memory usage of the pod. The increased memory usage, coupled with increase CPU usage once the memory limit is hit (due to heavy GC probably) is evident in grafana[1]

We could just bump the memory limit of the pod but it turns out the gunicorn was a nifty little flag[2] to preload the application before spawning the workers, allowing considerable memory savings. The drawback is that reloads aren't that easy anymore, but in our kubernetes environment were everything is a redeployment and restart of everything this is unimportant.

Patch at https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/674316

[1] https://grafana.wikimedia.org/d/CI6JRnLMz/linkrecommendation?viewPanel=127&orgId=1&var-dc=thanos&var-site=codfw&var-service=linkrecommendation&var-prometheus=k8s&var-container_name=All&from=now-30d&to=now
[2] https://docs.gunicorn.org/en/0.16.0/configure.html#preload-app

Change 674316 merged by jenkins-bot:
[research/mwaddlink@main] Preload the app in gunicorn

https://gerrit.wikimedia.org/r/674316

Change 674350 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/674350

Change 674350 merged by jenkins-bot:
[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/674350