
"ftl" tool's perl5.32 webservice pod being frequently killed due to liveness probe failures
Open, Needs Triage, Public, BUG REPORT

Description

$ kubectl get pods
NAME                   READY   STATUS    RESTARTS       AGE
ftl-5646966575-rr2vc   1/1     Running   30 (19m ago)   6h20m
$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT                     MESSAGE
32m         Normal    Pulling     pod/ftl-5646966575-rr2vc   Pulling image "docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-web:latest"
22m         Warning   Unhealthy   pod/ftl-5646966575-rr2vc   Liveness probe failed: dial tcp 192.168.127.48:8000: i/o timeout
[16:57]  <JMarkOckerbloom>  My application runs on tools-sgebastion-10, and today (and part of yesterday) it's apparently been hanging a lot.  Recent error logs record numerous restarts and kills with signal 9.  Anyone know what might be happening there (or if there have been any recent changes there)?
[17:04]  <    bd808> JMarkOckerbloom: what sort of process are you trying to run on the bastion that is leading to troubles?
[17:05]  <    bd808> I ask because there are things like Wheel of Misfortune that actively look for rogue processes to kill there
[17:12]  <JMarkOckerbloom> This is the ftl service (a CGI script implemented in perl).  It's been running for a number of years, but today and yesterday it seems to be hanging and getting restarted a fair bit.
[17:18]  <    bd808> JMarkOckerbloom: if it has been running on a bastion for a number of years, it has been in violation of Toolforge rules for several years. Webservices, bot jobs, etc. should always run on Kubernetes (and formerly grid engine). The bastions are for humans to start distributed tasks and do light file editing.
[17:21]  <    bd808> It looks like ftl is actually running its webservice on Kubernetes
[17:21]  <    bd808> but it is restarting a lot...
[17:22]  <    bd808> I wonder if there is something about the new default health checks that is causing problems there?
[17:24]  <    bd808> the output in $HOME/error.log doesn't mean much to me, but it also doesn't look like crash logs
[17:26]  <    bd808> JMarkOckerbloom: this is probably worth a Phabricator task. The `kubectl get pod` output shows 10 restarts in the last 138 minutes so something is up for sure. 
[17:27]  <    bd808> I have to run to an IRL meeting, but I can poke around a bit later in my day to see if I can spot anything in particular.

Event Timeline

My initial hunch was that the liveness probe was interacting badly with this lighttpd config:

$HOME/.lighttpd.conf
$HTTP["url"] =~ "^/cgi-bin" {
        cgi.assign = ( "" => "" )
}

I don't think that is obviously what is happening, however.
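
The kubectl events report the failure as a raw TCP dial to port 8000 ("dial tcp 192.168.127.48:8000: i/o timeout"), so the default probe is presumably a tcpSocket check rather than an HTTP GET. One hedged way to confirm what actually got configured (this assumes the Deployment created by webservice is named ftl, matching the pod name prefix):

$ kubectl get deployment ftl -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'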

I am now instead wondering about the very high frequency of 4xx status responses:

[Screenshot 2024-04-02 at 17.40.15.png: HTTP response status rate graph]

The memory usage graph for the tool looks suspiciously truncated at the default 512MiB memory limit, with a period that appears to align with the 4xx failure spikes:

[Screenshot 2024-04-02 at 17.44.10.png: memory usage graph]

The CPU usage graph shows a similarly clipped ceiling corresponding to the default 0.5 CPU limit.
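
To double check that these ceilings really are the container limits rather than a graphing artifact, the configured limits and the current usage could be compared directly. A hedged example (kubectl top depends on metrics-server being available in the cluster):

$ kubectl describe pod ftl-5646966575-rr2vc | grep -A 6 'Limits:'
$ kubectl top pod ftl-5646966575-rr2vc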

$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT                     MESSAGE
42m         Normal    Pulling     pod/ftl-5646966575-rr2vc   Pulling image "docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-web:latest"
42m         Normal    Created     pod/ftl-5646966575-rr2vc   Created container webservice
42m         Normal    Started     pod/ftl-5646966575-rr2vc   Started container webservice
31s         Warning   Unhealthy   pod/ftl-5646966575-rr2vc   Liveness probe failed: dial tcp 192.168.127.48:8000: i/o timeout
31s         Normal    Killing     pod/ftl-5646966575-rr2vc   Container webservice failed liveness probe, will be restarted
42m         Normal    Pulled      pod/ftl-5646966575-rr2vc   Successfully pulled image "docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-web:latest" in 96.636575ms
$ date
Tue 02 Apr 2024 11:29:03 PM UTC

When the pod is live enough at the point of shutdown to log things, it is logging a lot about active CGI processes being killed:

$ tail -100 error.log
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 2915 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 2933 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 2957 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 2969 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 2974 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 3001 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 3022 died with signal 9
2024-04-02 22:41:29: mod_cgi.c.892) CGI pid 3025 died with signal 9
2024-04-02 22:41:29: server.c.1976) server stopped by UID = 0 PID = 0
2024-04-02 22:46:21: server.c.1513) server started (lighttpd/1.4.59)
2024-04-02 22:48:07: server.c.1513) server started (lighttpd/1.4.59)
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 44 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 56 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 59 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 63 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 75 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 82 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 83 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 92 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 96 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 105 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 106 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 109 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 120 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 126 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 139 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 142 died with signal 9
2024-04-02 22:51:21: mod_cgi.c.892) CGI pid 148 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 113 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 116 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 119 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 122 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 124 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 127 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 132 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 136 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 141 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 143 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 145 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 146 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 149 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 150 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 151 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 152 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 153 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 154 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 155 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 156 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 159 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 160 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 161 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 162 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 163 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 164 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 166 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 167 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 168 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 169 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 170 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 171 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 172 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 173 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 174 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 175 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 176 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 177 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 178 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 179 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 181 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 182 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 183 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 184 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 185 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 187 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 188 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 189 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 190 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 193 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 194 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 195 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 196 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 199 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 203 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 206 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 208 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 209 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 210 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 212 died with signal 9
2024-04-02 22:56:21: mod_cgi.c.892) CGI pid 213 died with signal 9
2024-04-02 22:59:49: server.c.1513) server started (lighttpd/1.4.59)
2024-04-02 23:29:03: server.c.1513) server started (lighttpd/1.4.59)
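
A SIGKILL of the CGI children is consistent with either the pod's cgroup OOM killer firing or the kubelet tearing the container down after a failed liveness probe. The container's last termination state should say which; a hedged example ("OOMKilled" would point at the memory limit, "Error" at the probe restart path):

$ kubectl get pod ftl-5646966575-rr2vc -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'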

Partly just to see what happens, and partly in the hope that adding a few more resources will make a positive difference, I have created this service.template and restarted the service to pick up the new limits:

$HOME/service.template
backend: kubernetes
type: perl5.32
mem: 1G
cpu: 1
health-check-path: /index.html
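
For the new limits to take effect the webservice has to be restarted so that the template is re-read. As a hedged reminder, this is the standard command run as the tool account:

$ webservice restart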

Mentioned in SAL (#wikimedia-cloud) [2024-04-02T23:56:23Z] <wmbot~bd808@tools-sgebastion-10> Restarted webservice to pick up new service.template defined RAM and CPU limits. (T361652)

tools.ftl@tools-sgebastion-10:~$ kubectl get events
LAST SEEN   TYPE      REASON      OBJECT                     MESSAGE
57m         Normal    Pulling     pod/ftl-6fd89b956c-64txz   Pulling image "docker-registry.tools.wmflabs.org/toolforge-perl532-sssd-web:latest"
2m1s        Warning   Unhealthy   pod/ftl-6fd89b956c-64txz   Liveness probe failed: Get "http://192.168.127.46:8000/index.html": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
16m         Normal    Killing     pod/ftl-6fd89b956c-64txz   Container webservice failed liveness probe, will be restarted
tools.ftl@tools-sgebastion-10:~$ kubectl get po
NAME                   READY   STATUS             RESTARTS        AGE
ftl-6fd89b956c-64txz   0/1     CrashLoopBackOff   40 (2m4s ago)   17h
tools.ftl@tools-sgebastion-10:~$ tail error.log
Don't know what to do with 3639 in ??????????????อก
that spells as 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 3639, 3629, 3585
Wide character in print at /data/project/ftl/lib/OLBP/Entities.pm line 434.
Don't know what to do with 3629 in ???????????????ก
that spells as 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 3629, 3585
Don't know what to do with 3585 in ????????????????
that spells as 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 3585
2024-04-03 17:08:49: server.c.1976) server stopped by UID = 0 PID = 0
2024-04-03 17:09:06: server.c.1513) server started (lighttpd/1.4.59)
2024-04-03 17:13:06: server.c.1513) server started (lighttpd/1.4.59)
tools.ftl@tools-sgebastion-10:~$ kubectl get po
NAME                   READY   STATUS    RESTARTS        AGE
ftl-6fd89b956c-64txz   1/1     Running   41 (4m9s ago)   17h

Just throwing double the RAM and CPU at the problem did not make it go away.

The CPU utilization graph actually looks better now, but memory is still hitting the hard limit. I am a bit confused about where all that RAM is going. The app has a few 50-70MiB data files that it might read to handle a particular request, but ultimately it looks like that is most of what it does: make a few static data file lookups and issue an HTTP redirect to the appropriate library catalog system. The tool does receive spiky traffic with peaks of ~20 requests per second, but these seem like fairly boring numbers?
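
One guess about where the RAM goes: if each CGI child slurps one of those 50-70MiB data files into a Perl hash per request, the in-memory structure can be several times the on-disk size, and a burst of concurrent children multiplies that quickly. A purely hypothetical sketch of a lookup that streams the file instead of slurping it (the tab-separated format and the function name are my assumptions, not the tool's actual code):

#!/usr/bin/perl
# Hypothetical: find one key in a large "key<TAB>value" data file without
# loading the whole file into memory.
use strict;
use warnings;

sub lookup_streaming {
    my ($file, $want) = @_;
    open my $fh, '<', $file or die "open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        if (defined $value && $key eq $want) {
            close $fh;
            return $value;
        }
    }
    close $fh;
    return undef;    # not found
}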

Mentioned in SAL (#wikimedia-cloud) [2024-04-03T17:42:22Z] <wmbot~bd808@tools-sgebastion-10> Restarted webservice to pick up new service.template defined 2G RAM limit. (T361652)

Zooming out to look at the last 30 days of HTTP status data, I'm now wondering if the problem is simply one of scaling this Perl CGI design to the current traffic. It looks like traffic to the tool has taken off pretty wildly in the last 10 days or so, and most of that traffic growth has generated 4xx errors.

[Screenshot 2024-04-03 at 14.49.32.png: 30-day HTTP status graph]

Because we can see other artifacts of resource limits kicking in on related graphs, my first thought is to find a more performant runtime than forked CGI processes launched by lighttpd. I don't have any specific dislike of Perl or lighttpd, but I do know that this combination, with CGI in the middle, has some of the worst per-request startup times of anything running as a Toolforge webservice.

In our pre-built PHP containers we set up a fastcgi.server config block using /usr/bin/php-cgi as the FastCGI handler. This helps even out, or outright eliminate, the per-request startup cost of the PHP binary. There are a couple of different Perl FastCGI recipes shown at https://redmine.lighttpd.net/projects/lighttpd/wiki/ApplicationsUsingLighttpd that use CGI::Fast to create a long-running dispatch script for Perl requests. It might be worth trying out one or both of these ideas to see if request handling can be made to scale a little better.
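
For the record, a rough sketch of what the CGI::Fast approach could look like, loosely following the recipes on that wiki page. The paths, socket name, and dispatch script are hypothetical, and it assumes the FCGI/CGI::Fast modules are available in the perl5.32 image or installed into the tool's own lib:

$HOME/.lighttpd.conf (hypothetical addition)
server.modules += ( "mod_fastcgi" )
fastcgi.server = ( "/cgi-bin/" => ((
        "bin-path"    => "/data/project/ftl/cgi-bin/dispatch.fcgi",
        "socket"      => "/tmp/ftl-fcgi.sock",
        "max-procs"   => 2,
        "check-local" => "disable",
)))

/data/project/ftl/cgi-bin/dispatch.fcgi (hypothetical)
#!/usr/bin/perl
# Long-running FastCGI worker: perl starts and compiles once, then each
# request is handled inside the loop instead of forking a fresh interpreter.
use strict;
use warnings;
use CGI::Fast;

while (my $q = CGI::Fast->new) {
    # A real version would dispatch into the existing ftl/OLBP code here;
    # this placeholder just issues a redirect like the tool normally does.
    print $q->redirect('https://example.org/catalog');
}

Run this way, lighttpd keeps a small pool of persistent Perl workers instead of forking a new interpreter (plus script compile) for every request, which should also make the pod's memory and CPU usage far more predictable.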