Page MenuHomePhabricator

Investigate CrashLoopBackoff Pods on 2020 Kubernetes cluster
Closed, ResolvedPublic

Description

$ kubectl get pod -o wide --all-namespaces | awk '!/Running|Completed/ {n=split($8, host, "."); print $4, host[1], $2, $1}'
STATUS NODE NAME NAMESPACE
CrashLoopBackOff tools-k8s-worker-29 blubber-5b5bd7879-jq42t tool-blubber
CrashLoopBackOff tools-k8s-worker-6 catscan2-c74c8975c-sdbnh tool-catscan2
Error tools-k8s-worker-29 cobot-6d7747d79d-s5t52 tool-cobot
CrashLoopBackOff tools-k8s-worker-19 corhist-85c78bd645-zx4nv tool-corhist
CrashLoopBackOff tools-k8s-worker-32 dtz-69df77765b-gzjzb tool-dtz
CrashLoopBackOff tools-k8s-worker-41 farhangestan-7cf6f84ddf-s7r4s tool-farhangestan
CrashLoopBackOff tools-k8s-worker-36 hroest2-df4c7c7f9-ccpzm tool-hroest2
CrashLoopBackOff tools-k8s-worker-51 p4teleirc-769f54c9d7-m9l9v tool-lziad-cvnbot
CrashLoopBackOff tools-k8s-worker-46 p4wikibot-78c95c7c9f-gswn5 tool-lziad
CrashLoopBackOff tools-k8s-worker-11 machtsinn-f4cd5d866-tb785 tool-machtsinn
CrashLoopBackOff tools-k8s-worker-3 matthobot-755f8bdb85-dc5z5 tool-matthobot
CrashLoopBackOff tools-k8s-worker-33 mw2sparql-765fd5b496-8wgc6 tool-mw2sparql
CrashLoopBackOff tools-k8s-worker-6 quick-intersection-7bf484cb65-5ztdm tool-quick-intersection
CrashLoopBackOff tools-k8s-worker-42 shextranslator-68b876b485-tr6pz tool-shextranslator
CrashLoopBackOff tools-k8s-worker-53 wdumps-57d78b94fc-tchqg tool-wdumps
CrashLoopBackOff tools-k8s-worker-46 webarchivebot-backend-784564bb-7n8hl tool-webarchivebot
CrashLoopBackOff tools-k8s-worker-50 wikicite-dashboard-86f78c8ff6-tw87w tool-wikicite-dashboard

Event Timeline

bd808 created this task.Mar 1 2020, 1:23 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 1 2020, 1:23 AM
bd808 added a comment.Mar 1 2020, 5:13 AM

Blubber was broken because of the bad logic for extra args handling in webservice 0.63. Fixing required a hard restart of the service. webservice restart did not trigger pulling the latest Docker container with the fixes in webservice 0.64.

$ kubectl logs blubber-5b5bd7879-jq42t
usage: webservice-runner [-h]
                         [--type {tomcat,nodejs,generic,lighttpd,lighttpd-plain,uwsgi-plain,uwsgi-python}]
                         [--port PORT] [--register-proxy]
                         [extra_args [extra_args ...]]
webservice-runner: error: unrecognized arguments: --extra_args
$ webservice stop
DEPRECATED: 'golang' type is deprecated.
  See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
  for currently supported types.
Stopping webservice
$ webservice --backend=kubernetes golang start -- /data/project/blubber/start
DEPRECATED: 'golang' type is deprecated.
  See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
  for currently supported types.
Starting webservice...............
$ kubectl get po
NAME                       READY   STATUS    RESTARTS   AGE
blubber-6fdc657567-v5xj4   1/1     Running   0          51s

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:14:25Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 (T246559)

I've managed to fix both lziad and lziad-cvnbot. Sorry for any trouble caused.

bd808 added a comment.Mar 1 2020, 5:36 AM

Catscan2 is crashing because of it's $HOME/.lighttpd.conf contents.

$ kubectl logs catscan2-c74c8975c-sdbnh
Duplicate array-key '.php'
2020-03-01 05:25:05: (configfile.c.1289) source: /var/run/lighttpd/catscan2 line: 647 pos: 1 parser failed somehow near here: (EOL)
.lighttpd.conf
# .lighttpd.conf

server.modules +=("mod_auth","mod_status")
status.status-url = "/catscan2/server-status"
status.statistics-url = "/catscan2/server-statistics"

server.max-keep-alive-requests = 64

# deny access for baidu, Yahoo, yandex, tweetmemebot etc.
$HTTP["useragent"] =~ "(bingbot|Slurp|Baiduspider|Yahoo|yandex\.com/bots|TweetmemeBot|ProoXiBot|SeznamBot|AhrefsBot|Googlebot|YandexBot)" {
  url.access-deny = ( "" )
}

url.redirect = (
        "^.*catscan2.php(.*)" => "https://petscan.wmflabs.org/$1"
)

$HTTP["querystring"] =~ "(basecat|commons.http)" {
  url.access-deny = ( "" )
}

# 2020-02-29 bd808: causing lighttpd startup crashes (T246559)
#fastcgi.server += ( ".php" =>
#        ((
#                "bin-path" => "/usr/bin/php-cgi",
#                "socket" => "/tmp/php.socket.catscan2",
#                "max-procs" => 2,
#                "bin-environment" => (
#                        "PHP_FCGI_CHILDREN" => "3",
#                        "PHP_FCGI_MAX_REQUESTS" => "500"
#                ),
#                "bin-copy-environment" => (
#                        "PATH", "SHELL", "USER"
#                ),
#                "broken-scriptfilename" => "enable",
#                "allow-x-send-file" => "enable"
#         ))
#)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:37:30Z] <wm-bot> <root> Editied /data/project/catscan2/.lighttpd.conf to remove fast-cgi config section that was preventing lighttpd from starting. (T246559)

bd808 added a comment.Mar 1 2020, 5:41 AM

Cobot has a python 2.7 venv, but is trying to run on the python image which is actually python 3.4.

$ tail uwsgi.log
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'
!!! uWSGI process 1 got Segmentation Fault !!!
*** backtrace of 1 ***
/usr/bin/uwsgi(uwsgi_backtrace+0x30) [0x4635f0]
/usr/bin/uwsgi(uwsgi_segfault+0x21) [0x4639b1]
/lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7f7cb04370e0]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x232) [0x7f7cb0438532]
/usr/lib/x86_64-linux-gnu/libpython3.4m.so.1.0(Py_FatalError+0x50) [0x7f7caec6cbe0]
/usr/lib/x86_64-linux-gnu/libpython3.4m.so.1.0(+0x14462b) [0x7f7caec6d62b]
/usr/lib/x86_64-linux-gnu/libpython3.4m.so.1.0(_Py_InitializeEx_Private+0x23a) [0x7f7caec6dd9a]
/usr/lib/uwsgi/plugins/python3_plugin.so(uwsgi_python_init+0x101) [0x7f7caf37b2a1]
/usr/bin/uwsgi(uwsgi_start+0x51e) [0x464bee]
/usr/bin/uwsgi(uwsgi_setup+0x1073) [0x466c33]
/usr/bin/uwsgi(main+0x9) [0x416999]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f7cb0423b45]
/usr/bin/uwsgi() [0x4169c7]
*** end of backtrace ***

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:43:07Z] <wm-bot> <root> Restarted as a python2 webservice to match the existing www/python/venv. (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:46:15Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)

bd808 added a comment.Mar 1 2020, 5:48 AM

I've managed to fix both lziad and lziad-cvnbot. Sorry for any trouble caused.

No trouble at all. I'm glad you tracked the problem down!

bd808 added a comment.EditedMar 1 2020, 6:06 AM

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:46:15Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)

This is the corhist tool. I tried:

$ webservice --backend=kubernetes jdk8 start -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -jar corhist-0.1-SNAPSHOT.jar
DEPRECATED: 'jdk8' type is deprecated.
  See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
  for currently supported types.
Starting webservice...............

This is apparently still not working properly with webservice 0.64 webservice-runner inside the container:

$ kubectl logs corhist-ffbc9d659-s5nmv
usage: webservice-runner [-h]
                         [--type {tomcat,nodejs,generic,lighttpd,lighttpd-plain,uwsgi-plain,uwsgi-python}]
                         [--port PORT] [--register-proxy]
                         [extra_args [extra_args ...]]
webservice-runner: error: unrecognized arguments: -jar corhist-0.1-SNAPSHOT.jar

I'm pretty sure the problem is in the way that webservice-runner receives the args. As a workaround I have made a tiny shell script to pass to webservice-runner instead:

run-inside-container.sh
#!/bin/sh
exec /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -jar corhist-0.1-SNAPSHOT.jar
$ webservice --backend=kubernetes jdk8 start -- $HOME/run-inside-container.sh
DEPRECATED: 'jdk8' type is deprecated.
  See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
  for currently supported types.
Starting webservice...

After doing this, the container and its inner jvm starts, but it eventually crashes due to an OOM:

$ kubectl logs -f corhist-6b5df75499-hbbwp
2020-03-01T05:55:53.278+0000  SEVERE  could not reopen database
org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.Database.reopen(Unknown Source)
        at org.hsqldb.Database.open(Unknown Source)
        at org.hsqldb.DatabaseManager.getDatabase(Unknown Source)
        at org.hsqldb.DatabaseManager.newSession(Unknown Source)
        at org.hsqldb.jdbc.JDBCConnection.<init>(Unknown Source)
        at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source)
        at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at org.wikidata.history.game.ViolationDatabase.<init>(ViolationDatabase.java:31)
        at org.wikidata.history.game.Main.main(Main.java:25)
Caused by: java.lang.OutOfMemoryError: Java heap space

Exception in thread "main" java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCConnection.<init>(Unknown Source)
        at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source)
        at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source)
        at java.sql.DriverManager.getConnection(DriverManager.java:664)
        at java.sql.DriverManager.getConnection(DriverManager.java:247)
        at org.wikidata.history.game.ViolationDatabase.<init>(ViolationDatabase.java:31)
        at org.wikidata.history.game.Main.main(Main.java:25)
Caused by: org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.Database.reopen(Unknown Source)
        at org.hsqldb.Database.open(Unknown Source)
        at org.hsqldb.DatabaseManager.getDatabase(Unknown Source)
        at org.hsqldb.DatabaseManager.newSession(Unknown Source)
        ... 7 more
Caused by: java.lang.OutOfMemoryError: Java heap space

The container is getting small default limits for cpu and ram. Let's try raising them up quite a bit:

$ webservice --backend=kubernetes --mem 2560Mi --cpu 500m jdk8 start -- $HOME/run-inside-container.sh

This gets things a lot closer to running, but there are other issues with the tool. I will open a separate bug for the maintainer to investigate.

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:15:36Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:18:27Z] <wm-bot> <root> Restarted webservice as 'python' type (py3.4) rather than 'python3.5' to match existing www/python/venv (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:23:14Z] <wm-bot> <root> Tool stuck in CrashLoopBackOff because python 2.7 venv did not contain the flask library. (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:26:48Z] <wm-bot> <root> Restarted with python3.5 runtime to match existing www/python/venv version (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:28:34Z] <wm-bot> <root> Stopped python webservice because there are no python application files in www/python/src. (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:32:00Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)

bd808 added a comment.Mar 1 2020, 7:32 AM

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:32:00Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)

tools.mw2sparql@tools-sgebastion-08:~$ webservice --backend=kubernetes --mem 2560Mi --cpu 500m jdk8 start -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -Xmx2G -Xms2G -Djbo.maxpoolsize=0 -Dlogback.configurationFile=logback.xml -jar mwontop-0.2-SNAPSHOT.jar
DEPRECATED: 'jdk8' type is deprecated.
  See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes
  for currently supported types.
Starting webservice...

This runs into the same webservice-runner argument parsing bug that we saw in T246559#5930151. I used the same basic fix:

$ webservice --backend=kubernetes --mem 2560Mi --cpu 500m jdk8 start -- /data/project/mw2sparql/run-inside-container.sh
bd808 added a comment.Mar 1 2020, 7:35 AM
$ kubectl logs quick-intersection-7bf484cb65-5ztdm
Duplicate array-key '.php'
2020-03-01 07:30:17: (configfile.c.1289) source: /var/run/lighttpd/quick-intersection line: 649 pos: 1 parser failed somehow near here: (EOL)

Same problem as T246559#5930126.

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:35:59Z] <wm-bot> <root> Edited /data/project/quick-intersection/.lighttpd.conf to remove fast-cgi config section that was preventing lighttpd from starting. (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:39:35Z] <wm-bot> <root> Restarted with python3.7 runtime to match existing www/python/venv version (T246559)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:51:04Z] <wm-bot> <root> Stopped tool because it is missing the required www/python/src/app.py entry point. This is causing uwsgi to crash. (T246559, T246562)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T08:03:17Z] <wm-bot> <root> Deleted webarchivebot-backend deployment because code is not running properly on the 2020 Kubernetes cluster. (T246559, T246563)

bd808 added a comment.Mar 1 2020, 8:09 AM
$ kubectl logs wikicite-dashboard-86f78c8ff6-tw87w
Duplicate config variable in conditional 0 global: server.errorlog
2020-03-01 08:04:17: (configfile.c.1154) source: /var/run/lighttpd/wikicite-dashboard line: 611 pos: 19 parser failed somehow near here: (EOL)
.lighttpd.conf
# 2020-03-01 bd808: causing lighttpd startup crashes (T246559)
#accesslog.filename = "{home}/logs/access.log"
#server.errorlog = "{home}/logs/error.log"
#server.breakagelog = "{home}/logs/error.log"

url.rewrite-once = (
    ".*\.(js|css|png|map)" => "$0",
    "^/wikicite-dashboard(/.*)" => "/wikicite-dashboard/index.php$1"
)

Mentioned in SAL (#wikimedia-cloud) [2020-03-01T08:10:14Z] <wm-bot> <root> Edited /data/project/wikicite-dashboard/.lighttpd.conf to remove fast-cgi config section that was preventing lighttpd from starting. (T246559)

bd808 closed this task as Resolved.Mar 1 2020, 8:12 AM

At least for the moment all Pods are in a healthy state!

$ kubectl get pod -o wide --all-namespaces | awk '!/Running|Completed/ {n=split($8, host, "."); print $4, host[1], $2, $1}'
STATUS NODE NAME NAMESPACE

Filed T246552: Develop periodic health report for Kubernetes cluster objects as a reminder to create some tooling to keep things in this state more often than not.