$ kubectl get pod -o wide --all-namespaces | awk '!/Running|Completed/ {n=split($8, host, "."); print $4, host[1], $2, $1}' STATUS NODE NAME NAMESPACE CrashLoopBackOff tools-k8s-worker-29 blubber-5b5bd7879-jq42t tool-blubber CrashLoopBackOff tools-k8s-worker-6 catscan2-c74c8975c-sdbnh tool-catscan2 Error tools-k8s-worker-29 cobot-6d7747d79d-s5t52 tool-cobot CrashLoopBackOff tools-k8s-worker-19 corhist-85c78bd645-zx4nv tool-corhist CrashLoopBackOff tools-k8s-worker-32 dtz-69df77765b-gzjzb tool-dtz CrashLoopBackOff tools-k8s-worker-41 farhangestan-7cf6f84ddf-s7r4s tool-farhangestan CrashLoopBackOff tools-k8s-worker-36 hroest2-df4c7c7f9-ccpzm tool-hroest2 CrashLoopBackOff tools-k8s-worker-51 p4teleirc-769f54c9d7-m9l9v tool-lziad-cvnbot CrashLoopBackOff tools-k8s-worker-46 p4wikibot-78c95c7c9f-gswn5 tool-lziad CrashLoopBackOff tools-k8s-worker-11 machtsinn-f4cd5d866-tb785 tool-machtsinn CrashLoopBackOff tools-k8s-worker-3 matthobot-755f8bdb85-dc5z5 tool-matthobot CrashLoopBackOff tools-k8s-worker-33 mw2sparql-765fd5b496-8wgc6 tool-mw2sparql CrashLoopBackOff tools-k8s-worker-6 quick-intersection-7bf484cb65-5ztdm tool-quick-intersection CrashLoopBackOff tools-k8s-worker-42 shextranslator-68b876b485-tr6pz tool-shextranslator CrashLoopBackOff tools-k8s-worker-53 wdumps-57d78b94fc-tchqg tool-wdumps CrashLoopBackOff tools-k8s-worker-46 webarchivebot-backend-784564bb-7n8hl tool-webarchivebot CrashLoopBackOff tools-k8s-worker-50 wikicite-dashboard-86f78c8ff6-tw87w tool-wikicite-dashboard
Description
Related Objects
- Mentioned In
- T246563: 'webarchivebot' tool's custom webarchivebot-backend crashing because of code that expects /usr/sbin/sendmail
T246562: 'wdumps' tool does not contain the expected www/python/src/app.py entrypoint script
T246560: "corhist" tool having problems running on 2020 Kubernetes cluster - Mentioned Here
- T246552: Develop periodic health report for Kubernetes cluster objects
T246563: 'webarchivebot' tool's custom webarchivebot-backend crashing because of code that expects /usr/sbin/sendmail
T246562: 'wdumps' tool does not contain the expected www/python/src/app.py entrypoint script
Event Timeline
Blubber was broken because of the bad logic for extra args handling in webservice 0.63. Fixing required a hard restart of the service. webservice restart did not trigger pulling the latest Docker container with the fixes in webservice 0.64.
$ kubectl logs blubber-5b5bd7879-jq42t usage: webservice-runner [-h] [--type {tomcat,nodejs,generic,lighttpd,lighttpd-plain,uwsgi-plain,uwsgi-python}] [--port PORT] [--register-proxy] [extra_args [extra_args ...]] webservice-runner: error: unrecognized arguments: --extra_args $ webservice stop DEPRECATED: 'golang' type is deprecated. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types. Stopping webservice $ webservice --backend=kubernetes golang start -- /data/project/blubber/start DEPRECATED: 'golang' type is deprecated. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types. Starting webservice............... $ kubectl get po NAME READY STATUS RESTARTS AGE blubber-6fdc657567-v5xj4 1/1 Running 0 51s
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:14:25Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 (T246559)
Catscan2 is crashing because of it's $HOME/.lighttpd.conf contents.
$ kubectl logs catscan2-c74c8975c-sdbnh Duplicate array-key '.php' 2020-03-01 05:25:05: (configfile.c.1289) source: /var/run/lighttpd/catscan2 line: 647 pos: 1 parser failed somehow near here: (EOL)
# .lighttpd.conf server.modules +=("mod_auth","mod_status") status.status-url = "/catscan2/server-status" status.statistics-url = "/catscan2/server-statistics" server.max-keep-alive-requests = 64 # deny access for baidu, Yahoo, yandex, tweetmemebot etc. $HTTP["useragent"] =~ "(bingbot|Slurp|Baiduspider|Yahoo|yandex\.com/bots|TweetmemeBot|ProoXiBot|SeznamBot|AhrefsBot|Googlebot|YandexBot)" { url.access-deny = ( "" ) } url.redirect = ( "^.*catscan2.php(.*)" => "https://petscan.wmflabs.org/$1" ) $HTTP["querystring"] =~ "(basecat|commons.http)" { url.access-deny = ( "" ) } # 2020-02-29 bd808: causing lighttpd startup crashes (T246559) #fastcgi.server += ( ".php" => # (( # "bin-path" => "/usr/bin/php-cgi", # "socket" => "/tmp/php.socket.catscan2", # "max-procs" => 2, # "bin-environment" => ( # "PHP_FCGI_CHILDREN" => "3", # "PHP_FCGI_MAX_REQUESTS" => "500" # ), # "bin-copy-environment" => ( # "PATH", "SHELL", "USER" # ), # "broken-scriptfilename" => "enable", # "allow-x-send-file" => "enable" # )) #)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:37:30Z] <wm-bot> <root> Editied /data/project/catscan2/.lighttpd.conf to remove fast-cgi config section that was preventing lighttpd from starting. (T246559)
Cobot has a python 2.7 venv, but is trying to run on the python image which is actually python 3.4.
$ tail uwsgi.log Fatal Python error: Py_Initialize: Unable to get the locale encoding ImportError: No module named 'encodings' !!! uWSGI process 1 got Segmentation Fault !!! *** backtrace of 1 *** /usr/bin/uwsgi(uwsgi_backtrace+0x30) [0x4635f0] /usr/bin/uwsgi(uwsgi_segfault+0x21) [0x4639b1] /lib/x86_64-linux-gnu/libc.so.6(+0x350e0) [0x7f7cb04370e0] /lib/x86_64-linux-gnu/libc.so.6(abort+0x232) [0x7f7cb0438532] /usr/lib/x86_64-linux-gnu/libpython3.4m.so.1.0(Py_FatalError+0x50) [0x7f7caec6cbe0] /usr/lib/x86_64-linux-gnu/libpython3.4m.so.1.0(+0x14462b) [0x7f7caec6d62b] /usr/lib/x86_64-linux-gnu/libpython3.4m.so.1.0(_Py_InitializeEx_Private+0x23a) [0x7f7caec6dd9a] /usr/lib/uwsgi/plugins/python3_plugin.so(uwsgi_python_init+0x101) [0x7f7caf37b2a1] /usr/bin/uwsgi(uwsgi_start+0x51e) [0x464bee] /usr/bin/uwsgi(uwsgi_setup+0x1073) [0x466c33] /usr/bin/uwsgi(main+0x9) [0x416999] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f7cb0423b45] /usr/bin/uwsgi() [0x4169c7] *** end of backtrace ***
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:43:07Z] <wm-bot> <root> Restarted as a python2 webservice to match the existing www/python/venv. (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T05:46:15Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)
This is the corhist tool. I tried:
$ webservice --backend=kubernetes jdk8 start -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -jar corhist-0.1-SNAPSHOT.jar DEPRECATED: 'jdk8' type is deprecated. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types. Starting webservice...............
This is apparently still not working properly with webservice 0.64 webservice-runner inside the container:
$ kubectl logs corhist-ffbc9d659-s5nmv usage: webservice-runner [-h] [--type {tomcat,nodejs,generic,lighttpd,lighttpd-plain,uwsgi-plain,uwsgi-python}] [--port PORT] [--register-proxy] [extra_args [extra_args ...]] webservice-runner: error: unrecognized arguments: -jar corhist-0.1-SNAPSHOT.jar
I'm pretty sure the problem is in the way that webservice-runner receives the args. As a workaround I have made a tiny shell script to pass to webservice-runner instead:
#!/bin/sh exec /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -jar corhist-0.1-SNAPSHOT.jar
$ webservice --backend=kubernetes jdk8 start -- $HOME/run-inside-container.sh DEPRECATED: 'jdk8' type is deprecated. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types. Starting webservice...
After doing this, the container and its inner jvm starts, but it eventually crashes due to an OOM:
$ kubectl logs -f corhist-6b5df75499-hbbwp 2020-03-01T05:55:53.278+0000 SEVERE could not reopen database org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space at org.hsqldb.error.Error.error(Unknown Source) at org.hsqldb.Database.reopen(Unknown Source) at org.hsqldb.Database.open(Unknown Source) at org.hsqldb.DatabaseManager.getDatabase(Unknown Source) at org.hsqldb.DatabaseManager.newSession(Unknown Source) at org.hsqldb.jdbc.JDBCConnection.<init>(Unknown Source) at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source) at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:247) at org.wikidata.history.game.ViolationDatabase.<init>(ViolationDatabase.java:31) at org.wikidata.history.game.Main.main(Main.java:25) Caused by: java.lang.OutOfMemoryError: Java heap space Exception in thread "main" java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) at org.hsqldb.jdbc.JDBCConnection.<init>(Unknown Source) at org.hsqldb.jdbc.JDBCDriver.getConnection(Unknown Source) at org.hsqldb.jdbc.JDBCDriver.connect(Unknown Source) at java.sql.DriverManager.getConnection(DriverManager.java:664) at java.sql.DriverManager.getConnection(DriverManager.java:247) at org.wikidata.history.game.ViolationDatabase.<init>(ViolationDatabase.java:31) at org.wikidata.history.game.Main.main(Main.java:25) Caused by: org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space at org.hsqldb.error.Error.error(Unknown Source) at org.hsqldb.Database.reopen(Unknown Source) at org.hsqldb.Database.open(Unknown Source) at org.hsqldb.DatabaseManager.getDatabase(Unknown Source) at org.hsqldb.DatabaseManager.newSession(Unknown Source) ... 7 more Caused by: java.lang.OutOfMemoryError: Java heap space
The container is getting small default limits for cpu and ram. Let's try raising them up quite a bit:
$ webservice --backend=kubernetes --mem 2560Mi --cpu 500m jdk8 start -- $HOME/run-inside-container.sh
This gets things a lot closer to running, but there are other issues with the tool. I will open a separate bug for the maintainer to investigate.
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:15:36Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:18:27Z] <wm-bot> <root> Restarted webservice as 'python' type (py3.4) rather than 'python3.5' to match existing www/python/venv (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:23:14Z] <wm-bot> <root> Tool stuck in CrashLoopBackOff because python 2.7 venv did not contain the flask library. (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:26:48Z] <wm-bot> <root> Restarted with python3.5 runtime to match existing www/python/venv version (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T06:28:34Z] <wm-bot> <root> Stopped python webservice because there are no python application files in www/python/src. (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:32:00Z] <wm-bot> <root> Hard restart of webservice to fix startup crash caused by a webservice v0.63 bug (T246559)
tools.mw2sparql@tools-sgebastion-08:~$ webservice --backend=kubernetes --mem 2560Mi --cpu 500m jdk8 start -- /usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java -Xmx2G -Xms2G -Djbo.maxpoolsize=0 -Dlogback.configurationFile=logback.xml -jar mwontop-0.2-SNAPSHOT.jar DEPRECATED: 'jdk8' type is deprecated. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types. Starting webservice...
This runs into the same webservice-runner argument parsing bug that we saw in T246559#5930151. I used the same basic fix:
$ webservice --backend=kubernetes --mem 2560Mi --cpu 500m jdk8 start -- /data/project/mw2sparql/run-inside-container.sh
$ kubectl logs quick-intersection-7bf484cb65-5ztdm Duplicate array-key '.php' 2020-03-01 07:30:17: (configfile.c.1289) source: /var/run/lighttpd/quick-intersection line: 649 pos: 1 parser failed somehow near here: (EOL)
Same problem as T246559#5930126.
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:35:59Z] <wm-bot> <root> Edited /data/project/quick-intersection/.lighttpd.conf to remove fast-cgi config section that was preventing lighttpd from starting. (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:39:35Z] <wm-bot> <root> Restarted with python3.7 runtime to match existing www/python/venv version (T246559)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T07:51:04Z] <wm-bot> <root> Stopped tool because it is missing the required www/python/src/app.py entry point. This is causing uwsgi to crash. (T246559, T246562)
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T08:03:17Z] <wm-bot> <root> Deleted webarchivebot-backend deployment because code is not running properly on the 2020 Kubernetes cluster. (T246559, T246563)
$ kubectl logs wikicite-dashboard-86f78c8ff6-tw87w Duplicate config variable in conditional 0 global: server.errorlog 2020-03-01 08:04:17: (configfile.c.1154) source: /var/run/lighttpd/wikicite-dashboard line: 611 pos: 19 parser failed somehow near here: (EOL)
# 2020-03-01 bd808: causing lighttpd startup crashes (T246559) #accesslog.filename = "{home}/logs/access.log" #server.errorlog = "{home}/logs/error.log" #server.breakagelog = "{home}/logs/error.log" url.rewrite-once = ( ".*\.(js|css|png|map)" => "$0", "^/wikicite-dashboard(/.*)" => "/wikicite-dashboard/index.php$1" )
Mentioned in SAL (#wikimedia-cloud) [2020-03-01T08:10:14Z] <wm-bot> <root> Edited /data/project/wikicite-dashboard/.lighttpd.conf to remove fast-cgi config section that was preventing lighttpd from starting. (T246559)
At least for the moment all Pods are in a healthy state!
$ kubectl get pod -o wide --all-namespaces | awk '!/Running|Completed/ {n=split($8, host, "."); print $4, host[1], $2, $1}' STATUS NODE NAME NAMESPACE
Filed T246552: Develop periodic health report for Kubernetes cluster objects as a reminder to create some tooling to keep things in this state more often than not.