
Failure of "kubectl get pods" for the editgroups project
Closed, ResolvedPublic

Description

When running kubectl get pods from the editgroups project on the bastion, I get the following:

tools.editgroups@tools-sgebastion-07:~$ kubectl get pods
runtime: failed to create new OS thread (have 12 already; errno=11)
fatal error: newosproc

runtime stack:
runtime.throw(0x2689df0, 0x9)
	/usr/local/go/src/runtime/panic.go:547 +0x90
runtime.newosproc(0xc8204a8800, 0xc8204b7fc0)
	/usr/local/go/src/runtime/os1_linux.go:149 +0x18c
runtime.newm(0x2aa1df0, 0x0)
	/usr/local/go/src/runtime/proc.go:1516 +0x135
runtime.startTheWorldWithSema()
	/usr/local/go/src/runtime/proc.go:1041 +0x206
runtime.systemstack(0xc820001840)
	/usr/local/go/src/runtime/asm_amd64.s:291 +0x79
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1051

goroutine 12 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:245 fp=0xc82002cd10 sp=0xc82002cd08
runtime.gcMarkTermination()
	/usr/local/go/src/runtime/mgc.go:1260 +0x3a9 fp=0xc82002cf20 sp=0xc82002cd10
runtime.gcMarkDone()
	/usr/local/go/src/runtime/mgc.go:1146 +0x29e fp=0xc82002cf40 sp=0xc82002cf20
runtime.gcBgMarkWorker(0xc82001b500)
	/usr/local/go/src/runtime/mgc.go:1480 +0x488 fp=0xc82002cfb8 sp=0xc82002cf40
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc82002cfc0 sp=0xc82002cfb8
created by runtime.gcBgMarkStartWorkers
	/usr/local/go/src/runtime/mgc.go:1330 +0x92

goroutine 1 [runnable, locked to thread]:
regexp.(*bitState).reset(0xc8202afea0, 0x20, 0x2)
	/usr/local/go/src/regexp/backtrack.go:87 +0x101
regexp.(*machine).backtrack(0xc820362ea0, 0x7f138f260758, 0xc820362f78, 0x0, 0x20, 0x2, 0x0)
	/usr/local/go/src/regexp/backtrack.go:324 +0x106
regexp.(*Regexp).doExecute(0xc8201225a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc820247700, 0x20, 0x0, 0x2, ...)
	/usr/local/go/src/regexp/exec.go:449 +0x423
regexp.(*Regexp).replaceAll(0xc8201225a0, 0x0, 0x0, 0x0, 0xc820247700, 0x20, 0x2, 0xc820697d20, 0x0, 0x0, ...)
	/usr/local/go/src/regexp/regexp.go:501 +0x12d
regexp.(*Regexp).ReplaceAllString(0xc8201225a0, 0xc820247700, 0x20, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/regexp/regexp.go:460 +0x133
k8s.io/kubernetes/vendor/github.com/renstrom/dedent.Dedent(0xc820247700, 0x20, 0x0, 0x0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/renstrom/dedent/dedent.go:46 +0x239
k8s.io/kubernetes/pkg/kubectl/cmd.init()
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubectl/cmd/explain.go:43 +0xe60
k8s.io/kubernetes/cmd/kubectl/app.init()
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubectl/app/kubectl.go:39 +0x45
main.init()
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubectl/kubectl.go:30 +0x40

goroutine 9 [chan receive]:
k8s.io/kubernetes/vendor/github.com/golang/glog.(*loggingT).flushDaemon(0x3556da0)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:879 +0x67
created by k8s.io/kubernetes/vendor/github.com/golang/glog.init.1
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/golang/glog/glog.go:410 +0x297

goroutine 35 [syscall]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:116 +0x132
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
	/usr/local/go/src/os/signal/signal_unix.go:28 +0x37

The issue seems specific to this project: running the same command from other projects I have access to triggers no error.

Event Timeline

tools.editgroups@tools-sgebastion-07:~$ ps -U tools.editgroups
  PID TTY          TIME CMD
16795 pts/36   00:00:00 bash
27082 pts/45   00:00:00 bash
27119 pts/45   00:00:00 ps
31505 pts/36   00:00:00 webservice
31507 pts/36   00:00:10 kubectl

The two pts/45 processes are mine from running the check. The other three belong to pts/36:

$ ps -aef | grep pts/36
root     16789 28335  0 Mar23 pts/36   00:00:00 /usr/bin/sudo -niu tools.editgroups
tools.e+ 16795 16789  0 Mar23 pts/36   00:00:00 -bash
tools.e+ 27159 27082  0 02:27 pts/45   00:00:00 grep pts/36
pintoch  28335 28334  0 Mar22 pts/36   00:00:00 /bin/bash
tools.e+ 31505 16795  0 Mar23 pts/36   00:00:00 /usr/bin/python -Es /usr/local/bin/webservice --backend=kubernetes python shell
tools.e+ 31507 31505  0 Mar23 pts/36   00:00:10 /usr/bin/kubectl attach --tty --stdin interactive

There is a webservice shell process running as the tool:

pintoch  28334     1  0 Mar22 ?        00:00:03 SCREEN -S inspect
pintoch  28335 28334  0 Mar22 pts/36   00:00:00  \_ /bin/bash
root     16789 28335  0 Mar23 pts/36   00:00:00      \_ /usr/bin/sudo -niu tools.editgroups
tools.e+ 16795 16789  0 Mar23 pts/36   00:00:00          \_ -bash
tools.e+ 31505 16795  0 Mar23 pts/36   00:00:00              \_ /usr/bin/python -Es /usr/local/bin/webservice --backend=kubernetes python shell
tools.e+ 31507 31505  0 Mar23 pts/36   00:00:10                  \_ /usr/bin/kubectl attach --tty --stdin interactive

The next part is not obvious. For several years we have had cgroup restrictions on the Toolforge bastions to try and keep them responsive for use by everyone. On the old Trusty bastions we used something called cgred to apply limits after processes were launched. The new Debian Stretch bastions (login.tools.wmflabs.org, stretch-dev.tools.wmflabs.org) are able to use systemd to enforce these limits instead, which does a more effective job than the old system.

The current limits are:

[Slice]
# each user can use max this % of one CPU
CPUQuota=30%
# each user can run max this number of tasks/threads
TasksMax=100
# slow down procs if they use more than this memory
MemoryHigh=512M
# if more than this memory is used, OOM killer will step in
MemoryMax=1024M
# users can't use swap memory
MemorySwapMax=0
# do accounting but don't limit anything by now
IOAccounting=yes
IPAccounting=yes
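
For anyone double-checking, the values systemd actually applied for a given login user can be read back from that user's slice unit. A minimal sketch, assuming the slice name matches the uid that shows up in the cgroup paths further down:

$ # slice name assumed from uid 10394 seen below; substitute your own uid
$ systemctl show -p TasksMax -p MemoryMax -p MemoryHigh user-10394.slice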

In this case there are not enough resources left available in the cgroup to spawn the threads desired by the kubectl command (runtime: failed to create new OS thread (have 12 already; errno=11)).

$ sudo cat /proc/31507/cgroup
10:cpuset:/
9:devices:/user.slice
8:blkio:/user.slice/user-10394.slice
7:memory:/user.slice/user-10394.slice
6:freezer:/
5:net_cls,net_prio:/
4:perf_event:/
3:cpu,cpuacct:/user.slice/user-10394.slice
2:pids:/user.slice/user-10394.slice/session-22384.scope
1:name=systemd:/user.slice/user-10394.slice/session-22384.scope
0::/user.slice/user-10394.slice/session-22384.scope
$ cat /sys/fs/cgroup/pids/user.slice/user-10394.slice/pids.current
25
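
The configured ceiling for the same controller lives right next to that counter, so the two can be compared directly. Sketch only (paths copied from the listing above; this assumes the TasksMax setting lands on the slice itself rather than on the session scope):

$ cat /sys/fs/cgroup/pids/user.slice/user-10394.slice/pids.max
$ cat /sys/fs/cgroup/pids/user.slice/user-10394.slice/pids.current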

The currently resident processes seem to be consuming 25 of the available 100 tasks. Running kubectl get pods under gdb seems to show it wanting to create 12 threads, which is far less than the theoretically remaining quota, yet I can still reproduce the original crash when running as the tools.editgroups user, so I think I'm missing something in how the actual usage is counted.

I was about to give up and tell you "sorry, do not try to run so many things at once", but a bit of searching led me to a trick that can be used to limit the parallel threads of a Go binary, [[https://golang.org/pkg/runtime/|GOMAXPROCS]]. With this I can actually get the desired command to run to completion:

$ GOMAXPROCS=1 kubectl get pods
NAME                                      READY     STATUS    RESTARTS   AGE
editgroups-2937528593-1oxnz               1/1       Running   0          1d
editgroups.celery.sh-3671823902-za5bk     1/1       Running   0          2d
editgroups.listener.sh-1891604770-n2b3y   1/1       Running   0          8h
interactive                               1/1       Running   0          1d
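
If the prefix gets tedious, the same workaround can be exported once per shell instead of being repeated on every invocation. A minimal sketch, not something I have tested beyond the one-off run above:

$ # cap the Go runtime's parallel worker threads for everything in this shell
$ export GOMAXPROCS=1
$ kubectl get pods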

I think there is still something I'm not getting right about following the cgroup accounting though.

Wow, thanks for the very thorough analysis! I cannot reproduce this anymore. Thanks for the GOMAXPROCS trick! I will add it to the docs.

Pintoch claimed this task.

Just confirming that the bug has occurred today again and the proposed fix worked perfectly. Thanks again!

Running into this, GOMAXPROCS=1 worked at first, but I kept needing to adjust it higher and higher (GOMAXPROCS=2, etc.), and eventually no more dice.

So yes, please someone fix the cgroup accounting! :)

For the record, another resource control setting involved is /etc/security/limits.conf. There we have nproc = 100. It is controlled by the hiera key profile::toolforge::bastion::nproc.

@bd808 I propose we raise this limit in the -dev bastion and see if that makes any difference.
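
For illustration only, an entry of that shape in /etc/security/limits.conf looks roughly like the following; the exact domain/match rule puppet writes on the bastions may differ, but the number is what the hiera key controls:

# illustrative limits.conf entry; the real file is puppet-managed
*    soft    nproc    100
*    hard    nproc    100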

Sure, it seems reasonable to experiment. Limits are set to keep things from getting out of control however, so if we end up setting them so that nobody ever hits them we may not be actually limiting anything.

> Running into this, GOMAXPROCS=1 worked at first, but I kept needing to adjust it higher and higher (GOMAXPROCS=2, etc.), and eventually no more dice.

I'm not understanding this feedback. @mahmoud are you reporting that GOMAXPROCS=1 kubectl ... causes kubectl to crash under some circumstances?

Ah sorry, to be clear, using GOMAXPROCS appeared to fix the kubectl crash I was getting, but only at first. Eventually, kubectl would crash no matter what env vars or ulimit I set.

I came back a few hours later and whatever component had been having the problem was recycled, and kubectl worked fine.

Still getting this fairly regularly. Anything we can do on our end? Not sure I have reproduction steps, except this happens about half the times I try to shell into montage.

This "bug" is actually a feature protecting us a bit from runaway resource usage on the bastions. It may be a slightly flawed feature in that the accounting seems to be off a bit in absolute numbers, but it is a designed feature. In my long analysis (T219070#5051445) I gave the GOMAXPROCS=1 hint as a work around for folks who are having problems. That however is just a work around and one I introduced in a sentence that started 'I was about to give up and tell you "sorry, do not try to run so many things at once"'.

@mahmoud What processes do you start when you log into the bastion? Are you running screen or tmux there? The cgroup limit is on the initial ssh connection to the bastion not on the tool account. If you are hitting this a lot I think it is likely related to your workflow.
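
One quick way to see where a given shell is actually being accounted, since the scope and slice follow the ssh login user rather than the tool you become (sketch; run it from inside the shell in question):

$ # the user-<uid>.slice shown here is the login user's, even after become
$ cat /proc/self/cgroup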

I understand completely. It's a shared resource and such a system makes a lot of sense. That's one reason why I avoid customization in such environments. I'm using a bog-standard workflow of

$ ssh mahmoud@login.tools.wmflabs.org
$ become montage
$ webservice python2 shell

My .bashrc is the same one that came with my account. Its modify date appears to be 2013. Maybe it has some wonky old stuff in it? It looks standard at first blush.

On the off chance that it's not based on the session, here are other processes allocated to my user:

$ ps aux | grep mahmoud
root     18870  0.0  0.0 138788  7660 ?        SNs  06:06   0:00 sshd: mahmoud [priv]
mahmoud  18881  0.1  0.0  71340  7200 ?        SNs  06:06   0:00 /lib/systemd/systemd --user
mahmoud  18882  0.0  0.0 286160  3192 ?        SN   06:06   0:00 (sd-pam)
mahmoud  18909  0.0  0.0 138788  4880 ?        SN   06:06   0:00 sshd: mahmoud@pts/13
mahmoud  18913  0.2  0.0  44936  5336 pts/13   SNs  06:06   0:00 -bash

As you can see, my session (pts/13) doesn't have a lot going on.

Are my tools' web processes counted against me? I do run a few, but not nearly as many as MusikAnimal and some others :)

Mentioned in SAL (#wikimedia-cloud) [2020-03-17T13:29:03Z] <arturo> set profile::toolforge::bastion::nproc: 200 for tools-sgebastion-08 (T219070)

aborrero claimed this task.

Please @mahmoud try using the -dev bastion, I just made a change and I would like to see if that makes any difference before doing it in the main bastion:

$ ssh mahmoud@stretch-dev.tools.wmflabs.org
$ become montage
$ kubectl get pods
$ webservice python2 shell
[.. etc ..]

Please report back on what you experience.

Cool. I'll try this for a few days and let you know if I see the issue.

Ah, it worked the first few days, but I just ran into this:

$ ssh mahmoud@stretch-dev.tools.wmflabs.org
Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Debian GNU/Linux 9.7 (stretch)
tools-sgebastion-08 is a Toolforge bastion (role::wmcs::toolforge::bastion)
======================================================================
_______  _____   _____         _______  _____   ______  ______ _______
   |    |     | |     | |      |______ |     | |_____/ |  ____ |______
   |    |_____| |_____| |_____ |       |_____| |    \_ |_____| |______
======================================================================
This is a server of the tools Cloud VPS project, the home of community
managed bots, webservices, and tools supporting the Wikimedia movement.

Use of this system is subject to the Toolforge Terms of Use,
Code of Conduct, and Privacy Policies:
- https://tools.wmflabs.org/?Rules

General guidance and help can be found at:
- https://tools.wmflabs.org/?Help

The last Puppet run was at Tue Mar 24 15:39:30 UTC 2020 (3 minutes ago). 
Last puppet commit: (484d1db564) Elukey - Restore CDH settings for Hadoop Test
Last login: Tue Mar 24 15:42:47 2020 from c-98-210-250-109.hsd1.ca.comcast.net
mahmoud@tools-sgebastion-08:~$ become montage-dev
(venv)tools.montage-dev@tools-sgebastion-08:~$ kubectl get pods
runtime: failed to create new OS thread (have 39 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x191b57f, 0x9)
	/usr/local/go/src/runtime/panic.go:617 +0x72
... (a ton of stack traces follow) ...

Did the config changes get reset? Or is it just not perfected yet?

I did this test:

(venv)tools.montage-dev@tools-sgebastion-08:~$ while true ; do kubectl get pods ; done
NAME                          READY   STATUS    RESTARTS   AGE
montage-dev-64cb8466c-dqhq9   1/1     Running   0          2d21h
NAME                          READY   STATUS    RESTARTS   AGE
montage-dev-64cb8466c-dqhq9   1/1     Running   0          2d21h
[...]

and left it working for some time, and none of the calls failed.

I wonder if you may have had some additional background procs running when this failed. Next time, can you run this check so we can see whether there are additional procs?

(venv)tools.montage-dev@tools-sgebastion-08:~$ pstree tools.montage-dev -p
bash(11269)─┬─bash(17289)───kubectl(17490)─┬─{kubectl}(17491)
            │                              ├─{kubectl}(17492)
            │                              ├─{kubectl}(17493)
            │                              ├─{kubectl}(17494)
            │                              ├─{kubectl}(17495)
            │                              ├─{kubectl}(17496)
            │                              ├─{kubectl}(17497)
            │                              ├─{kubectl}(17498)
            │                              ├─{kubectl}(17499)
            │                              └─{kubectl}(17500)
            └─pstree(17501)

I can't reproduce the issue, and nobody else has raised this complaint, so I suspect something else is going on unnoticed somewhere.

I just reproduced the issue by doing webservice python2 shell in one session and while true ; do kubectl get pods ; done in the other. I don't have pstree on tools-sgebastion-07, so maybe other things are different?

Bam! You are running two interactive shells in parallel against the same quota. Both shells are running kubectl (which is not immediately obvious, because one of them is hidden inside the webservice shell implementation). Go loves its co-routines, so this ends up eating all of your quota and begging for more.

One workaround you could do is opening sessions to 2 different bastions (login.toolforge.org + dev.toolforge.org) and splitting your work across them.
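
If you want to watch the quota being shared while both shells are open, something like the following from the login session should show it (illustrative commands; the slice name is derived from the login uid, and the [k] pattern only keeps grep from matching itself):

$ # the kubectl that webservice shell starts under the hood
$ ps -U tools.montage-dev -o pid,tty,cmd | grep '[k]ubectl'
$ # task count charged to the single per-user slice both shells share
$ systemctl status "user-$(id -u).slice" | grep -i tasks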

Interesting that they count against the same quota; I thought accounting was per-session. But even so, I'd expect the limit to at least reliably allow for one shell plus one kubectl get pods. And it's still kinda spotty; I wonder why it's nondeterministic.

Change 587715 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: bastion: bump nproc limit to 250

https://gerrit.wikimedia.org/r/587715

Change 587715 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: bastion: bump nproc limit to 250

https://gerrit.wikimedia.org/r/587715

Being a bit forward here today, I just bumped the nproc limit to 250 (from the previous values of 100 on the general bastion and 200 on the -dev one).
50 new procs per user should be more than enough to allow a couple of shell sessions opened in parallel while still honoring the shared/limited nature of our bastion servers.

@mahmoud if this is not enough, please let us know, but anyway I would recommend the approach suggested by @bd808 of using two different bastions for parallel operations.
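
Anyone still hitting this can check which per-process limit their login session actually picked up; if the limits.conf value is what applies, the bastions should now report 250 (quick sketch, worth including in any follow-up report):

$ # per-user process limit as seen by the current session
$ ulimit -u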