✩ Help my CI job fails with exit status -11

Help my CI job fails with exit status -11

Figure out a segmentation fault happening in a Docker container

For a few weeks, a CI job had PHPUnit tests abruptly ending with:

returned non-zero exit status -11

The connoisseur [ 1 ] would have recognized that the negative exit status indicates the process exited due to a signal. On Linux, 11 is the value for the SIGSEGV signal, which is usually sent by the kernel to the process as a result of an improper machine instruction. The default behavior is to terminate the process (man 7 signal) and to generate a core dump file (I will come to that later).

But why? Some PHP code ended up triggering a code path in HHVM that would eventually try to read outside of its memory range, or some similar low level fault. The kernel knows that the process completely misbehaved and thus, well, terminates it. Problem solved, you never want your program to misbehave when the kernel is in charge.

The job had recently been switched to use a new container in order to benefit from more recent lib and to match the OS distributions used by the Wikimedia production system. My immediate recommendation was to rollback to the previous known state, but eventually I have let the task to go on and have been absorbed by other tasks (such as updating MediaWiki on the infrastructure).

Last week, the job suddenly began to fail constantly. We prevent code from being merged when a test fails, and thus the code stays in a quarantine zone (Gerrit) and cannot be shipped. A whole team could not ship code (the Language-Team ) for one of their flagship projects (ContentTranslation .) That in turn prevents end users from benefiting from new features they are eager for. The issue had to be acted on and became an unbreak now! kind of task. And I went to my journey.

returned non-zero exit status -11, that is a good enough error message. A process in a Docker container is really just an isolated process and is still managed by the host kernel. First thing I did was to look at the kernel syslog facility on our instances, which yields:

kernel: [7943146.540511] php[14610]:
  segfault at 7f1b16ffad13 ip 00007f1b64787c5e sp 00007f1b53d19d30
     error 4 in libpthread-2.24.so[7f1b64780000+18000]

php there is just HHVM invoked via a php symbolic link. The message hints at libpthread which is where the fault is. But we need a stacktrace to better determine the problem, and ideally a reproduction case.

Thus, what I am really looking for is the core dump file I alluded to earlier. The file is generated by the kernel and contains an image of the process memory at the time of the failure. Given the full copy of the program instructions, the instructions it was running at that time, and all the memory segments, a debugger can reconstruct a human readable state of the failure. That is a backtrace, and is what we rely on to find faulty code and fix bugs.

The core file is not generated. Or the error message would state it had coredumped, i.e. the kernel generated the core dump file. Our default configuration is to not generate any core file, but usually one can adjust it from the shell with ulimit -c XXX where XXX is the maximum size a core file can occupy (in kilobytes, in order to prevent filling the disk). Docker being just a fancy way to start a process, it has a setting to adjust the limit. The docker run inline help states:

--ulimit ulimit Ulimit options (default [])

It is as far as useful as possible, eventually the option to set is: --ulimit core=2147483648 or up to 2 gigabytes. I have updated the CI jobs and instructed them to capture a file named core, the default file name. After a few runs, although I could confirm failures, no files got captured. Why not?

Our machines do not use core as the default filename. It can be found in the kernel configuration:

name=/proc/sys/kernel/core_pattern
/var/tmp/core/core.%h.%e.%p.%t

I thus went on the hosts looking for such files. There were none.

Or maybe I mean None or NaN.

Nada, rien.

The void.

The result is obvious, try to reproduce it! I ran a Docker container doing a basic while loop, from the host I have sent the SIGSEGV signal to the process. The host still had no core file. But surprise it was in the container. Although the kernel is handling it from the host, it is not namespace-aware when it comes time to resolve the path. My quest will soon end, I have simply mounted a host directory to the containers at the expected place:

mkdir /tmp/coredumps
docker run --volume /tmp/coredumps:/var/tmp/core ....

After a few builds, I had harvested enough core files. The investigation is then very straightforward:

$ gdb /usr/bin/hhvm /coredump/core.606eb29eab46.php.2353.1552570410
Core was generated by `php tests/phpunit/phpunit.php --debug-tests --testsuite extensions --exclude-gr'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, 
    start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
813	pthread_create.c: No such file or directory.
[Current thread is 1 (Thread 0x7f55614be3c0 (LWP 2354))]

(gdb) bt
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, 
    start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
#1  0x00007f556f461bb2 in timer_helper_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/timer_routines.c:120
#2  0x00007f557214a494 in start_thread (arg=0x7f55614be3c0) at pthread_create.c:456
#3  0x00007f556aeebacf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f55614be3c0, max=<optimized out>)
    at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#4  0x0000000000000000 in ?? ()

Which @Anomie kindly pointed out is an issue solved in libc6. Once the container has been rebuilt to apply the package update, the fault disappears.

One can now expect new changes to appear to ContentTranslation.

[ 1 ] ''connoisseur'', from obsolete French, means "to know" https://en.wiktionary.org/wiki/connoisseur . I guess the English language forgot to apply update on due time and can not make any such change for fear of breaking back compatibility or locution habits.

The task has all the technical details and log leading to solving the issue: T216689: Merge blocker: quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11)

(Some light copyedits to above -- Brennen Bearnes)

Production Excellence #10: April 2019

CI working group report, with recommendations of new tools to try

Written by hashar on Mar 21 2019, 9:52 AM.

Logistics

Projects

Subscribers

akosiaris, nshahquinn-wmf, Tobi_WMDE_SW, brennen

Tokens

"Like" token, awarded by abi_."Love" token, awarded by mmodell."Like" token, awarded by Mholloway."The World Burns" token, awarded by zeljkofilipin."Like" token, awarded by kostajh."Orange Medal" token, awarded by Krinkle.

Event Timeline

@hashar Thank you for the extended report! I found it interesting and entertaining.

'connoisseur'', from obsolete French, means "to know"...I guess the English language forgot to apply update on due time and can not make any such change for fear of breaking back compatibility or locution habits.

😂🤣😂

( ͡° ͜ʖ ͡°)

One minor correction

Although the kernel is handling it from the host, it is not namespace-aware when it comes time to resolve the path

It is namespace aware. Hence why the corefiles end up in the container mount namespace and not in /var/tmp/core/core on the host. Which is why the trick with the bind mount of the host directory is required.

Nice job in hunting that down!

We now have a project tag to track tasks related to PHP causing a segmentation fault php-segfault