Investigate a PHP segmentation fault
quote: "The C programming language is an interpreted language"

Summary


The Beta-Cluster-Infrastructure is a farm of wikis we use for experimentation and integration testing. It is updated continuously: new code is every ten minutes and the databases every hour by running MediaWiki maintenance/update.php. The scheduling and running are driven by Jenkins jobs which statuses can be seen on the Beta view:

image.png (192×928 px, 33 KB)

On top of that, Jenkins will emit notification messages to IRC as long as one of the update job fails. One of them started failing on July 25th and this is how I was seeing it the alarm (times are for France, UTC+2):

image.png (92×941 px, 37 KB)

(wmf-insecte is the Jenkins bot, insecte is french for bug (animals), and the wmf- prefix identifies it as a Wikimedia Foundation robot).

Clicking on the link gives the output of the update script which eventually fails with:

+ /usr/local/bin/mwscript update.php --wiki=wikifunctionswiki --quick --skip-config-validation
20:31:09 ...wikilambda_zlanguages table already exists.
20:31:09 ...have wlzl_label_primary field in wikilambda_zobject_labels table.
20:31:09 ...have wlzl_return_type field in wikilambda_zobject_labels table.
20:31:09 /usr/local/bin/mwscript: line 27:  1822 Segmentation fault      sudo -u "$MEDIAWIKI_WEB_USER" $PHP "$MEDIAWIKI_DEPLOYMENT_DIR_DIR_USE/multiversion/MWScript.php" "$@"

The important bit is Segmentation fault which indicates the program (php) had a fatal fault and it got rightfully killed by the Linux Kernel. Looking at the instance Linux Kernel messages via dmesg -T:

[Mon Jul 24 23:33:55 2023] php[28392]: segfault at 7ffe374f5db8 ip 00007f8dc59fc807 sp 00007ffe374f5da0 error 6 in libpcre2-8.so.0.7.1[7f8dc59b9000+5d000]
[Mon Jul 24 23:33:55 2023] Code: ff ff 31 ed e9 74 fb ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 41 54 55 48 89 d5 53 44 89 c3 48 81 ec 98 52 00 00 <48> 89 7c 24 18 4c 8b a4 24 d0 52 00 00 48 89 74 24 10 48 89 4c 24
[Mon Jul 24 23:33:55 2023] Core dump to |/usr/lib/systemd/systemd-coredump 28392 33 33 11 1690242166 0 php pipe failed

With those data, I had enough to the most urgent step: file a task (T342769) which can be used as an audit trail and reference for the future. It is the single most important step I am doing whenever I am debugging an issue, since if I have to stop due to time constraint or lack of technical abilities, others can step in and continue. It also provides an historical record that can be looked up in the future, and indeed this specific problem already got investigated and fully documented a couple years ago. Having a task is the most important thing one must do whenever debugging, it is invaluable. For PHP segmentation fault, we even have a dedicated project php-segfault

With the task filed, I have continued the investigation. The previous successful build had:

19:30:18 ...have wlzl_label_primary field in wikilambda_zobject_labels table.
19:30:18 ...have wlzl_return_type field in wikilambda_zobject_labels table.
19:30:18 	❌ Unable to make a page for Z7138: The provided content's label clashes with Object 'Z10138' for the label in 'Z1002'.
19:30:18 	❌ Unable to make a page for Z7139: The provided content's label clashes with Object 'Z10139' for the label in 'Z1002'.
19:30:18 	❌ Unable to make a page for Z7140: The provided content's label clashes with Object 'Z10140' for the label in 'Z1002'.
19:30:18 ...site_stats is populated...done.

The successful build started at 19:20 UTC and the failing one finished at 20:30 UTC which gives us a short time window to investigate. Since the failure seems to happen after updating the WikiLambda MediaWiki extension, I went to inspect the few commits that got merged at that time. I took advantage of Gerrit adding review actions as git notes, notably the exact time a change got submitted and subsequently merged. The process:

Clone the suspect repository:

git clone https://gerrit.wikimedia.org/r/extensions/WikiLambda
cd WikiLambda

Fetch the Gerrit review notes:

git fetch origin refs/notes/review:refs/notes/review

The review notes can be shown below the commit by passing --notes=review to git log or git show, an example for the current HEAD of the repository:

$ git show -q --notes=review
commit c7f8071647a1aeb2cef6b9310ccbf3a87af2755b (HEAD -> master, origin/master, origin/HEAD)
Author: Genoveva Galarza <ggalarzaheredero@wikimedia.org>
Date:   Thu Jul 27 00:34:03 2023 +0200

    Initialize blank function when redirecting to FunctionEditor from DefaultView
    
    Bug: T342802
    Change-Id: I09d3400db21983ac3176a0bc325dcfe2ddf23238

Notes (review):
    Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org>
    Verified+2: jenkins-bot
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Submitted-by: jenkins-bot
    Submitted-at: Wed, 26 Jul 2023 22:47:59 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/942026
    Project: mediawiki/extensions/WikiLambda
    Branch: refs/heads/master

Which shows this change has been approved by Jforrester and entered the repository on Wed, 26 Jul 2023 22:47:59 UTC. Then to find the commits in that range, I ask git log to list:

  • anything that has a commit date for the day (it is not necessarily correct but in this case it is a good enough approximation)
  • from oldest to newest
  • sorted by topology order (aka in the order the commit entered the repository rather than based on the commit date)
  • show the review notes to get the Submitted-at field

I can then scroll to the commits having a Submitted-at in the time window of 19:20 UTC - 20:30 UTC. I have amended the below output to remove most of the review notes except for the first commit:

$ git log --oneline --since=2023/07/25 --reverse --notes=review --no-merges --topo-order
<scroll>
653ea81a Handle oldid url param to view a particular revision
Notes (review):
    Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org>
    Verified+2: jenkins-bot
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Submitted-by: jenkins-bot
    Submitted-at: Tue, 25 Jul 2023 19:26:53 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941482
    Project: mediawiki/extensions/WikiLambda
    Branch: refs/heads/master

fe4b0446 AUTHORS: Update for July 2023
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 19:49:43 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941507

73fcb4a4 Update function-schemata sub-module to HEAD (1c01f22)
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 19:59:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941384

598f5fcc PageRenderingHandler: Don't make 'read' selected if we're on the edit tab
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 20:16:05 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941456

Or in a Phabricator task and human friendly way:

The Update function-schemata sub-module to HEAD (1c01f22) has a short log of changes it introduces:

  • New changes:
  • abc4aa6 definitions: Add Z1908/bug-bugi and Z1909/bug-lant ZNaturalLanguages
  • 0f1941e definitions: Add Z1910/piu ZNaturalLanguage
  • 1c01f22 definitions: Re-label all objects to drop the 'Z' per Amin

Since the update script fail on WikiLambda I have reached out to its developers so they can investigate their code and maybe find what can trigger the issue.

On the PHP side we need a trace. That can be done by configuring the Linux Kernel to take a dump of the program before terminating it and having it stored on disk, it did not quite work due to a configuration issue on the machine and in the first attempt we forgot to run the command by asking bash to allow the dump generation (ulimit -c unlimited). From a past debugging session, I went to run the command directly under the GNU debugger: gdb.

There are a few preliminary step to debug the PHP program, at first one needs to install the debug symbols which lets the debugger map the binary entries to lines of the original source code. Since error mentions libpcre2 I also installed its debugging symbols:

$ sudo apt-get -y install php7.4-common-dbgsym php7.4-cli-dbgsym libpcre2-dbg

I then used gdb to start a debugging session:

sudo  -s -u www-data gdb --args /usr/bin/php /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=wikifunctionswiki --quick --skip-config-validation
gdb>

Then ask gdb to start the program by entering in the input prompt: run . After several minutes, it caught the segmentation fault:

gdb> run
<output>
<output freeze for several minutes while update.php is doing something>

Thread 1 "php" received signal SIGSEGV, Segmentation fault.
0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, 
    subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, 
    start_offset=start_offset@entry=0, options=0, 
    match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870)
    at src/pcre2_match.c:6001
6001	src/pcre2_match.c: No such file or directory.

I could not find a debugging symbol package containing src/pcre2_match.c but that was not needed afterall.

To retrieve the stacktrace enter to the gdb prompt bt :

gdb> bt
#0  0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, 
    subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, 
    start_offset=start_offset@entry=0, options=0, 
    match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870)
    at src/pcre2_match.c:6001
#1  0x00005555556a3b24 in php_pcre_match_impl (pce=0x7fffe83685a0, 
    subject_str=0x7fffcb410a80, return_value=0x7fffcb44b220, subpats=0x0, global=0, 
    use_flags=<optimized out>, flags=0, start_offset=0) at ./ext/pcre/php_pcre.c:1300
#2  0x00005555556a493b in php_do_pcre_match (execute_data=0x7fffcb44b710, 
    return_value=0x7fffcb44b220, global=0) at ./ext/pcre/php_pcre.c:1149
#3  0x00007ffff216a3cb in tideways_xhprof_execute_internal ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#4  0x000055555587ddee in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1732
#5  execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#6  0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#7  0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#8  execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#9  0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#10 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#11 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#12 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#13 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#14 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#15 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#16 0x000055555587c63c in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1602
#17 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53535
#18 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#19 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#20 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#21 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#22 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#23 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#24 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#25 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
 at ./Zend/zend_vm_execute.Quit
CONTINUING

Which is not that helpful. Thankfully the PHP project provides a set of macro for gdb which lets one map the low level C code to the PHP code that was expected. It is provided in their source repository /.gdbinit and one should use the version from the PHP branch being debugged, since we use php 7.4 I went to use the version from the latest 7.4 series (7.4.30 at the time of this writing): https://raw.githubusercontent.com/php/php-src/php-7.4.30/.gdbinit

Download the file to your home directory (ex: /home/hashar/gdbinit) and ask gdb to import it with, for example, source /home/hashar/gdbinit :

(gdb) source /home/hashar/gdbinit

This provides a few new commands to show PHP Zend values and to generate a very helpfull stacktrace (zbacktrace):

(gdb) zbacktrace
[0x7fffcb44b710] preg_match("\7^Z[1-9]\d*$\7u", "Z1002") [internal function]
[0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219 
[0x7fffcb44a760] Opis\JsonSchema\Validator->validateProperties(reference, reference, array(0)[0x7fffcb44a7d0], array(7)[0x7fffcb44a7e0], object[0x7fffcb44a7f0], object[0x7fffcb44a800], object[0x7fffcb44a810], NULL) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:943 
[0x7fffcb44a4c0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb44a530], array(7)[0x7fffcb44a540], object[0x7fffcb44a550], object[0x7fffcb44a560], object[0x7fffcb44a570]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:519 
[0x7fffcb44a310] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb44a380], array(7)[0x7fffcb44a390], object[0x7fffcb44a3a0], object[0x7fffcb44a3b0], object[0x7fffcb44a3c0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332 
[0x7fffcb449350] Opis\JsonSchema\Validator->validateConditionals(reference, reference, array(0)[0x7fffcb4493c0], array(7)[0x7fffcb4493d0], object[0x7fffcb4493e0], object[0x7fffcb4493f0], object[0x7fffcb449400]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:703 
[0x7fffcb4490b0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb449120], array(7)[0x7fffcb449130], object[0x7fffcb449140], object[0x7fffcb449150], object[0x7fffcb449160]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:523 
[0x7fffcb448f00] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb448f70], array(7)[0x7fffcb448f80], object[0x7fffcb448f90], object[0x7fffcb448fa0], object[0x7fffcb448fb0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332 
<loop>

The stacktrace shows the code entered an infinite loop while validating a Json schema up to a point it is being stopped.

The arguments can be further inspected by using printz and giving it as argument an object reference. For the line:

For [0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219
(gdb) printzv 0x7fffcb44ac10
[0x7fffcb44ac10] (refcount=2) array:     Hash(0)[0x5555559d7f00]: {
}
(gdb) printzv 0x7fffcb44ac20
[0x7fffcb44ac20] (refcount=21) array:     Packed(7)[0x7fffcb486118]: {
      [0] 0 => [0x7fffcb445748] (refcount=17) string: Z2K2
      [1] 1 => [0x7fffcb445768] (refcount=18) string: Z4K2
      [2] 2 => [0x7fffcb445788] long: 1
      [3] 3 => [0x7fffcb4457a8] (refcount=15) string: Z3K3
      [4] 4 => [0x7fffcb4457c8] (refcount=10) string: Z12K1
      [5] 5 => [0x7fffcb4457e8] long: 1
      [6] 6 => [0x7fffcb445808] (refcount=6) string: Z11K1
}
(gdb) printzv 0x7fffcb44ac30
[0x7fffcb44ac30] (refcount=22) object(Opis\JsonSchema\Schema) #485450 {
id => [0x7fffcb40f508] (refcount=3) string: /Z6#
draft => [0x7fffcb40f518] (refcount=1) string: 07
internal => [0x7fffcb40f528] (refcount=1) reference: [0x7fffcb6704e8] (refcount=1) array:     Hash(1)[0x7fffcb4110e0]: {
      [0] "/Z6#" => [0x7fffcb71d280] (refcount=1) object(stdClass) #480576
}
(gdb) printzv 0x7fffcb44ac40
[0x7fffcb44ac40] (refcount=5) object(stdClass) #483827
Properties     Hash(1)[0x7fffcb6aa2a0]: {
      [0] "pattern" => [0x7fffcb67e3c0] (refcount=1) string: ^Z[1-9]\d*$
}
(gdb) printzv 0x7fffcb44ac50
[0x7fffcb44ac50] (refcount=5) object(Opis\JsonSchema\ValidationResult) #486348 {
maxErrors => [0x7fffcb4393e8] long: 1
errors => [0x7fffcb4393f8] (refcount=2) array:     Hash(0)[0x5555559d7f00]: {
}

Extracting the parameters was enough for WikiLambda developers to find the immediate root cause, they have removed some definitions which triggered the infinite loop and manually ran a script to reload the data in the Database. Eventually the Jenkins job managed to update the wiki database:

16:30:26 <wmf-insecte> Project beta-update-databases-eqiad build #69029: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/69029/

One problem solved!

References:

Written by hashar on Jul 28 2023, 12:06 PM.
Logistics
Projects
Subscribers
Mervekaya1992, xSavitar, tstarling
Tokens
"Love" token, awarded by Krinkle."Mountain of Wealth" token, awarded by Bawolff."Like" token, awarded by xSavitar."Like" token, awarded by thcipriani.

Event Timeline