Summary
- Install debugging packages: apt-get -y install php7.4-common-dbgsym php7.4-cli-dbgsym
- curl -o ~/php-gdbinit https://raw.githubusercontent.com/php/php-src/php-7.4.30/.gdbinit
- gdb <your php command>
- Enter run then once the command has failed: bt and zbacktrace
The Beta-Cluster-Infrastructure is a farm of wikis we use for experimentation and integration testing. It is updated continuously: new code is every ten minutes and the databases every hour by running MediaWiki maintenance/update.php. The scheduling and running are driven by Jenkins jobs which statuses can be seen on the Beta view:
On top of that, Jenkins will emit notification messages to IRC as long as one of the update job fails. One of them started failing on July 25th and this is how I was seeing it the alarm (times are for France, UTC+2):
(wmf-insecte is the Jenkins bot, insecte is french for bug (animals), and the wmf- prefix identifies it as a Wikimedia Foundation robot).
Clicking on the link gives the output of the update script which eventually fails with:
+ /usr/local/bin/mwscript update.php --wiki=wikifunctionswiki --quick --skip-config-validation 20:31:09 ...wikilambda_zlanguages table already exists. 20:31:09 ...have wlzl_label_primary field in wikilambda_zobject_labels table. 20:31:09 ...have wlzl_return_type field in wikilambda_zobject_labels table. 20:31:09 /usr/local/bin/mwscript: line 27: 1822 Segmentation fault sudo -u "$MEDIAWIKI_WEB_USER" $PHP "$MEDIAWIKI_DEPLOYMENT_DIR_DIR_USE/multiversion/MWScript.php" "$@"
The important bit is Segmentation fault which indicates the program (php) had a fatal fault and it got rightfully killed by the Linux Kernel. Looking at the instance Linux Kernel messages via dmesg -T:
[Mon Jul 24 23:33:55 2023] php[28392]: segfault at 7ffe374f5db8 ip 00007f8dc59fc807 sp 00007ffe374f5da0 error 6 in libpcre2-8.so.0.7.1[7f8dc59b9000+5d000] [Mon Jul 24 23:33:55 2023] Code: ff ff 31 ed e9 74 fb ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 41 54 55 48 89 d5 53 44 89 c3 48 81 ec 98 52 00 00 <48> 89 7c 24 18 4c 8b a4 24 d0 52 00 00 48 89 74 24 10 48 89 4c 24 [Mon Jul 24 23:33:55 2023] Core dump to |/usr/lib/systemd/systemd-coredump 28392 33 33 11 1690242166 0 php pipe failed
With those data, I had enough to the most urgent step: file a task (T342769) which can be used as an audit trail and reference for the future. It is the single most important step I am doing whenever I am debugging an issue, since if I have to stop due to time constraint or lack of technical abilities, others can step in and continue. It also provides an historical record that can be looked up in the future, and indeed this specific problem already got investigated and fully documented a couple years ago. Having a task is the most important thing one must do whenever debugging, it is invaluable. For PHP segmentation fault, we even have a dedicated project php-segfault
With the task filed, I have continued the investigation. The previous successful build had:
19:30:18 ...have wlzl_label_primary field in wikilambda_zobject_labels table. 19:30:18 ...have wlzl_return_type field in wikilambda_zobject_labels table. 19:30:18 ❌ Unable to make a page for Z7138: The provided content's label clashes with Object 'Z10138' for the label in 'Z1002'. 19:30:18 ❌ Unable to make a page for Z7139: The provided content's label clashes with Object 'Z10139' for the label in 'Z1002'. 19:30:18 ❌ Unable to make a page for Z7140: The provided content's label clashes with Object 'Z10140' for the label in 'Z1002'. 19:30:18 ...site_stats is populated...done.
The successful build started at 19:20 UTC and the failing one finished at 20:30 UTC which gives us a short time window to investigate. Since the failure seems to happen after updating the WikiLambda MediaWiki extension, I went to inspect the few commits that got merged at that time. I took advantage of Gerrit adding review actions as git notes, notably the exact time a change got submitted and subsequently merged. The process:
Clone the suspect repository:
git clone https://gerrit.wikimedia.org/r/extensions/WikiLambda cd WikiLambda
Fetch the Gerrit review notes:
git fetch origin refs/notes/review:refs/notes/review
The review notes can be shown below the commit by passing --notes=review to git log or git show, an example for the current HEAD of the repository:
$ git show -q --notes=review commit c7f8071647a1aeb2cef6b9310ccbf3a87af2755b (HEAD -> master, origin/master, origin/HEAD) Author: Genoveva Galarza <ggalarzaheredero@wikimedia.org> Date: Thu Jul 27 00:34:03 2023 +0200 Initialize blank function when redirecting to FunctionEditor from DefaultView Bug: T342802 Change-Id: I09d3400db21983ac3176a0bc325dcfe2ddf23238 Notes (review): Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org> Verified+2: jenkins-bot Code-Review+2: Jforrester <jforrester@wikimedia.org> Submitted-by: jenkins-bot Submitted-at: Wed, 26 Jul 2023 22:47:59 +0000 Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/942026 Project: mediawiki/extensions/WikiLambda Branch: refs/heads/master
Which shows this change has been approved by Jforrester and entered the repository on Wed, 26 Jul 2023 22:47:59 UTC. Then to find the commits in that range, I ask git log to list:
- anything that has a commit date for the day (it is not necessarily correct but in this case it is a good enough approximation)
- from oldest to newest
- sorted by topology order (aka in the order the commit entered the repository rather than based on the commit date)
- show the review notes to get the Submitted-at field
I can then scroll to the commits having a Submitted-at in the time window of 19:20 UTC - 20:30 UTC. I have amended the below output to remove most of the review notes except for the first commit:
$ git log --oneline --since=2023/07/25 --reverse --notes=review --no-merges --topo-order <scroll> 653ea81a Handle oldid url param to view a particular revision Notes (review): Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org> Verified+2: jenkins-bot Code-Review+2: Jforrester <jforrester@wikimedia.org> Submitted-by: jenkins-bot Submitted-at: Tue, 25 Jul 2023 19:26:53 +0000 Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941482 Project: mediawiki/extensions/WikiLambda Branch: refs/heads/master fe4b0446 AUTHORS: Update for July 2023 Notes (review): Submitted-at: Tue, 25 Jul 2023 19:49:43 +0000 Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941507 73fcb4a4 Update function-schemata sub-module to HEAD (1c01f22) Notes (review): Submitted-at: Tue, 25 Jul 2023 19:59:23 +0000 Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941384 598f5fcc PageRenderingHandler: Don't make 'read' selected if we're on the edit tab Notes (review): Submitted-at: Tue, 25 Jul 2023 20:16:05 +0000 Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941456
Or in a Phabricator task and human friendly way:
- 19:26 Handle oldid url param to view a particular revision
- 19:43 AUTHORS: Update for July 2023
- 19:59 Update function-schemata sub-module to HEAD (1c01f22) for T335583
- 20:16 PageRenderingHandler: Don't make 'read' selected if we're on the edit tab
The Update function-schemata sub-module to HEAD (1c01f22) has a short log of changes it introduces:
- New changes:
- abc4aa6 definitions: Add Z1908/bug-bugi and Z1909/bug-lant ZNaturalLanguages
- 0f1941e definitions: Add Z1910/piu ZNaturalLanguage
- 1c01f22 definitions: Re-label all objects to drop the 'Z' per Amin
Since the update script fail on WikiLambda I have reached out to its developers so they can investigate their code and maybe find what can trigger the issue.
On the PHP side we need a trace. That can be done by configuring the Linux Kernel to take a dump of the program before terminating it and having it stored on disk, it did not quite work due to a configuration issue on the machine and in the first attempt we forgot to run the command by asking bash to allow the dump generation (ulimit -c unlimited). From a past debugging session, I went to run the command directly under the GNU debugger: gdb.
There are a few preliminary step to debug the PHP program, at first one needs to install the debug symbols which lets the debugger map the binary entries to lines of the original source code. Since error mentions libpcre2 I also installed its debugging symbols:
$ sudo apt-get -y install php7.4-common-dbgsym php7.4-cli-dbgsym libpcre2-dbg
I then used gdb to start a debugging session:
sudo -s -u www-data gdb --args /usr/bin/php /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=wikifunctionswiki --quick --skip-config-validation gdb>
Then ask gdb to start the program by entering in the input prompt: run ⏎. After several minutes, it caught the segmentation fault:
gdb> run <output> <output freeze for several minutes while update.php is doing something> Thread 1 "php" received signal SIGSEGV, Segmentation fault. 0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, start_offset=start_offset@entry=0, options=0, match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870) at src/pcre2_match.c:6001 6001 src/pcre2_match.c: No such file or directory.
I could not find a debugging symbol package containing src/pcre2_match.c but that was not needed afterall.
To retrieve the stacktrace enter to the gdb prompt bt ⏎:
gdb> bt #0 0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, start_offset=start_offset@entry=0, options=0, match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870) at src/pcre2_match.c:6001 #1 0x00005555556a3b24 in php_pcre_match_impl (pce=0x7fffe83685a0, subject_str=0x7fffcb410a80, return_value=0x7fffcb44b220, subpats=0x0, global=0, use_flags=<optimized out>, flags=0, start_offset=0) at ./ext/pcre/php_pcre.c:1300 #2 0x00005555556a493b in php_do_pcre_match (execute_data=0x7fffcb44b710, return_value=0x7fffcb44b220, global=0) at ./ext/pcre/php_pcre.c:1149 #3 0x00007ffff216a3cb in tideways_xhprof_execute_internal () from /usr/lib/php/20190902/tideways_xhprof.so #4 0x000055555587ddee in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1732 #5 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539 #6 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #7 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714 #8 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539 #9 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #10 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714 #11 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539 #12 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #13 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714 #14 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539 #15 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #16 0x000055555587c63c in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1602 #17 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53535 #18 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #19 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714 #20 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539 #21 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #22 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714 #23 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539 #24 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so #25 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.Quit CONTINUING
Which is not that helpful. Thankfully the PHP project provides a set of macro for gdb which lets one map the low level C code to the PHP code that was expected. It is provided in their source repository /.gdbinit and one should use the version from the PHP branch being debugged, since we use php 7.4 I went to use the version from the latest 7.4 series (7.4.30 at the time of this writing): https://raw.githubusercontent.com/php/php-src/php-7.4.30/.gdbinit
Download the file to your home directory (ex: /home/hashar/gdbinit) and ask gdb to import it with, for example, source /home/hashar/gdbinit ⏎:
(gdb) source /home/hashar/gdbinit
This provides a few new commands to show PHP Zend values and to generate a very helpfull stacktrace (zbacktrace):
(gdb) zbacktrace [0x7fffcb44b710] preg_match("\7^Z[1-9]\d*$\7u", "Z1002") [internal function] [0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219 [0x7fffcb44a760] Opis\JsonSchema\Validator->validateProperties(reference, reference, array(0)[0x7fffcb44a7d0], array(7)[0x7fffcb44a7e0], object[0x7fffcb44a7f0], object[0x7fffcb44a800], object[0x7fffcb44a810], NULL) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:943 [0x7fffcb44a4c0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb44a530], array(7)[0x7fffcb44a540], object[0x7fffcb44a550], object[0x7fffcb44a560], object[0x7fffcb44a570]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:519 [0x7fffcb44a310] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb44a380], array(7)[0x7fffcb44a390], object[0x7fffcb44a3a0], object[0x7fffcb44a3b0], object[0x7fffcb44a3c0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332 [0x7fffcb449350] Opis\JsonSchema\Validator->validateConditionals(reference, reference, array(0)[0x7fffcb4493c0], array(7)[0x7fffcb4493d0], object[0x7fffcb4493e0], object[0x7fffcb4493f0], object[0x7fffcb449400]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:703 [0x7fffcb4490b0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb449120], array(7)[0x7fffcb449130], object[0x7fffcb449140], object[0x7fffcb449150], object[0x7fffcb449160]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:523 [0x7fffcb448f00] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb448f70], array(7)[0x7fffcb448f80], object[0x7fffcb448f90], object[0x7fffcb448fa0], object[0x7fffcb448fb0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332 <loop>
The stacktrace shows the code entered an infinite loop while validating a Json schema up to a point it is being stopped.
The arguments can be further inspected by using printz and giving it as argument an object reference. For the line:
For [0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219
(gdb) printzv 0x7fffcb44ac10 [0x7fffcb44ac10] (refcount=2) array: Hash(0)[0x5555559d7f00]: { }
(gdb) printzv 0x7fffcb44ac20 [0x7fffcb44ac20] (refcount=21) array: Packed(7)[0x7fffcb486118]: { [0] 0 => [0x7fffcb445748] (refcount=17) string: Z2K2 [1] 1 => [0x7fffcb445768] (refcount=18) string: Z4K2 [2] 2 => [0x7fffcb445788] long: 1 [3] 3 => [0x7fffcb4457a8] (refcount=15) string: Z3K3 [4] 4 => [0x7fffcb4457c8] (refcount=10) string: Z12K1 [5] 5 => [0x7fffcb4457e8] long: 1 [6] 6 => [0x7fffcb445808] (refcount=6) string: Z11K1 }
(gdb) printzv 0x7fffcb44ac30 [0x7fffcb44ac30] (refcount=22) object(Opis\JsonSchema\Schema) #485450 { id => [0x7fffcb40f508] (refcount=3) string: /Z6# draft => [0x7fffcb40f518] (refcount=1) string: 07 internal => [0x7fffcb40f528] (refcount=1) reference: [0x7fffcb6704e8] (refcount=1) array: Hash(1)[0x7fffcb4110e0]: { [0] "/Z6#" => [0x7fffcb71d280] (refcount=1) object(stdClass) #480576 }
(gdb) printzv 0x7fffcb44ac40 [0x7fffcb44ac40] (refcount=5) object(stdClass) #483827 Properties Hash(1)[0x7fffcb6aa2a0]: { [0] "pattern" => [0x7fffcb67e3c0] (refcount=1) string: ^Z[1-9]\d*$ }
(gdb) printzv 0x7fffcb44ac50 [0x7fffcb44ac50] (refcount=5) object(Opis\JsonSchema\ValidationResult) #486348 { maxErrors => [0x7fffcb4393e8] long: 1 errors => [0x7fffcb4393f8] (refcount=2) array: Hash(0)[0x5555559d7f00]: { }
Extracting the parameters was enough for WikiLambda developers to find the immediate root cause, they have removed some definitions which triggered the infinite loop and manually ran a script to reload the data in the Database. Eventually the Jenkins job managed to update the wiki database:
16:30:26 <wmf-insecte> Project beta-update-databases-eqiad build #69029: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/69029/
One problem solved!
References:
- T342769: Beta update.php fails due to PHP segfault in libpcre2-8.so.0.7.1
- A similar debugging session I had to do for update.php back in 2021 {T296539#7531235}
- 2006 notes by @tstarling from whom I have learned about gdb and the PHP gdb config: https://wikitech.wikimedia.org/wiki/GDB_with_PHP
- Projects
- Subscribers
- Mervekaya1992, xSavitar, tstarling
- Tokens