Page MenuHomePhabricator

CI job mediawiki-quibble-composer-mysql-php80-docker on mediawiki/core gate-and-submit is flaky failing with Segmentation fault
Closed, ResolvedPublic

Description

Upstream issue https://github.com/php/php-src/issues/9323 (released with PHP 8.0.24)


The CI job mediawiki-quibble-composer-mysql-php80-docker sometimes fails with Segmentation fault and blocks gate-and-submit, but it also passed and allows to merge

Yesterday seen on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/820690 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/820811
Today on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/822708

It does not depends on the agent-docker to fail, because one fail on -1024 and one pass

I have not seen if the logs shows if the jobs are running on different servers, the console output looks very similiar (accept the hashes, because boths are different commits) and refers the same php --version: PHP 8.0.17 (cli) (built: Mar 20 2022 17:07:16) ( NTS ).

The job is also running as part of check php.

The failure occurs when running the database backed Phpunit tests:

php tests/phpunit/phpunit.php '-c' 'tests/phpunit/suite.xml' '--group' 'Database' '--exclude-group' 'Broken,ParserFuzz,Stub,Standalone'

The Docker image is docker-registry.wikimedia.org/releng/quibble-buster-php80:1.4.5-s1

The PHP 8.0 Debian packages come sury.org:

apt list php*
Listing... Done
php-common/now 2:92+0~20220117.43+debian10~1.gbpe0d14e all [installed,local]
php8.0-apcu/now 5.1.21+4.0.11-7+0~20220117.31+debian10~1.gbp9a887a amd64 [installed,local]
php8.0-bcmath/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-cli/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-common/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-curl/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-fpm/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-gd/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-gmp/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-igbinary/now 3.2.6+2.0.8-6+0~20220131.33+debian10~1.gbp1d540e amd64 [installed,local]
php8.0-intl/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-ldap/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-mbstring/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-memcached/now 3.1.5+2.2.0-14+0~20220117.37+debian10~1.gbpc8d81e amd64 [installed,local]
php8.0-msgpack/now 2.2.0~rc1+2.1.2+0.5.7-6+0~20220117.33+debian10~1.gbp304533 amd64 [installed,local]
php8.0-mysql/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-opcache/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-pcov/now 1.0.11-4+0~20220117.23+debian10~1.gbp34fff9 amd64 [installed,local]
php8.0-pgsql/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-readline/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-sqlite3/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-tidy/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-xdebug/now 3.1.2+2.9.8+2.8.1+2.5.5-1+0~20220117.45+debian10~1.gbp75d182 amd64 [installed,local]
php8.0-xml/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]
php8.0-zip/now 8.0.17-1+0~20220320.31+debian10~1.gbp34e045 amd64 [installed,local]

Event Timeline

I have also seen the segfault on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/816261, but assumed it was for the changed tests, not flaky.

I have added a few more details to the task. The Docker image has PHP 8.0.17, crawling through the PHP 8.0 changelog at https://www.php.net/ChangeLog-8.php#PHP_8_0 it mentions a few segfaults might be worth updating.

I think we should upgrade php 8.0 to the latest version made available by Sury and see whether it fixes it.

I have some notes about debugging segfault from a quibble image described extensively at T216689#5022211 and following comments. https://gerrit.wikimedia.org/r/c/integration/config/+/496392/ explains how to setup core dump capture:

docker run \
                --volume "$(pwd)"/coredump:/var/tmp/core \
                --ulimit core=2147483648 \
                ...

Use bash has an entry point and run as root, install gdb and the php debug packages (hopefully made available by Sury), run quibble and see what happens.

edit: turns out I wrote a blog post about it! Blog Post: Help my CI job fails with exit status -11

Change 824755 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] dockerfiles: [php80] Upgrade to PHP 8.0.22 and cascade

https://gerrit.wikimedia.org/r/824755

Change 824756 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] jjb: Bump all PHP80 jobs to use images with PHP 8.0.22

https://gerrit.wikimedia.org/r/824756

Change 824755 merged by jenkins-bot:

[integration/config@master] dockerfiles: [php80] Upgrade to PHP 8.0.22 and cascade

https://gerrit.wikimedia.org/r/824755

Mentioned in SAL (#wikimedia-releng) [2022-08-19T15:11:07Z] <James_F> Docker: Building and publishing images with PHP 8.0.22 for T315167

Be worth considering if this is something we might just see in CI, or whether it's something we need to recommend people to use PHP 8 >= 8.0.22...

Obviously being the latest point release, bumping the required PHP version (at least for PHP 8 support) could be fun, but might be a good idea

Change 824756 merged by jenkins-bot:

[integration/config@master] jjb: Bump all PHP80 jobs to use images with PHP 8.0.22

https://gerrit.wikimedia.org/r/824756

https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php80-docker/481/console re-built HEAD (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/822670) with 8.0.22 and was fine.

https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php80-docker/482/console was re-built against https://gerrit.wikimedia.org/r/c/mediawiki/core/+/816261 with 8.0.22 and was also fine.

Does this mean we can call this Resolved? The problem with race-condition/occasional segfaults is knowing when to declare them fixed…

Be worth considering if this is something we might just see in CI, or whether it's something we need to recommend people to use PHP 8 > 8.0.22...

Obviously being the latest point release, bumping the required PHP version (at least for PHP 8 support) could be fun, but might be a good idea

Yeah, I think we shouldn't bump unless it's the segfault is a very frequent thing sysadmins will see on every installation; 8.0.22 only came out 15 days ago, after all. FWICS it's only been seen occasionally in test suites?

Jdforrester-WMF claimed this task.

Theoretically fixed. Please re-open if I'm wrong!

Looks like I'm wrong; this is now happening relatively often. :-(

Grr. I guess I will try to reproduce it with the Quibble CI images based on the notes I have left last time at T315167#8156658. The twist are:

  • enable core dump and bind volume the core destination in the images
  • figure out the command line to reproduce
  • find the dbg packages + install gdb then analyze the core dump inside a container

Grr. I guess I will try to reproduce it with the Quibble CI images based on the notes I have left last time at T315167#8156658

Looking at mediawiki-quibble-composer-mysql-php80-docker job history, it looks like it now only segfaults on REL1_38 and not on master

I have triggered a build of quibble-integration job with mysql/composer against REL1_37, REL1_38 and master. Build 25 https://integration.wikimedia.org/ci/job/quibble-integration/25/

Result:

REL1_37OK
REL1_38segfaults
masterOK

Locally I have tried:

sudo rm -f ./src/LocalSettings.php
docker run \
  -v "$(pwd)/cache-nobody:/cache" \
  -v "$(pwd):/workspace" \
 -e ZUUL_URL=https://gerrit.wikimedia.org/r \
 -e ZUUL_PROJECT=mediawiki/core \
 -e ZUUL_BRANCH=REL1_38 \
 -e ZUUL_REF=REL1_38 \
 --entrypoint=quibble-with-supervisord \
docker-registry.wikimedia.org/releng/quibble-buster-php80:1.4.5-s2 \
 --packages-source=composer \
  --run=phpunit

It passed :-\

Looking at mediawiki-quibble-composer-mysql-php80-docker job history, it looks like it now only segfaults on REL1_38 and not on master

I have triggered a build of quibble-integration job with mysql/composer against REL1_37, REL1_38 and master. Build 25 https://integration.wikimedia.org/ci/job/quibble-integration/25/

Result:

REL1_37OK
REL1_38segfaults
masterOK

Locally I have tried:

sudo rm -f ./src/LocalSettings.php
docker run \
  -v "$(pwd)/cache-nobody:/cache" \
  -v "$(pwd):/workspace" \
 -e ZUUL_URL=https://gerrit.wikimedia.org/r \
 -e ZUUL_PROJECT=mediawiki/core \
 -e ZUUL_BRANCH=REL1_38 \
 -e ZUUL_REF=REL1_38 \
 --entrypoint=quibble-with-supervisord \
docker-registry.wikimedia.org/releng/quibble-buster-php80:1.4.5-s2 \
 --packages-source=composer \
  --run=phpunit

It passed :-\

Could be helpful to pass --debug for PHPUnit so we can see which test(s) it's failing on exactly.

We can find which test ran last by looking at the mw-debug-cli.log.gz file attached to each builds.

I have fetched 3 builds (563, 564 and 548) which all targeted REL1_38 and each failed on the same test:

[PHPUnit] Successful test ApiQueryUserContribsTest::testSorting with data set "All users, , reverse" (array(array('ApiQueryUserContribsTest A', 'ApiQueryUserContribsTest B', 'ApiQueryUserContribsTest C', '192.168.2.1', '192.168.2.2', '192.168.2.3', '192.168.2.4')), true, 18), completed in 0.816447 seconds
[PHPUnit] Start test ApiQueryUserContribsTest::testSorting with data set "User IDs, , reverse" (array(array('ApiQueryUserContribsTest A', 'ApiQueryUserContribsTest B', 'ApiQueryUserContribsTest C')), true, 9)

The one listed for this task (401, which targeted master)

[PHPUnit] Successful test ApiQueryWatchlistIntegrationTest::testIdsPropParameter, completed in 0.066828 seconds
[PHPUnit] Start test ApiQueryWatchlistIntegrationTest::testTitlePropParameter

I will look at enabling coredump collection on one of the Jenkins instance and run the build there.

Change 496392 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: Quibble jobs to capture core files if any

https://gerrit.wikimedia.org/r/496392

Mentioned in SAL (#wikimedia-releng) [2022-08-29T09:00:49Z] <hashar> Updated Jenkins job mediawiki-quibble-composer-mysql-php80-docker to capture core dumps using https://gerrit.wikimedia.org/r/c/integration/config/+/496392 # T315167

The core dump capture I have described at T315167#8156658 can be set using the JJB config https://gerrit.wikimedia.org/r/c/integration/config/+/496392 which I have rebased. I have updated the mediawiki-quibble-composer-mysql-php80-docker job. If there is a core it should show in the console something such as:

Found core file XXX ## bytes. Copying to /tmp on host
Copying XXX /host-tmp-dir

The file should then be on the Jenkins agent under /tmp.

I have triggered a build that got previously triggered by the quibble-integration job: https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php80-docker/575/console

Bah:

09:05:58 docker: Error response from daemon: manifest for docker-registry.wikimedia.org/wikimedia-bullseye:latest not found: manifest unknown: manifest unknown.

Rebuilding at https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php80-docker/576/console which is on integration-agent-docker-1035.integration.eqiad1.wikimedia.cloud

On integration-agent-docker-1035.integration.eqiad1.wikimedia.cloud, I have moved the core file to /home/hashar/T315167.

Inspecting it using:

sudo docker run --rm -it --entrypoint=bash --user=root -v "$(pwd)/T315167:/core:ro" docker-registry.wikimedia.org/releng/quibble-buster-php80:1.4.5-s2

Then in the container we need gdb and I install a few packages having debug symbols

apt update
apt -y install gdb libc-dbg php8.0-cli-dbgsym
curl -o /root/.gdbinit https://raw.githubusercontent.com/php/php-src/master/.gdbinit
gdb /usr/bin/php8.0 /core/core.b28fcda291fe.php.1871.1661764484
gdb
Reading symbols from /usr/bin/php8.0...Reading symbols from /usr/lib/debug/.build-id/7c/46e5ce8f7829ce34b865c33829e28f7e4154e9.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 1871]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `php tests/phpunit/phpunit.php --group Database --exclude-group Broken,ParserFuz'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055b6dd61810c in zend_call_function (fci=0x7fffdef690d0, fci_cache=0x7fffdef690b0)
    at ./Zend/zend_execute_API.c:693
693	./Zend/zend_execute_API.c: No such file or directory.


(gdb) bt
#0  0x000055b6dd61810c in zend_call_function (fci=0x7fffdef690d0, fci_cache=0x7fffdef690b0) at ./Zend/zend_execute_API.c:693
#1  0x000055b6dd618d55 in zend_call_known_function (fn=fn@entry=0x7f4ee8437b40, object=object@entry=0x7f4ed50228c0, called_scope=<optimized out>, retval_ptr=retval_ptr@entry=0x0, 
    param_count=param_count@entry=0, params=params@entry=0x0, named_params=0x0) at ./Zend/zend_execute_API.c:985
#2  0x000055b6dd6a9284 in zend_call_known_instance_method (params=0x0, param_count=0, retval_ptr=0x0, object=0x7f4ed50228c0, fn=0x7f4ee8437b40) at ./Zend/zend_API.h:596
#3  zend_call_known_instance_method_with_0_params (retval_ptr=0x0, object=0x7f4ed50228c0, fn=0x7f4ee8437b40) at ./Zend/zend_API.h:596
#4  zend_objects_destroy_object (object=0x7f4ed50228c0) at ./Zend/zend_objects.c:163
#5  0x000055b6dd69d552 in zend_gc_collect_cycles () at ./Zend/zend_gc.c:1549
#6  0x000055b6dd69bbf8 in gc_possible_root_when_full (ref=0x7f4ed6aa9700) at ./Zend/zend_gc.c:592
#7  0x000055b6dd68b8c2 in execute_ex (ex=0x7fffdef690d0) at ./Zend/zend_vm_execute.h:58430
#8  0x000055b6dd68f83c in zend_execute (op_array=0x7f4eee88c000, return_value=0x0) at ./Zend/zend_vm_execute.h:59499
#9  0x000055b6dd62625d in zend_execute_scripts (type=-293483520, type@entry=8, retval=retval@entry=0x0, file_count=file_count@entry=3) at ./Zend/zend.c:1694
#10 0x000055b6dd5c2beb in php_execute_script (primary_file=<optimized out>) at ./main/main.c:2543
#11 0x000055b6dd6b545e in do_cli (argc=6, argv=0x55b6de0be160) at ./sapi/cli/php_cli.c:949
#12 0x000055b6dd476d0b in main (argc=6, argv=0x55b6de0be160) at ./sapi/cli/php_cli.c:1337
(gdb)

I have added https://github.com/php/php-src/blob/master/.gdbinit as /root/.gdbinit:

(gdb) zbacktrace
[0x7f4eee81cc00] Wikimedia\ParamValidator\TypeDef->normalizeSettings(array(3)[0x7f4eee81cc50]) /workspace/src/includes/libs/ParamValidator/TypeDef.php:Cannot access memory at address 0x1a

(gdb) printzv 0x7f4eee81cc50
[0x7f4eee81cc50] (refcount=3) array:     Hash(3)[0x7f4ed6aa9700]: {
      [0] "param-type" => [0x7f4ed71cd4c0] (refcount=4) string: integer
      [1] "param-ignore-unrecognized-values" => [0x7f4ed71cd4e0] bool: true
      [2] "param-ignore-range" => [0x7f4ed71cd500] bool: true
}

I don't know much about PHP internal. Looks like the garbage collector kicks in, an object is destroyed and Zend would invoke the __destruct method which somehow fails?

This might be nothing, but, the last entry there was from ApiQueryUserContribs::execute:

[DBQuery] ApiQueryUserContribs::execute [0s] localhost:/workspace/db/quibble-mysql-ofkbmcg5/socket: SELECT  actor_id,actor_name,actor_user  FROM `unittest_actor`    WHERE actor_user IN (206,205,207)   ORDER BY actor_user ASC 
[DBQuery] ApiQueryUserContribs::execute [0.001s] localhost:/workspace/db/quibble-mysql-ofkbmcg5/socket: SELECT  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,temp_rev_user.revactor_actor AS `rev_actor`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len  FROM `unittest_revision_actor_temp` `temp_rev_user` JOIN `unittest_revision` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `unittest_revision_comment_temp` `temp_rev_comment` ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN `unittest_comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN `unittest_actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) JOIN `unittest_page` ON ((page_id = rev_page))   WHERE ((temp_rev_user.revactor_actor IN (208,209,210) )) AND ((rev_deleted & 12) != 12)  ORDER BY rev_actor,revactor_timestamp,revactor_rev LIMIT 5  
ApiMain::setCacheMode: setting cache mode anon-public-user-private
[DBConnection] Wikimedia\Rdbms\LoadBalancer::getLocalConnection: reused a connection for local/0
[DBQuery] User::loadFromDatabase [0s] localhost:/workspace/db/quibble-mysql-ofkbmcg5/socket: SELECT  user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id  FROM `unittest_user` JOIN `unittest_actor` `user_actor` ON ((user_actor.actor_user = user_id))   WHERE user_id = 1  LIMIT 1  
[session] SessionBackend "up8egdoet5frfaliecmel9d1vd2sdb7a" is unsaved, marking dirty in constructor
[session] SessionBackend "up8egdoet5frfaliecmel9d1vd2sdb7a" save: dataDirty=1 metaDirty=1 forcePersist=0
[session] SessionBackend "up8egdoet5frfaliecmel9d1vd2sdb7a" metadata dirty due to user change
[session] SessionBackend "up8egdoet5frfaliecmel9d1vd2sdb7a" save: dataDirty=0 metaDirty=1 forcePersist=0

Change 496392 abandoned by Hashar:

[integration/config@master] jjb: Quibble jobs to capture core files if any

Reason:

I have collected a core dump and got a stacktrace. See T315167#8193136

https://gerrit.wikimedia.org/r/496392

#0  0x000055b6dd61810c in zend_call_function (fci=0x7fffdef690d0, fci_cache=0x7fffdef690b0) at ./Zend/zend_execute_API.c:693
...
#6  0x000055b6dd69bbf8 in gc_possible_root_when_full (ref=0x7f4ed6aa9700) at ./Zend/zend_gc.c:592

Sorry, I hadn't seen this task, I didn't realise it was affecting CI. I fixed this segfault upstream at https://github.com/php/php-src/pull/9374 . My fix was accepted into the 8.0 branch, so it should be in 8.0.23.

The issue was introduced in PHP 5.3, it is not specific to PHP 8.0. We probably just saw it in PHP 8.0 by chance.

We should have a project tag for PHP segfaults. I want to be notified every time there is a PHP segfault bug.

I find it quite amazing you had a patch ready even before we noticed it. Thank you!

They have released 8.0.23, unfortunately your fix (applied as ba029fce) did not make it to the 8.0 series :-\ I guess we have to wait for 8.0.24 which would be released in October.

With php 5.5 we had segmentation faults due to the garbage collector (T142158, T192432). At the time we went to disable it entirely 9ba09de868951af7a91e318138ff76f172c90d53:

/etc/php/5.5/cli/conf.d/99-enable_gc-off.ini
zend.enable_gc = Off

I am guessing we can reintroduce that for our php 8.0 images.

We should have a project tag for PHP segfaults. I want to be notified every time there is a PHP segfault bug.

I have created php-segfault and made you a member (which doesn't do much) you will want to add yourself as a watcher to receive notifications: https://phabricator.wikimedia.org/project/profile/6130/

The corresponding bug https://github.com/php/php-src/issues/9323 was created the same day as this task.

Thanks, it is good to know a fix is coming.

Jdforrester-WMF changed the task status from Open to Stalled.Sep 28 2022, 8:25 PM

Next step is waiting for PHP 8.0.24 to be released upstream, and then updating our CI images to match. Stalled until then.

SitRep: 8.0.24 is released but not yet packaged for Debian. Will poke tomorrow.

Change 837135 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] dockerfiles: [php80] Update to PHP 8.0.24 and cascade

https://gerrit.wikimedia.org/r/837135

Change 837135 merged by jenkins-bot:

[integration/config@master] dockerfiles: [php80] Update to PHP 8.0.24 and cascade

https://gerrit.wikimedia.org/r/837135

Mentioned in SAL (#wikimedia-releng) [2022-09-30T15:12:56Z] <James_F> Docker: Building and publishing PHP 8.0.24 images for T315167