Page MenuHomePhabricator

phpunit segfaults when too many tests are run
Closed, ResolvedPublic

Description

When too many tests are run, phpunit will segfault after it finishes running the tests. It might be related to the total number of asserts, or memory allocated, but mediawiki core is close to some limit, where patchsets with a lot of tests can trigger this condition (see https://gerrit.wikimedia.org/r/#/c/77645/, patchsets 23-25).

I was able to reproduce the segfault in a patch that only contained noop tests too. Somewhere between 150-175 tests that just assertFalse(false) will trigger it. See https://gerrit.wikimedia.org/r/#/c/118573/ for plenty of patchsets that reproduce the issue.

I'm afraid if we add too many more tests to core, we're going to quickly start hitting this more frequently, so it would be good to try and get to the root of it.


Version: unspecified
Severity: normal

Details

Reference
bz62623

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:08 AM
bzimport set Reference to bz62623.
bzimport added a subscriber: Unknown Object (MLST).
csteipp created this task.Mar 13 2014, 9:56 PM

Steps that (should) reproduce this on Gallium (assuming hashar's LocalSettings is readable):

  • Get a local copy of MediaWiki

~> git clone https://gerrit.wikimedia.org/r/mediawiki/core.git
~> cd core
~/core> git fetch https://gerrit.wikimedia.org/r/mediawiki/core refs/changes/73/118573/9 && git checkout FETCH_HEAD
~/core> cd tests/phpunit
~/core/tests/phpunit> gdb --args php phpunit.php --conf /home/hashar/core/LocalSettings.php --exclude-group Database,Broken,ParserFuzz,Stub --log-junit /tmp/junit.xml --; echo $?

The PHPUnit test suite does pass but something segfault very late in PHP. That sounds like the segfault issue we have when generating code coverage report (bug 43972 which has a bunch of stacktrace).

Since apparently the error is not reproducible on labs instance, one would have to setup MediaWiki in his home on gallium and run the tests with gdb attached.

(In reply to Chris Steipp from comment #1)

Steps that (should) reproduce this on Gallium (assuming hashar's
LocalSettings is readable):

I tried this on gallium (with the addition of --with-phpunitdir) and wasn't able to reproduce a segfault. Could you point me to a Jenkins log that shows the segfault so that I can attempt to reproduce the environment?

On a more general note, I think it might be a good idea to modify the test environment so that it dumps corefiles. Trying to reproduce segfaults after the fact isn't always going to be easy (case in point) and being able to fetch the coredump for further analysis could be immensely helpful in the future as well.

Change 119222 had a related patch set uploaded by Faidon Liambotis:
Dump corefiles on segfaults

https://gerrit.wikimedia.org/r/119222

(In reply to Faidon Liambotis from comment #3)

I tried this on gallium (with the addition of --with-phpunitdir) and wasn't
able to reproduce a segfault. Could you point me to a Jenkins log that shows
the segfault so that I can attempt to reproduce the environment?

https://integration.wikimedia.org/ci/job/mediawiki-core-phpunit-databaseless/22233/console
https://integration.wikimedia.org/ci/job/mediawiki-core-phpunit-databaseless/22058/console

This was also an interesting one in that it reported zend heap corruption instead of just segfaulting: https://integration.wikimedia.org/ci/job/mediawiki-core-phpunit-databaseless/22171/console

On a more general note, I think it might be a good idea to modify the test
environment so that it dumps corefiles. Trying to reproduce segfaults after
the fact isn't always going to be easy (case in point) and being able to
fetch the coredump for further analysis could be immensely helpful in the
future as well.

I think that would be useful.

Change 119225 had a related patch set uploaded by Faidon Liambotis:
contint: set up and maintain a coredumps directory

https://gerrit.wikimedia.org/r/119225

Change 119222 merged by jenkins-bot:
Dump corefiles on segfaults

https://gerrit.wikimedia.org/r/119222

I deployed Faidon change which instructs the kernel to dump a core file, then rebased the MediaWiki core change and got a core file generated \O/

The core file is available on tin.eqiad.wmnet:/home/hashar/bug62623.core

BFD: Warning: /home/hashar/bug62623.core is truncated: expected core file size >= 651972608, found: 65609728.
[New LWP 19784]
Cannot access memory at address 0x7fc3bbec72a8
Cannot access memory at address 0x7fc3bbec72a0
(gdb) bt
#0 0x00000000006b9e9f in zend_mm_remove_from_free_list (heap=0x157eb20, mm_block=0x15e35ad0)

at /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c:889

Cannot access memory at address 0x7fffff9ae118

A similar backtrace from https://bugzilla.wikimedia.org/show_bug.cgi?id=43972#c26 using PHP 5.3.10-1ubuntu3.7+wmf1

Program received signal SIGSEGV, Segmentation fault.
0x00000000006b9a7f in zend_mm_remove_from_free_list (heap=0xdd5aa0, mm_block=0x15c28738)

at /home/akosiaris/php5/php5-5.3.10/Zend/zend_alloc_canary.c:889

889 /home/akosiaris/php5/php5-5.3.10/Zend/zend_alloc_canary.c: No such file or directory.

(gdb) bt
#0 0x00000000006b9a7f in zend_mm_remove_from_free_list (heap=0xdd5aa0, mm_block=0x15c28738)

at /home/akosiaris/php5/php5-5.3.10/Zend/zend_alloc_canary.c:889

#1 0x00000000006ba550 in _zend_mm_free_canary_int (heap=0xdd5aa0, p=0x15c2a2c0)

at /home/akosiaris/php5/php5-5.3.10/Zend/zend_alloc_canary.c:2127

#2 0x00000000006a7fa6 in zend_hash_destroy (ht=0x15b241c8) at /home/akosiaris/php5/php5-5.3.10/Zend/zend_hash.c:734
#3 0x00000000006bb759 in zend_object_std_dtor (object=0x15be1cf8)

at /home/akosiaris/php5/php5-5.3.10/Zend/zend_objects.c:45

#4 0x00000000006bb779 in zend_objects_free_object_storage (object=0x15be1cf8)

at /home/akosiaris/php5/php5-5.3.10/Zend/zend_objects.c:126

#5 0x00000000006bf2cf in zend_objects_store_free_object_storage (objects=0xdd42d8)

at /home/akosiaris/php5/php5-5.3.10/Zend/zend_objects_API.c:92

#6 0x000000000068c283 in shutdown_executor () at /home/akosiaris/php5/php5-5.3.10/Zend/zend_execute_API.c:304
#7 0x000000000069a355 in zend_deactivate () at /home/akosiaris/php5/php5-5.3.10/Zend/zend.c:963
#8 0x00000000006470a0 in php_request_shutdown (dummy=0xdd5aa0) at /home/akosiaris/php5/php5-5.3.10/main/main.c:1664
#9 0x000000000042b915 in main (argc=32767, argv=0x7fffffffe7ba)

at /home/akosiaris/php5/php5-5.3.10/sapi/cli/php_cli.c:1367

(gdb)

Faidon bumped core limit to 2GB so we can get a full core file (~650MB). Backtrace is:

Program terminated with signal 11, Segmentation fault.
#0 0x00000000006b9e9f in zend_mm_remove_from_free_list (heap=0x12e0b20, mm_block=0x15beb618) at /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c:889
889 /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c: No such file or directory.

#0 0x00000000006b9e9f in zend_mm_remove_from_free_list (heap=0x12e0b20, mm_block=0x15beb618) at /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c:889
#1 0x00000000006ba985 in _zend_mm_free_canary_int (heap=0x12e0b20, p=0x15beb5a8) at /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c:2122
#2 0x00000000006a83c6 in zend_hash_destroy (ht=0x15beb2e0) at /tmp/buildd/php5-5.3.10/Zend/zend_hash.c:734
#3 0x00000000006bbb79 in zend_object_std_dtor (object=0x15bb2d78) at /tmp/buildd/php5-5.3.10/Zend/zend_objects.c:45
#4 0x00000000006bbb99 in zend_objects_free_object_storage (object=0x15bb2d78) at /tmp/buildd/php5-5.3.10/Zend/zend_objects.c:126
#5 0x00000000006bf6ef in zend_objects_store_free_object_storage (objects=0xdd52d8) at /tmp/buildd/php5-5.3.10/Zend/zend_objects_API.c:92
#6 0x000000000068c6a3 in shutdown_executor () at /tmp/buildd/php5-5.3.10/Zend/zend_execute_API.c:304
#7 0x000000000069a775 in zend_deactivate () at /tmp/buildd/php5-5.3.10/Zend/zend.c:963
#8 0x00000000006474c0 in php_request_shutdown (dummy=0x12e0b20) at /tmp/buildd/php5-5.3.10/main/main.c:1664
#9 0x000000000042baa5 in main (argc=32767, argv=0x7fff166f1533) at /tmp/buildd/php5-5.3.10/sapi/cli/php_cli.c:1367

Core file is available at tin.eqiad.wmnet:/home/hashar/bug62623-2.core

The procedure in comment #1 didn't work for me, but I was able to reproduce it on gallium with:

su -s/bin/bash jenkins-slave

$ cd /srv/ssd/jenkins-slave/workspace/mediawiki-core-phpunit-databaseless/tests/phpunit
$ php phpunit.php --with-phpunitdir /srv/deployment/integration/phpunit/vendor/phpunit/phpunit --exclude-group Database,Broken,ParserFuzz,Stub --conf /srv/ssd/jenkins-slave/workspace/mediawiki-core-phpunit-databaseless/LocalSettings.php

Note that /srv/ssd/jenkins-slave/workspace/mediawiki-core-phpunit-databaseless/ is a Jenkins workspace and it would be a different code base whenever a new instance of the job is running on this machine.

The LocalSettings.php has a bunch of code snippets injected at the bottom. They come from integration/jenkins.git in mediawiki/conf.d/* . Maybe some PHP there might be the root cause.

(In reply to Tim Starling from comment #12)

$ cd
/srv/ssd/jenkins-slave/workspace/mediawiki-core-phpunit-databaseless/tests/
phpunit

But of course, that only worked for a few minutes, and then the source tree was replaced with some other patchset and it stopped working. I still don't have a reliable reproduction procedure.

(In reply to Antoine "hashar" Musso from comment #11)

Core file is available at tin.eqiad.wmnet:/home/hashar/bug62623-2.core

It says that it crashed while freeing the property table an XMPReader object, specifically the bucket for the extendedXMPOffset property. The bucket was apparently valid and had a valid allocation header, but when the allocator tried to merge the newly-freed item with the next item in the region, it found that that next item had garbage in its header.

With a deterministic reproduction procedure, it would be possible to set a hardware watchpoint on the corrupt allocation header, and see what writes to that memory location over the course of the program. That may allow us to identify the source of the dangling pointer.

demon added a comment.Mar 25 2014, 2:01 AM

(In reply to Tim Starling from comment #14)

(In reply to Tim Starling from comment #12)

$ cd
/srv/ssd/jenkins-slave/workspace/mediawiki-core-phpunit-databaseless/tests/
phpunit

But of course, that only worked for a few minutes, and then the source tree
was replaced with some other patchset and it stopped working. I still don't
have a reliable reproduction procedure.

How about cloning the repo to somewhere else in the same /srv/ssd/jenkins-slave/workspace/ folder and then check out the offending commit?

$ cd /srv/ssd
$ git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git \

--reference=/srv/ssd/gerrit/mediawiki/core.git bug62623
Cloning into 'bug62623'...

$ git fetch origin refs/changes/45/77645/28
$ git checkout FETCH_HEAD -b 77645,28

Then installed MediaWiki using sqlite:

$ WORKSPACE="pwd" /srv/deployment/integration/slave-scripts/bin/mw-install-sqlite.sh

Applied custom settings:

$ WORKSPACE="pwd" /srv/deployment/integration/slave-scripts/bin/mw-apply-settings.sh

Then running the tests:

$ ulimit -c 2097152
$ cd tests/phpunit/
$ php phpunit.php --with-phpunitdir /srv/deployment/integration/phpunit/vendor/phpunit/phpunit \

 --exclude-group Database,Broken,ParserFuzz,Stub \
--conf /srv/ssd/bug62623/LocalSettings.php

Some tests fails due to permission errors with the files /tmp/mw-UIDGenerator-UID-* , but it does segfault.

$ gdb /usr/bin/php /srv/ssd/bug62623/tests/phpunit/core
Program terminated with signal 11, Segmentation fault.
#0 0x00000000006b9e9f in zend_mm_remove_from_free_list (heap=0x1269ac0, mm_block=0x15b2de50)

at /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c:889

889 /tmp/buildd/php5-5.3.10/Zend/zend_alloc_canary.c: No such file or directory.
$

I spent a while looking at this, but I haven't managed to figure out what's going on yet. Unfortunately, the affected pointer address changes from run to run, so I haven't managed to set a watchpoint on it. I thought for a while that XMPReader was directly involved, since the crash always occurred while destroying an XMPReader, but removing its test caused the segfault to happen elsewhere.

I discovered the checkpoint feature in gdb, and used it to set a watchpoint on a consistent pointer location. It showed an XMPReader->xmlParser property being freed during garbage collection, and then the same zval was decref'd later. The reference count was in a memory location now used for an allocator header.

It appears to be a garbage collection bug: the garbage collector scans arrays and object properties for live zval pointers, but it has no way to scan inside resources. XMPReader stores a reference to itself inside an XML resource, by passing $this to xml_set_element_handler() and xml_set_character_data_handler(). $this has an xmlParser property pointing to the resource, completing the circle. After the XMPReader goes out of scope in its caller, only the resource reference would be keeping it alive -- but the garbage collector doesn't know about that.

I did see a zend_mm_panic() even after disabling the XMPReader tests, but it's possible that another test case hit the same GC bug. I haven't looked into it.

Setting zend.enable_gc=0 should be a suitable workaround -- it worked for me.

Another patch segfaulting is https://gerrit.wikimedia.org/r/#/c/106517/

Jenkins output:
https://integration.wikimedia.org/ci/job/mediawiki-core-phpunit-databaseless/22992/console

I have grabbed the core file from it, that is the same zend_mm_remove_from_free_list segfault

One can add -d zend.enable_gc=0 in the shell scripts which wraps around phpunit. The are located in integration/jenkins.git :

bin/mw-run-phpunit.sh
bin/mw-run-phpunit-allexts.sh

Then push to Gerrit and get it merged. Finally head to tin.eqiad.wmnet:/srv/deployment/integration/slave-scripts and git-deploy the change.

Change 121697 had a related patch set uploaded by CSteipp:
Disable gc during tests

https://gerrit.wikimedia.org/r/121697

Should we file a separate bug for the XMPReader issue?

I spent some more time on this today.

In zend_objects_store_call_destructors() there is this:

obj->refcount++;
obj->dtor(obj->object, i TSRMLS_CC);
obj = &objects->object_buckets[i].bucket.obj;
obj->refcount--;

XMPReader::__destruct() is called here, and frees the last reference to itself by freeing the XML resource, so obj->refcount-- reduces the reference count to zero. This is not checked for, and so the object is not freed. Normally, objects have their reference count decremented in the del_ref handler (zend_objects_store_del_ref_by_handle_ex()), and if it reaches zero, the free_storage handler is called and zend_object_store_bucket.valid is set to zero.

A zero reference count on a live object should be pretty harmless, even though the del_ref handler doesn't guard against it, because there really are no references, so there's a limited number of ways to access it. But in this case, it appears that the zero reference count causes the GC to not free the object, but it does free the object's xmlReader property. The object bucket is still valid, so zend_objects_store_free_object_storage() does a double free.

Although the bug may not be actually in the GC, it does seem to require the GC to run at a very specific time in the shutdown process. So a possible alternative workaround would be something like:

  • a/tests/phpunit/phpunit.php

+++ b/tests/phpunit/phpunit.php
@@ -114,4 +114,9 @@ if ( PHPUnit_Runner_Version::id() !== '@package_version@'
if ( !class_exists( 'PHPUnit_TextUI_Command' ) ) {

require_once 'PHPUnit/Autoload.php';

}
+register_shutdown_function( function() {
+ gc_collect_cycles();
+ gc_disable();
+} );
+
MediaWikiPHPUnitCommand::main();

Change 123146 had a related patch set uploaded by CSteipp:
Prevent GC during phpunit shutdown

https://gerrit.wikimedia.org/r/123146

I was finally able to reproduce the segfault on a new vagrant instance. The steps to reproduce it are below.

But since I was able to accurately test the fixes:

  • gerrit 121697 fixes the issue. Memory use (as reported by phpunit) increased from 596.75Mb to 610.25Mb without the gc running.
  • Adding the patch to phpunit.php (gerrit 123146) also fixed the issue, and memory use stayed at 596.75Mb

So I'm guessing we should go with gerrit 123146, since that seems to prevent using any extra memory during the test runs?

Install mediawiki vagrant:
$ git clone https://gerrit.wikimedia.org/r/mediawiki/vagrant
$ cd vagrant
$ vagrant up

ssh into the instance and:

get all the packages needed to run all of the databaseless tests

$ sudo apt-get install djvulibre-bin php5-tidy

$ mkdir /srv/deployment
$ mkdir /srv/deployment/integration
$ mkdir /srv/deployment/src
$ mkdir /srv/deployment/src/mediawiki/
$ chown -R vagrant:vagrant /srv/deployment

$ cd /srv/deployment/integration
$ git clone https://gerrit.wikimedia.org/r/p/integration/phpunit.git
$ ln -s /vagrant/mediawiki /srv/deployment/src/mediawiki/core

$ cd ~
$ git clone https://gerrit.wikimedia.org/r/p/integration/jenkins.git
$ ln -s /home/vagrant/jenkins /srv/deployment/integration/slave-scripts

  1. Everything is now setup enough like jenkins that the
  2. integration/jenkins scripts will run

Checkout the patch causing the segfault

$ cd /vagrant/mediawiki
$ git fetch https://gerrit.wikimedia.org/r/mediawiki/core refs/changes/45/77645/28 && git checkout FETCH_HEAD

Run everything like Jenkins does

$ cd ~/jenkins/bin
$ export WORKSPACE=/srv/deployment
$ ./mw-apply-settings.sh
$ ./mw-run-phpunit.sh databaseless

Change 121697 abandoned by CSteipp:
Disable gc during tests

Reason:
Ie59780e37cbea027cc204a43df406667c3f432ab seems like a slightly better way to do it (https://bugzilla.wikimedia.org/show_bug.cgi?id=62623#c25)

https://gerrit.wikimedia.org/r/121697

Change 123146 merged by jenkins-bot:
Prevent GC during phpunit shutdown

https://gerrit.wikimedia.org/r/123146

Change 127485 had a related patch set uploaded by Bartosz Dziewoński:
Prevent GC during phpunit shutdown

https://gerrit.wikimedia.org/r/127485

Change 127485 merged by jenkins-bot:
Prevent GC during phpunit shutdown

https://gerrit.wikimedia.org/r/127485

Backported the patch to MediaWiki 1.23.

Change 119225 abandoned by Faidon Liambotis:
contint: set up and maintain a coredumps directory

https://gerrit.wikimedia.org/r/119225

Might be occuring again with Trusty / Zend 5.5. Though it is most probably a different issue T142158