
Merge blocker: quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11)
Closed, Resolved · Public

Description

Since yesterday quibble-vendor-mysql-hhvm-docker has been failing: See https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/36044/console

Relevant parts:

09:39:45 subprocess.CalledProcessError: Command '['php', 'tests/phpunit/phpunit.php', '--debug-tests', '--testsuite', 'extensions', '--exclude-group', 'Broken,ParserFuzz,Stub,Database', '--log-junit', '/workspace/log/junit-dbless.xml']' returned non-zero exit status -11

Googling seems to indicate that -11 usually means segmentation fault (aka crash).
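
For context: Python's subprocess module reports a child killed by a signal as the negative signal number, so -11 means signal 11, i.e. SIGSEGV. A quick shell demonstration (a sketch):

$ bash -c 'kill -SEGV $$'; echo "exit=$?"
exit=139
# the shell encodes "killed by signal N" as 128+N (139 = 128+11);
# Python's subprocess reports the same event as returncode -11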

Event Timeline


Yeah, this has been failing a lot of late.

I haven't seen a working build for CX since yesterday, so I would say it is always failing now.

And yet a bunch of things have passed these tests and merged in the gate, so it's not 100%:

Looking this morning I see 2 builds out of the past 30 (for quibble-vendor-mysql-docker) that have failed in this way:

Those two builds:

  • happened on 2 different test executors
  • happened at 2 different points in the job run (both during PHPUnit, but on different sets of PHPUnit tests)

I'm unaware of any recent changes to those jobs or the image of the container that runs the jobs (adding @hashar to correct me if I'm wrong about this).

I speculated a bit about parallel execution yesterday; that is, multiple of these jobs running on the same Docker host at the same time, competing for resources. If that is the problem, we'd need to figure out how to set some kind of affinity for certain resource-intensive jobs. Most nodes in CI have 5 executors; however, most jobs in CI are small, so 5 jobs running on a node is fine if 4 of them are linting and 1 of them is a quibble-vendor-mysql-docker job, but not the other way around.
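
If contention is the culprit, one quick check during a failure window (a sketch; the host name is illustrative) is how many job containers share the same Docker host:

$ ssh integration-slave-docker-1043 'sudo docker ps --format "{{.Names}} {{.Image}}"'
# five executors per node means up to five of these can be running concurrently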

Are there other things that might explain this recent spike, given that nothing has changed on the CI side? I.e., are there recent test changes that might have increased resource use?

greg lowered the priority of this task from Unbreak Now! to High. Feb 21 2019, 8:18 PM
greg added a subscriber: greg.

Looking this morning I see 2 builds out of the past 30 (for quibble-vendor-mysql-docker) that have failed in this way:

Those two builds:

  • happened on 2 different test executors
  • happened at 2 different points in the job run (both during PHPUnit, but on different sets of PHPUnit tests)

Reducing to High per the above.

@thcipriani wrote:
I'm unaware of any recent changes to those jobs or the image of the container that runs the jobs (adding @hashar to correct me if I'm wrong about this).

A CI build is similar to dancing on a minefield, wearing a blindfold, while an earthquake is going on. You never know where you are going to land. Or to put it another way: there are a lot of potential factors. Some ideas:

  • some kernel ulimit or Docker limit is reached (I would blame a memory limit)

From the integration Grafana board, memory on the instances seems fine. However, I think it aggregates averages, so a short spike would not show up. Regardless, the instances have 32 GB of RAM, which is a lot, so I am more tempted to blame a per-process memory limit.
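
To check whether such a per-process limit is even set, the ulimits can be inspected from inside the same container (a sketch):

$ docker run --rm --entrypoint=bash docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-1 -c 'ulimit -a'
# look at "max memory size" and "virtual memory"; "unlimited" would rule out
# a per-process cap coming from the container itself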

  • the quibble hhvm jobs got moved from Jessie to Stretch.

I haven't looked at that today. I don't even know whether production has migrated the HHVM MediaWiki servers from Jessie to Stretch. I have not verified whether the HHVM versions match between the Quibble containers for Jessie and Stretch.

And -11 would be a segmentation fault.

Anyway for the build https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/36097/console which ran on integration-slave-docker-1043 and died at Feb 21 15:31:39, /var/log/kern.log gives me:

Feb 21 15:31:39 integration-slave-docker-1043 kernel: [7943146.540511] php[14610]: segfault at 7f1b16ffad13 ip 00007f1b64787c5e sp 00007f1b53d19d30 error 4 in libpthread-2.24.so[7f1b64780000+18000]

Not sure why the kernel claims it is php for an HHVM build.
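
For what it is worth, the kernel line already narrows things down: "error 4" decodes (as x86 page-fault flags) to a user-mode read of an unmapped address, and the faulting instruction pointer falls inside libpthread. Tallying such lines per instance is a one-liner (a sketch):

$ grep -c segfault /var/log/kern.log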


Hints off the top of my head:

  • rollback the Quibble HHVM jobs to the jessie based container?
  • reach out to SRE to find out which OS production is using for the HHVM mediawiki app servers. But I think it is Stretch now and regardless we have the same HHVM version on both.
  • try to reproduce at home with Docker and the same container?
  • once we get a repro, get the hhvm version, and invoke the gdb gods (really: we need a stacktrace)
  • grep build logs on contint1001 to find out a potential pattern. Though we only have 15 days of logs iirc

Bonus: it would be nice to check Wikimedia cluster production kernel logs. There might be similar segfaults floating around.

For the mediawiki app servers, there are barely any segfaults in production (besides ploticus). So at least production is fine.

As for the kernel log reporting php[xxx]:

integration-slave-docker-1048:~$ sudo docker run --rm -it --entrypoint=bash docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-1
nobody@67d2d2e2183a:/workspace$ which php
/usr/bin/php
nobody@67d2d2e2183a:/workspace$ readlink -f /usr/bin/php
/usr/bin/hhvm
nobody@67d2d2e2183a:/workspace$

I guess it does not bother following the symlink and just reuses the invoked name (php).
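
That matches how the kernel names processes: the segfault line shows the process comm, which is taken from the name the binary was invoked under rather than the resolved path. A quick demonstration (a sketch):

$ ln -s /bin/sleep /tmp/php
$ /tmp/php 60 & cat /proc/$!/comm
php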


Imho we are stuck on reproducing the issue. At least ContentTranslation seems to reliably segfault.

Joe added a subscriber: Joe. Feb 22 2019, 5:50 AM

And -11 would be a segmentation fault.

One of the reasons why HHVM could die is a bad version of some extension is loaded. Like an old version of luasandbox or wikidiff.
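
One cheap way to check that is to list the HHVM extension packages baked into the container and compare them against production (a sketch; a version comparison appears further down in this thread):

$ docker run --rm --entrypoint=bash docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-1 -c 'dpkg -l | grep hhvm'
# compare the hhvm-luasandbox and hhvm-wikidiff2 versions with an app server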

  • rollback the Quibble HHVM jobs to the jessie based container?

This makes no sense.

  • reach out to SRE to find out which OS production is using for the HHVM mediawiki app servers. But I think it is Stretch now and regardless we have the same HHVM version on both.

Yes, we're using stretch. And I'm pretty sure if the problem is an HHVM bug and not a configuration bug, it will be easy to reproduce in production.

  • try to reproduce at home with Docker and the same container?

I suspect the problem you're seeing has to do not with CI but with the image build process; still, trying that wouldn't hurt. Have you checked the logs on the server where this was running? Any message about exceeding capabilities?

  • once we got a repro, get the hhvm version, and invoke the gdb gods (really: we need a stacktrace)

I think there are some steps you can follow before getting to gdb, but yes a repro case would be good.

  • grep build logs on contint1001 to find out a potential pattern. Though we only have 15 days of logs iirc

Bonus: it would be nice to check Wikimedia cluster production kernel logs. There might be similar segfaults floating around.

The segfaults seem to have disappeared; did someone change something?

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/486008 is failing with

07:41:35 HeadlessChrome 71.0.3578 (Linux 0.0.0) LOG: 'Exception in module-execute in module wikibase.lexeme.widgets.InvalidLanguageIndicator:'
07:41:35 HeadlessChrome 71.0.3578 (Linux 0.0.0) WARN: TypeError: Cannot read property 'url' of null

I have not seen this failure on other patches, so it's not clear whether it is related to the code changes or not.

07:41:35 HeadlessChrome 71.0.3578 (Linux 0.0.0) LOG: 'Exception in module-execute in module wikibase.lexeme.widgets.InvalidLanguageIndicator:'
07:41:35 HeadlessChrome 71.0.3578 (Linux 0.0.0) WARN: TypeError: Cannot read property 'url' of null

I have not seen this failure on other patches, so it's not clear whether it is related to the code changes or not.

Filed as T217627 (Merge blocker: Exception in module-execute in module wikibase.lexeme.widgets.InvalidLanguageIndicator) as I saw it on other CX patches too.

If I am reading it right, the segfault is now affecting the database and ffmpeg as well: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/37943/console

Nikerabbit renamed this task from quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11) to Merge blocker: quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11).Mar 5 2019, 8:41 AM

If I am reading it right, the segfault is now affecting the database and ffmpeg as well: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/37943/console

On further investigation, I have reported this separately as T217654: Merge blocker: The table 'l10n_cache' is full in quibble-vendor-mysql-hhvm-docker

It is a segfault in HHVM. The issue boils down to reproducing it and capturing a stacktrace/core file. HHVM is supposedly able to record a stacktrace, but I don't know the settings for it, nor whether it is enabled.

Supposedly, using the same container and repositories and then running the PHPUnit test suite should eventually lead to a reproduction. I haven't found time to do that myself though :/

Supposedly, using the same container and repositories and then running the PHPUnit test suite should eventually lead to a reproduction

It reproduces for me several times a day now :( Something that triggers it has been added recently: it didn't happen before mid-February or so, at least not for the builds I've seen, and now it happens on almost every extension patch.

I have looked at all build logs on Jenkins. The only job having the segfault is quibble-vendor-mysql-hhvm-docker.

They are all failing due to PHPUnit triggering a segfault (either with or without @group Database). The MediaWiki test suite logs to mw-debug-cli.log; the last line for each of the failing builds:

[PHPUnitCommand] END suite
[PHPUnitCommand] END suite
[PHPUnitCommand] END suite CirrusSearch\Search\RescoreBuilderTest
[PHPUnitCommand] END suite Wikibase\Repo\Tests\ParserOutput\PlaceholderExpander\ExternallyRenderedEntityViewPlaceholderExpanderTest
[PHPUnitCommand] End test AbuseFilterConsequencesTest::testMediaWikiTestCaseParentSetupCalled
[PHPUnitCommand] End test AutoLoaderStructureTest::testPSR4Completeness with data set #400
[PHPUnitCommand] End test CirrusSearch\Profile\SearchProfileServiceFactoryTest::testSaneDefaults with data set "crossproject block order"
[PHPUnitCommand] End test CirrusSearch\SearcherTest::testSearchText with data set "browsertest_007-default"
[PHPUnitCommand] End test CirrusSearch\SearcherTest::testSearchText with data set "insource_001-default"
[PHPUnitCommand] End test LessFileCompilationTest::testLessFileCompilation
[PHPUnitCommand] End test MobileFormatterTest::testHtmlTransform with data set #28
[PHPUnitCommand] End test ResourcesTest::testFileExistence with data set #1133
[PHPUnitCommand] End test ResourcesTest::testFileExistence with data set #2510
[PHPUnitCommand] End test Scribunto_LuaUstringLibraryTest::testLua with data set #141
[PHPUnitCommand] End test SpecialPageFatalTest::testSpecialPageDoesNotFatal with data set "Shortpages"
[PHPUnitCommand] End test Wikibase\Lexeme\Tests\MediaWiki\Api\AddFormTest::testGivenInvalidParameter_errorIsReturned with data set "Lexeme is not found"
[PHPUnitCommand] End test Wikibase\Lib\Tests\Formatters\PropertyValueSnakFormatterTest::testFormatSnak with data set "UnDeserializableValue, fail"
[PHPUnitCommand] Start test AbuseFilterParserTest::testExpectedNotFoundException with data set #5
[PHPUnitCommand] Start test AutoLoaderStructureTest::testAutoLoadConfig
[PHPUnitCommand] Start test AutoLoaderStructureTest::testAutoloadOrder
[PHPUnitCommand] Start test AutoLoaderStructureTest::testPSR4Completeness with data set #178
[PHPUnitCommand] Start test AutoLoaderStructureTest::testPSR4Completeness with data set #59
[PHPUnitCommand] Start test Capiunto\Test\BasicRowTest::testOutput
[PHPUnitCommand] Start test Capiunto\Test\InfoboxRenderModuleTest::testLua with data set #9
[PHPUnitCommand] Start test CirrusSearch\LanguageDetectTest::testTextCatDetector with data set #0
[PHPUnitCommand] Start test CirrusSearch\Maintenance\ScriptsRunnableTest::testScriptCanBeLoaded with data set #11
[PHPUnitCommand] Start test CirrusSearch\SearcherTest::testSearchText with data set "browsertest_107-fullyfeatured"
[PHPUnitCommand] Start test CirrusSearch\SearcherTest::testSearchText with data set "browsertest_198-fullyfeatured"
[PHPUnitCommand] Start test ExtensionJsonValidationTest::testPassesValidation with data set #37
[PHPUnitCommand] Start test ExtensionJsonValidationTest::testPassesValidation with data set #40
[PHPUnitCommand] Start test MediaWiki\MassMessage\CategorySpamlistLookupTest::testGetTargets
[PHPUnitCommand] Start test ParserIntegrationTest::testParse with data set "parserTests.txt: XSS is escaped (inline)"
[PHPUnitCommand] Start test Scribunto_LuaCommonTest::testLua with data set #2
[PHPUnitCommand] Start test Scribunto_LuaHtmlLibraryTest::testLua with data set #49
[PHPUnitCommand] Start test Scribunto_LuaLanguageLibraryTest::testLua with data set #28
[PHPUnitCommand] Start test Scribunto_LuaLanguageLibraryTest::testLua with data set #31
[PHPUnitCommand] Start test Scribunto_LuaSandboxInterpreterTest::testTimeLimit
[PHPUnitCommand] Start test Scribunto_LuaSandboxTest::testArgumentParsingTime
[PHPUnitCommand] Start test Scribunto_LuaSiteLibraryTest::testLua with data set #14
[PHPUnitCommand] Start test Scribunto_LuaStandaloneTest::testLua with data set #10
[PHPUnitCommand] Start test Scribunto_LuaTextLibraryTest::testLua with data set #23
[PHPUnitCommand] Start test Scribunto_LuaTextLibraryTest::testLua with data set #25
[PHPUnitCommand] Start test Scribunto_LuaTextLibraryTest::testLua with data set #58
[PHPUnitCommand] Start test Scribunto_LuaTextLibraryTest::testLua with data set #67
[PHPUnitCommand] Start test Scribunto_LuaTextLibraryTest::testLua with data set #70
[PHPUnitCommand] Start test Scribunto_LuaTitleLibraryTest::testLua with data set #34
[PHPUnitCommand] Start test Scribunto_LuaTitleLibraryTest::testLua with data set #36
[PHPUnitCommand] Start test Scribunto_LuaTitleLibraryTest::testLua with data set #40
[PHPUnitCommand] Start test Scribunto_LuaTitleLibraryTest::testLua with data set #48
[PHPUnitCommand] Start test Scribunto_LuaTitleLibraryTest::testMediaWikiTestCaseParentSetupCalled
[PHPUnitCommand] Start test Scribunto_LuaUriLibraryTest::testLua with data set #145
[PHPUnitCommand] Start test Scribunto_LuaUriLibraryTest::testLua with data set #179
[PHPUnitCommand] Start test Scribunto_LuaUriLibraryTest::testLua with data set #205
[PHPUnitCommand] Start test Scribunto_LuaUriLibraryTest::testLua with data set #269
[PHPUnitCommand] Start test Scribunto_LuaUriLibraryTest::testLua with data set #28
[PHPUnitCommand] Start test Scribunto_LuaUriLibraryTest::testLua with data set #49
[PHPUnitCommand] Start test Scribunto_LuaUstringLibraryTest::testLua with data set #154
[PHPUnitCommand] Start test Scribunto_LuaUstringLibraryTest::testLua with data set #28
[PHPUnitCommand] Start test Scribunto_LuaUstringLibraryTest::testLua with data set #30
[PHPUnitCommand] Start test Scribunto_LuaUstringLibraryTest::testLua with data set #95
[PHPUnitCommand] Start test Scribunto_LuaUstringLibraryTest::testLua with data set #99
[PHPUnitCommand] Start test SpecialPageFatalTest::testSpecialPageDoesNotFatal with data set "Statistics"
[PHPUnitCommand] Start test SpecialPageFatalTest::testSpecialPageDoesNotFatal with data set "UnconnectedPages"
[PHPUnitCommand] Start test Tests\MediaWiki\Minerva\SkinMinervaTest::testPrepareUserButton with data set #3
[PHPUnitCommand] Start test Wikibase\Client\Tests\DataAccess\Scribunto\Scribunto_LuaWikibaseEntityLibraryTest::testLua with data set #40
[PHPUnitCommand] Start test Wikibase\Client\Tests\DataAccess\Scribunto\Scribunto_LuaWikibaseLibraryTest::testEntityExists with data set #1
[PHPUnitCommand] Start test Wikibase\Lexeme\Tests\MediaWiki\Api\LexemeEditEntityTest::testGivenExistingFormAndAddingFormRepresentation_formPropertyIsUpdated
[PHPUnitCommand] Start test Wikibase\Lib\Tests\Formatters\ItemPropertyIdHtmlLinkFormatterTest::testPropertyDoesNotHaveLabelInUserLanguage_ResultingLinkUsesIdAsAText
[PHPUnitCommand] Start test Wikibase\Lib\Tests\LanguageFallbackChainFactoryTest::testNewFromLanguage with data set #8
[PHPUnitCommand] Start test Wikibase\Lib\Tests\Store\Sql\TermSqlIndexTest::testDeleteTermsForEntity_entitySourceBasedFederation
[PHPUnitCommand] Start test WikibaseQuality\ConstraintReport\Tests\Maintenance\ImportConstraintEntitiesTest::testImportEntityFromJson
[PHPUnitCommand] Start test Wikibase\Repo\Tests\Api\ApiUserBlockedTest::testBlock with data set #6
[PHPUnitCommand] Start test Wikibase\Repo\Tests\Api\GetEntitiesTest::testGetEntities with data set #82
[PHPUnitCommand] Start test Wikibase\Repo\Tests\Api\MergeItemsTest::testMergeRequest with data set "labelMerge"
[PHPUnitCommand] Start test Wikibase\Repo\Tests\Api\RemoveQualifiersTest::testRequests
[PHPUnitCommand] Start test Wikibase\Repo\Tests\Api\SetClaimTest::testAddInvalidClaim
[PHPUnitCommand] Start test Wikibase\Repo\Tests\ValidatorBuildersTest::testStringValueValidation with data set "U+000B: Vertical tab"

Which to me means: HHVM is broken. I am just going to roll back to the previous container (which is Jessie-based), at least to confirm or dismiss the container environment.

Change 496162 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "jjb: Switch MW+Parsoid jobs on hhvm to Stretch (quibble-stretch-hhvm)"

https://gerrit.wikimedia.org/r/496162

hashar added a comment (edited). Mar 13 2019, 1:41 PM

I have rolled back the jobs' container:

- docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-1
+ docker-registry.wikimedia.org/releng/quibble-jessie-hhvm:0.0.28

I also looked at the captured mw-debug-cli.log files; the only job affected is quibble-vendor-mysql-hhvm-docker, with 81 occurrences out of 1166 builds. There is a total of 3300 build records for *quibble*hhvm* jobs.
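
For reference, a count like that can be produced by grepping the archived build records. A sketch; the path below is an assumption, not the actual layout on contint1001:

$ grep -rl 'returned non-zero exit status -11' /var/lib/jenkins/jobs/quibble-vendor-mysql-hhvm-docker/builds/ | wc -l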

next steps

  • see whether HHVM still segfaults; would need more than a few hours to confirm.
  • try to reproduce the issue, though it seems rather infrequent (81/1166 is roughly 7%).

Change 496162 merged by jenkins-bot:
[integration/config@master] Revert "jjb: Switch MW+Parsoid jobs on hhvm to Stretch (quibble-stretch-hhvm)"

https://gerrit.wikimedia.org/r/496162

HHVM package versions in the containers:

Package          Jessie              Stretch
hhvm             3.18.5+dfsg-1+wmf5  3.18.5+dfsg-1+wmf8+deb9u1
hhvm-luasandbox  2.0.14~jessie1      2.0.14~stretch2
hhvm-tidy        0.1.3~jessie2       0.1.3~jessie2+deb9u1
hhvm-wikidiff2   1.5.1               1.7.3

Change 496168 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: move HHVM core dump report to workspace

https://gerrit.wikimedia.org/r/496168

I have rolled back the jobs' container:

- docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-1
+ docker-registry.wikimedia.org/releng/quibble-jessie-hhvm:0.0.28

That got rolled back due to an issue with Karma / Chromium: the Jessie container ships with Chromium 59, which is quite old. So that was a bad move.

Change 496168 merged by jenkins-bot:
[integration/config@master] docker: move HHVM core dump report to workspace

https://gerrit.wikimedia.org/r/496168

Mentioned in SAL (#wikimedia-releng) [2019-03-13T18:48:37Z] <hashar> Building containers releng/quibble-jessie-hhvm and releng/quibble-stretch-hhvm with HHVM core_dump_report enabled # T216689

Change 496239 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Enable HHVM crash report for Quibble jobs

https://gerrit.wikimedia.org/r/496239

Change 496239 merged by jenkins-bot:
[integration/config@master] Enable HHVM crash report for Quibble jobs

https://gerrit.wikimedia.org/r/496239

The container seems to have the proper configuration for HHVM:

$ docker run --rm -it --entrypoint=hhvm docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2 --php -i|grep core_dump
hhvm.debug.core_dump_email => 
hhvm.debug.core_dump_report => 1
hhvm.debug.core_dump_report_directory => /workspace/log

I have rebuilt some ContentTranslation builds which did cause some segmentation faults, but no core dump report was captured. I don't even know whether one was generated.

hashar added a comment (edited). Mar 13 2019, 8:25 PM

For the record, I had a while loop on my machine running the PHPUnit tests with the same container, and it did not fail. The commands used to reproduce:

#!/bin/bash

install -d -m 777 cache
install -d -m 777 db
install -d -m 777 log
install -d -m 777 src

exec docker run \
	--init \
	--volume /home/hashar/projects:/srv/git:ro \
	--volume "$(pwd)/cache:/cache" \
	--volume "$(pwd):/workspace" \
	--env ZUUL_PROJECT=mediawiki/extensions/ContentTranslation \
	--env ZUUL_URL=https://gerrit.wikimedia.org/r/p \
	--env ZUUL_BRANCH=master \
	docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2 \
	--skip-zuul \
	--skip-deps \
	--packages-source vendor \
	--db mysql \
	--db-dir /workspace/db \
	--dump-db-postrun \
	mediawiki/core mediawiki/skins/Vector mediawiki/vendor mediawiki/extensions/ContentTranslation mediawiki/skins/MinervaNeue mediawiki/extensions/AbuseFilter mediawiki/extensions/AntiSpoof mediawiki/extensions/ArticlePlaceholder mediawiki/extensions/BetaFeatures mediawiki/extensions/Capiunto mediawiki/extensions/CentralAuth mediawiki/extensions/CheckUser mediawiki/extensions/CirrusSearch mediawiki/extensions/Cite mediawiki/extensions/CodeEditor mediawiki/extensions/Echo mediawiki/extensions/EducationProgram mediawiki/extensions/Elastica mediawiki/extensions/EventLogging mediawiki/extensions/GeoData mediawiki/extensions/GuidedTour mediawiki/extensions/JsonConfig mediawiki/extensions/LiquidThreads mediawiki/extensions/MassMessage mediawiki/extensions/MobileApp mediawiki/extensions/MobileFrontend mediawiki/extensions/PdfHandler mediawiki/extensions/PropertySuggester mediawiki/extensions/Renameuser mediawiki/extensions/Scribunto mediawiki/extensions/SiteMatrix mediawiki/extensions/SyntaxHighlight_GeSHi mediawiki/extensions/TemplateData mediawiki/extensions/TimedMediaHandler mediawiki/extensions/TitleBlacklist mediawiki/extensions/UniversalLanguageSelector mediawiki/extensions/UserMerge mediawiki/extensions/VisualEditor mediawiki/extensions/WikiEditor mediawiki/extensions/Wikibase mediawiki/extensions/WikibaseLexeme mediawiki/extensions/WikibaseMediaInfo mediawiki/extensions/WikibaseQualityConstraints mediawiki/extensions/WikimediaBadges mediawiki/extensions/WikimediaEvents mediawiki/extensions/ZeroBanner mediawiki/extensions/ZeroPortal mediawiki/extensions/cldr \
	--run=phpunit

(On first run, one has to remove --skip-zuul and --skip-deps to populate the git repositories and install the dependencies.)

The while loop has not segfaulted so far. Will keep it running for a while.
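
The loop itself was not posted; a minimal sketch, assuming the script above was saved as reproduce.sh (the name is assumed):

while ./reproduce.sh; do
    echo "build passed, looping again"
done
# the loop stops at the first non-zero exit; any stacktrace lands in log/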


Another thing I would like to check is on which hosts the segfault actually occurs. Some instances have more segfaults than others, but that might simply be because the others do not run that specific job.

All segfaults: P8195
Counts: P8196

Status

  • I cannot reproduce locally
  • hhvm.debug.core_dump_report does not yield any core report

I could not figure out how to enable the kernel to write core files. In bash that would be ulimit -c; for the Docker container that might be via docker run --ulimit XXXX, with XXXX still to be figured out :/
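
A likely candidate, stated here as an assumption rather than something verified in this thread, is docker run --ulimit core=-1 (unlimited core size), combined with the host's kernel.core_pattern, which is global rather than per-container:

$ docker run --ulimit core=-1 --rm --entrypoint=bash docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2 -c 'ulimit -c'
unlimited
$ cat /proc/sys/kernel/core_pattern    # on the host; containers inherit it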

At least when issuing kill -SIGSEGV $(pidof php) hhvm yields:

Core dumped: Segmentation fault
Stack trace in /workspace/log/stacktrace.12859.log

The hhvm.ini settings from https://gerrit.wikimedia.org/r/#/c/496168/ thus work properly.

I then have a 1.6GB core file in the current directory of the process (/workspace/src/core).

I also edited my command above (T216689#5022384), which had used a php7.2 container instead of HHVM. I ran it again tonight with the hhvm container, but no segfault so far :/

I could not reproduce on my machine :-/

I went looking for differences in the HHVM package between Jessie and Stretch:

diff -u -r jessie/changelog stretch/changelog
--- jessie/changelog	2018-03-31 21:12:14.000000000 +0200
+++ stretch/changelog	2018-05-08 11:42:37.000000000 +0200
@@ -1,3 +1,23 @@
+hhvm (3.18.5+dfsg-1+wmf8+deb9u1) stretch-security; urgency=medium
+
+  * Backport security fixes from 3.21.11
+
+ -- Moritz Muehlenhoff <moritz@wikimedia.org>  Tue, 08 May 2018 09:42:37 +0000
+
+hhvm (3.18.5+dfsg-1+wmf7+deb9u1) stretch-wikimedia; urgency=medium
+
+  * Bump version to wmf7 to align with jessie builds
+  * Update memcached module to set MEMC_VAL_COMPRESSION_ZLIB flag
+    (Bug: T184854)
+
+ -- Moritz Muehlenhoff <moritz@wikimedia.org>  Tue, 10 Apr 2018 10:15:28 +0000
+
+hhvm (3.18.5+dfsg-1+wmf5+deb9u1) stretch-wikimedia; urgency=medium
+
+  * Rebuild for stretch-security
+
+ -- Moritz Muehlenhoff <moritz@wikimedia.org>  Tue, 03 Apr 2018 06:38:42 +0000
+
 hhvm (3.18.5+dfsg-1+wmf5) jessie-wikimedia; urgency=medium
 
   * CVE-2018-6334
Only in stretch/patches: 0b1b18312f1d2f8e1c921c684cdb522ab2c47770.patch
Only in stretch/patches: 2255701d3f92e855d92332cda624bdaecd38e907.patch
Only in stretch/patches: f3596e417629c1209d4ab6b0345a34a7cae9d47e.patch
Only in stretch/patches: memcached-compat.patch
diff -u -r jessie/patches/series stretch/patches/series
--- jessie/patches/series	2018-03-31 21:12:14.000000000 +0200
+++ stretch/patches/series	2018-05-08 11:42:37.000000000 +0200
@@ -17,3 +17,7 @@
 d4c4c2ac6d546f3643c669a4f1f6107b67f15c5b.patch
 f0ed24a119698d200e43dac0f683f8d38d590894.patch
 CVE-2018-6334.patch
+memcached-compat.patch
+0b1b18312f1d2f8e1c921c684cdb522ab2c47770.patch
+f3596e417629c1209d4ab6b0345a34a7cae9d47e.patch
+2255701d3f92e855d92332cda624bdaecd38e907.patch

They do not seem too suspicious:

[security] Bug #73957: signed integer conversion in imagescale()
[security][CVE-2018-6335] Fix potential crash in HTTP2 padding handling
[security] [CVE-2018-5711] Sec Bug #75571: Potential infinite loop in gdImageCreateFromGifCtx
[PATCH] Update memcached to set MEMC_VAL_COMPRESSION_ZLIB flag

Change 496392 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: look and capture 'core' file in Quibble

https://gerrit.wikimedia.org/r/496392

Mentioned in SAL (#wikimedia-releng) [2019-03-14T10:49:23Z] <hashar> triggering tests for all ContentTranslation pending changes # T216689

Mentioned in SAL (#wikimedia-releng) [2019-03-14T12:12:14Z] <hashar> Updated quibble-vendor-mysql-hhvm-docker to hopefully allow core dumps and capture them | https://gerrit.wikimedia.org/r/#/c/integration/config/+/496392/3 # T216689

Mentioned in SAL (#wikimedia-releng) [2019-03-14T12:31:54Z] <hashar> Updated quibble-vendor-mysql-hhvm-docker to hopefully allow core dumps and capture them | https://gerrit.wikimedia.org/r/#/c/integration/config/+/496392/4 # T216689

Something I noticed is Docker got upgraded:

Start-Date: 2019-02-15  11:06:59
Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install docker-ce=18.06.2~ce~3-0~debian
Upgrade: docker-ce:amd64 (17.12.1~ce-0~debian, 18.06.2~ce~3-0~debian)
End-Date: 2019-02-15  11:07:07

I have tweaked the job quibble-vendor-mysql-hhvm-docker to allow capturing core files. The captured reports mention Lua.

Crash report and php (really hhvm) core file:

Job (kept forever):  https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/39681/
Slave:               integration-slave-docker-1040
Container:           docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2
Core file:           https://people.wikimedia.org/~hashar/T216689/core.606eb29eab46.php.2353.1552570410.bz2
mw-debug-cli.log:
[PHPUnitCommand] Start test Scribunto_LuaUstringLibraryPureLuaTest::testLua with data set #3
Parser: using preprocessor: Preprocessor_Hash
[Scribunto] Scribunto_LuaStandaloneInterpreter::__construct: creating interpreter: 'exec' '/bin/sh' '/workspace/src/extensions/Scribunto/includes/engines/LuaStandalone/lua_ulimit.sh' '30' '31' '48828' ''\''/workspace/src/extensions/Scribunto/includes/engines/LuaStandalone/binaries/lua5_1_5_linux_64_generic/lua'\'' '\''/workspace/src/extensions/Scribunto/includes/engines/LuaStandalone/mw_main.lua'\'' '\''/workspace/src/extensions/Scribunto/includes'\'' '\''983'\'' '\''8'\'''

[gitinfo] Computed cacheFile=/workspace/src/gitinfo.json for /workspace/src
[gitinfo] Cache incomplete for /workspace/src
[DBQuery] SELECT  COUNT(*)  FROM `unittest_user_groups`    WHERE ug_group = 'sysop' AND (ug_expiry IS NULL OR ug_expiry >= '20190314133330')  LIMIT 1

Prepare:

wget https://people.wikimedia.org/~hashar/T216689/core.606eb29eab46.php.2353.1552570410.bz2
bunzip2 core.606eb29eab46.php.2353.1552570410.bz2
docker run --rm -it --entrypoint=bash --user=root --volume $(pwd):/coredump docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2

Grab help from https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/How_to_debug_HHVM

In the container:
apt update && apt -y install gdb hhvm-dbg

Might want to commit that container to avoid having to download hhvm-dbg again later on: docker commit xxxxx t216689-debug

Anyway:

gdb /usr/bin/hhvm /coredump/core.606eb29eab46.php.2353.1552570410
Reading symbols from /usr/bin/hhvm...Reading symbols from /usr/lib/debug/.build-id/bd/b5d44bc260305fe8943ed9341df7c24b9c73df.debug...done.
[New LWP 2354]
[New LWP 2353]
[New LWP 13041]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `php tests/phpunit/phpunit.php --debug-tests --testsuite extensions --exclude-gr'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, 
    start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
813	pthread_create.c: No such file or directory.
[Current thread is 1 (Thread 0x7f55614be3c0 (LWP 2354))]

(gdb) bt
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, 
    start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
#1  0x00007f556f461bb2 in timer_helper_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/timer_routines.c:120
#2  0x00007f557214a494 in start_thread (arg=0x7f55614be3c0) at pthread_create.c:456
#3  0x00007f556aeebacf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f55614be3c0, max=<optimized out>)
    at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#4  0x0000000000000000 in ?? ()
(gdb) thread apply all bt

Thread 3 (Thread 0x7f550dbfe700 (LWP 13041)):
#0  0x00007f556aeebac1 in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f550dbfe700, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#1  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f557390d3c0 (LWP 2353)):
#0  HPHP::MixedArray::Elm::setIntKey (h=<optimized out>, k=4, this=0x7f550f587740) at ./hphp/runtime/base/mixed-array.h:107
#1  HPHP::MixedArray::nextInsertWithRef (this=this@entry=0x7f550f5876c0, data=...) at ./hphp/runtime/base/mixed-array.cpp:1178
#2  0x000000000213cecb in HPHP::MixedArray::ArrayMergeGeneric (ret=0x7f550f5876c0, elems=<optimized out>) at ./hphp/runtime/base/mixed-array.cpp:1699
#3  0x000000000213af7e in HPHP::ArrayData::merge (elms=<optimized out>, this=<optimized out>) at ./hphp/runtime/base/array-data-defs.h:420
#4  HPHP::Array::mergeImpl (this=0x7ffdec7acbf0, data=<optimized out>) at ./hphp/runtime/base/type-array.cpp:332
#5  0x00000000017833ba in HPHP::f_array_merge (numArgs=2, array1=..., array2=..., args=...) at ./hphp/runtime/ext/array/ext_array.cpp:531
#6  0x00000000011dfaba in HPHP::Native::callFuncTVImpl (f=<optimized out>, GP=GP@entry=0x7ffdec7acd20, GP_count=<optimized out>, SIMD=SIMD@entry=0x7ffdec7acce0, SIMD_count=SIMD_count@entry=0) at ./hphp/runtime/vm/native-func-caller.h:1564
#7  0x000000000216e746 in HPHP::Native::callFunc<true> (func=0x7f5561e75e80, ctx=<optimized out>, args=<optimized out>, numNonDefault=<optimized out>, ret=...) at ./hphp/runtime/vm/native.cpp:200
#8  0x00000000021d04ee in HPHP::iopFCallBuiltin (id=<optimized out>, numNonDefault=..., numArgs=...) at ./hphp/runtime/vm/bytecode.cpp:4872
#9  HPHP::iopWrapper (pc=<optimized out>, fn=<optimized out>, op=<optimized out>) at ./hphp/runtime/vm/bytecode.cpp:6654
#10 HPHP::dispatchImpl<false> () at ./hphp/runtime/vm/bytecode.cpp:6905
#11 0x0000000000ea591f in HPHP::exception_handler<void (*)()> (action=<optimized out>) at ./hphp/runtime/vm/unwind-inl.h:30
#12 0x0000000000ea611c in HPHP::enterVMCustomHandler<HPHP::enterVM(HPHP::ActRec*, Action) [with Action = HPHP::ExecutionContext::invokeFunc(const HPHP::Func*, const HPHP::Variant&, HPHP::ObjectData*, HPHP::Class*, HPHP::VarEnv*, HPHP::StringData*, HPHP::ExecutionContext::InvokeFlags, bool)::<lambda(HPHP::ActRec*)>::<lambda()>]::<lambda()> > (action=..., ar=0x7f555bbbffc0) at ./hphp/runtime/base/execution-context.cpp:1574
#13 HPHP::enterVM<HPHP::ExecutionContext::invokeFunc(const HPHP::Func*, const HPHP::Variant&, HPHP::ObjectData*, HPHP::Class*, HPHP::VarEnv*, HPHP::StringData*, HPHP::ExecutionContext::InvokeFlags, bool)::<lambda(HPHP::ActRec*)>::<lambda()> > (action=..., 
    ar=0x7f555bbbffc0) at ./hphp/runtime/base/execution-context.cpp:1580
#14 HPHP::ExecutionContext::<lambda(HPHP::ActRec*)>::operator() (ar=0x7f555bbbffc0, __closure=<synthetic pointer>) at ./hphp/runtime/base/execution-context.cpp:1651
#15 HPHP::ExecutionContext::invokeFuncImpl<HPHP::ExecutionContext::invokeFunc(const HPHP::Func*, const HPHP::Variant&, HPHP::ObjectData*, HPHP::Class*, HPHP::VarEnv*, HPHP::StringData*, HPHP::ExecutionContext::InvokeFlags, bool)::<lambda(HPHP::TypedValue&)>, HPHP::ExecutionContext::invokeFunc(const HPHP::Func*, const HPHP::Variant&, HPHP::ObjectData*, HPHP::Class*, HPHP::VarEnv*, HPHP::StringData*, HPHP::ExecutionContext::InvokeFlags, bool)::<lambda(HPHP::ActRec*, HPHP::TypedValue&)>, HPHP::ExecutionContext::invokeFunc(const HPHP::Func*, const HPHP::Variant&, HPHP::ObjectData*, HPHP::Class*, HPHP::VarEnv*, HPHP::StringData*, HPHP::ExecutionContext::InvokeFlags, bool)::<lambda(HPHP::ActRec*)> > (doEnterVM=..., doInitArgs=..., doStackCheck=..., useWeakTypes=false, 
    invName=0x0, argc=<optimized out>, cls=0x0, thiz=0x0, f=0x7f555b561890, this=0x7f5562d52010) at ./hphp/runtime/base/execution-context.cpp:1542
#16 HPHP::ExecutionContext::invokeFunc (this=this@entry=0x7f5562d52010, f=0x7f555b561890, args_=..., thiz=thiz@entry=0x0, cls=cls@entry=0x0, varEnv=0x7f5562d57a90, invName=0x0, flags=HPHP::ExecutionContext::InvokePseudoMain, useWeakTypes=false)
    at ./hphp/runtime/base/execution-context.cpp:1655
#17 0x0000000000ea6357 in HPHP::ExecutionContext::invokeUnit (this=0x7f5562d52010, unit=0x7f555b56ab40) at ./hphp/runtime/base/execution-context.cpp:1272
#18 0x00000000021a0a7c in HPHP::invoke_file_impl (currentDir=0x224b129 "", once=true, path=..., res=...) at ./hphp/runtime/base/builtin-functions.cpp:909
#19 HPHP::invoke_file (s=..., once=once@entry=true, currentDir=currentDir@entry=0x224b129 "") at ./hphp/runtime/base/builtin-functions.cpp:922
#20 0x00000000021a0d64 in HPHP::include_impl_invoke (file=..., once=once@entry=true, currentDir=currentDir@entry=0x224b129 "") at ./hphp/runtime/base/builtin-functions.cpp:948
#21 0x0000000000f0d24b in HPHP::hphp_invoke (context=0x7f5562d52010, cmd="tests/phpunit/phpunit.php", func=func@entry=false, funcParams=..., funcRet=..., reqInitFunc="", reqInitDoc=..., error=<optimized out>, errorMsg=..., once=<optimized out>, 
    warmupOnly=<optimized out>, richErrorMsg=<optimized out>) at ./hphp/runtime/base/program-functions.cpp:2254
#22 0x0000000000f0da5a in HPHP::hphp_invoke_simple (filename="tests/phpunit/phpunit.php", warmupOnly=warmupOnly@entry=false) at ./hphp/runtime/base/program-functions.cpp:2208
#23 0x0000000000f19cc8 in HPHP::execute_program_impl (argc=argc@entry=13, argv=argv@entry=0x7ffdec7ae8c0) at ./hphp/runtime/base/program-functions.cpp:1827
#24 0x0000000000f1ad7e in HPHP::execute_program (argc=13, argv=argv@entry=0x7ffdec7ae8c0) at ./hphp/runtime/base/program-functions.cpp:1148
#25 0x0000000000e9af38 in HPHP::emulate_zend (argc=argc@entry=9, argv=argv@entry=0x7ffdec7aef08) at ./hphp/runtime/base/emulate-zend.cpp:283
#26 0x0000000000a70f33 in main (argc=<optimized out>, argv=0x7ffdec7aef08) at ./hphp/hhvm/main.cpp:65

Thread 1 (Thread 0x7f55614be3c0 (LWP 2354)):
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
#1  0x00007f556f461bb2 in timer_helper_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/timer_routines.c:120
#2  0x00007f557214a494 in start_thread (arg=0x7f55614be3c0) at pthread_create.c:456
#3  0x00007f556aeebacf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f55614be3c0, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#4  0x0000000000000000 in ?? ()
(gdb)

Some searching on that error location brought me to https://bugs.mysql.com/bug.php?id=82886 which pointed me to https://sourceware.org/bugzilla/show_bug.cgi?id=20116. TL;DR: a use-after-free bug in libpthread: if the new thread exits before the calling thread completes pthread_create(), the calling thread could wind up accessing a freed data structure.

That does sound like a possibility here.

The bug wasn't present in 2.19 (jessie's version of libc6) but is in 2.24 (stretch's version). Debian's libc6 package changelog mentions the fix being backported to stretch in 2.24-11+deb9u4; I see mwmaint1002 still has 2.24-11+deb9u3 while mwdebug1002 has 2.24-11+deb9u4. What version does your Docker image have?

Thank you Brad!

The container is docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2, which I built on 03/13 at 14:00 UTC (i.e. yesterday). It has:

libc6  2.24-11+deb9u3
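
That can be checked directly against the image (a sketch):

$ docker run --rm --entrypoint=dpkg-query docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2 -W -f '${Version}\n' libc6
2.24-11+deb9u3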

And indeed some packages need to be upgraded:

base-files/stable 9.9+deb9u8 amd64 [upgradable from: 9.9+deb9u6]
chromedriver/stable 72.0.3626.122-1~deb9u1 amd64 [upgradable from: 71.0.3578.80-1~deb9u1]
chromium/stable 72.0.3626.122-1~deb9u1 amd64 [upgradable from: 71.0.3578.80-1~deb9u1]
chromium-driver/stable 72.0.3626.122-1~deb9u1 amd64 [upgradable from: 71.0.3578.80-1~deb9u1]
gpgv/stable 2.1.18-8~deb9u4 amd64 [upgradable from: 2.1.18-8~deb9u3]
libc-bin/stable 2.24-11+deb9u4 amd64 [upgradable from: 2.24-11+deb9u3]
libc-l10n/stable 2.24-11+deb9u4 all [upgradable from: 2.24-11+deb9u3]
libc6/stable 2.24-11+deb9u4 amd64 [upgradable from: 2.24-11+deb9u3]
libcups2/stable 2.2.1-8+deb9u3 amd64 [upgradable from: 2.2.1-8+deb9u2]
libcurl3-gnutls/stable,stable 7.52.1-5+deb9u9 amd64 [upgradable from: 7.52.1-5+deb9u8]
libopenjp2-7/stable 2.1.2-1.1+deb9u3 amd64 [upgradable from: 2.1.2-1.1+deb9u2]
libpq5/stable 9.6.11-0+deb9u1 amd64 [upgradable from: 9.6.10-0+deb9u1]
libssh-gcrypt-4/stable 0.7.3-2+deb9u2 amd64 [upgradable from: 0.7.3-2+deb9u1]
libssl1.0.2/stable 1.0.2r-1~deb9u1 amd64 [upgradable from: 1.0.2q-1~deb9u1]
libsystemd0/stable 232-25+deb9u9 amd64 [upgradable from: 232-25+deb9u8]
libudev1/stable 232-25+deb9u9 amd64 [upgradable from: 232-25+deb9u8]
libwayland-client0/stable 1.12.0-1+deb9u1 amd64 [upgradable from: 1.12.0-1]
libwayland-cursor0/stable 1.12.0-1+deb9u1 amd64 [upgradable from: 1.12.0-1]
libwayland-server0/stable 1.12.0-1+deb9u1 amd64 [upgradable from: 1.12.0-1]
locales/stable 2.24-11+deb9u4 all [upgradable from: 2.24-11+deb9u3]
multiarch-support/stable 2.24-11+deb9u4 amd64 [upgradable from: 2.24-11+deb9u3]
postgresql-9.6/stable 9.6.11-0+deb9u1 amd64 [upgradable from: 9.6.10-0+deb9u1]
postgresql-client-9.6/stable 9.6.11-0+deb9u1 amd64 [upgradable from: 9.6.10-0+deb9u1]

And another trace : https://people.wikimedia.org/~hashar/T216689/core.ad63e66644d3.php.2314.1552572638.bz2

[New LWP 2315]
[New LWP 2314]

warning: Unexpected size of section `.reg-xstate/2315' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `php tests/phpunit/phpunit.php --debug-tests --testsuite extensions --exclude-gr'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/2315' in core file.
#0  0x00007fa234ed0c5e in __pthread_create_2_1 (newthread=newthread@entry=0x7fa226096e18, attr=attr@entry=0x7fa215f22998, 
    start_routine=start_routine@entry=0x7fa2321e7c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
813	pthread_create.c: No such file or directory.
[Current thread is 1 (Thread 0x7fa22609b3c0 (LWP 2315))]

(gdb) bt
#0  0x00007fa234ed0c5e in __pthread_create_2_1 (newthread=newthread@entry=0x7fa226096e18, attr=attr@entry=0x7fa215f22998, 
    start_routine=start_routine@entry=0x7fa2321e7c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
#1  0x00007fa2321e7bb2 in timer_helper_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/timer_routines.c:120
#2  0x00007fa234ed0494 in start_thread (arg=0x7fa22609b3c0) at pthread_create.c:456
#3  0x00007fa22dc71acf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7fa22609b3c0, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#4  0x0000000000000000 in ?? ()

So again within pthread_create.

I need the base container docker-registry.wikimedia.org/wikimedia-stretch to be rebuilt, which is T216384#5024505. Then we can rebuild the chain of containers leading to docker-registry.wikimedia.org/releng/quibble-stretch-hhvm:0.0.28-2.

Mentioned in SAL (#wikimedia-releng) [2019-03-14T16:34:17Z] <hashar> rollback quibble-vendor-mysql-hhvm-docker job to no more capture core files, we have enough and a good lead ( reverting https://gerrit.wikimedia.org/r/#/c/integration/config/+/496392/ ) # T216689

Change 496608 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: rebuild ci-stretch for debian/libc6 update

https://gerrit.wikimedia.org/r/496608

Change 496608 merged by jenkins-bot:
[integration/config@master] docker: rebuild ci-stretch for debian/libc6 update

https://gerrit.wikimedia.org/r/496608

Change 496620 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: update quibble HHVM container for libc update

https://gerrit.wikimedia.org/r/496620

Mentioned in SAL (#wikimedia-releng) [2019-03-14T21:21:20Z] <hashar> Updated quibble-vendor-mysql-hhvm-docker with latest libc6 hopefully fixing HHVM segfault within libpthread # T216689

Mentioned in SAL (#wikimedia-releng) [2019-03-14T21:25:45Z] <hashar> Manually triggered tests for 12 ContentTranslation changes that had label:verified=-1 # T216689

Theoretically the issue should be fixed; at least libc has been upgraded, and HHVM should no longer trigger the libpthread bug.

The job has been updated around 2019-03-14T21:21:20Z

I have triggered builds for 12 ContentTranslation changes that were failing, and quibble-vendor-mysql-hhvm-docker seems to work for them.

Let's check and monitor, and we can probably finally mark this task as resolved.

Also, I would like to thank the language team (and everyone affected) for their patience on this topic. It took a while to actually jump on it and, as one can see from all my comments, my journey on this has been quite long.

Thank you @Anomie for digesting the cryptic HHVM stacktrace and pointing at the proper solution (T216689#5024449). All of that in less than 17 minutes :]

santhosh added a comment (edited). Mar 15 2019, 9:22 AM

Is there a reason why the error is still happening? Two examples from the past hour:

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/39894/console
https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/39896/console

14:42:30 Warning: Destructor threw an object exception: exception 'Wikimedia\Rdbms\DBAccessError' with message 'Database access has been disabled.' in /workspace/src/includes/libs/rdbms/loadbalancer/LoadBalancer.php:1099
14:42:30 Stack trace:
14:42:30 #0 /workspace/src/includes/libs/rdbms/loadbalancer/LoadBalancer.php(924): Wikimedia\Rdbms\LoadBalancer->reallyOpenConnection()
14:42:30 #1 /workspace/src/includes/libs/rdbms/loadbalancer/LoadBalancer.php(872): Wikimedia\Rdbms\LoadBalancer->openLocalConnection()
14:42:30 #2 /workspace/src/includes/libs/rdbms/loadbalancer/LoadBalancer.php(748): Wikimedia\Rdbms\LoadBalancer->openConnection()
14:42:30 #3 /workspace/src/includes/GlobalFunctions.php(2654): Wikimedia\Rdbms\LoadBalancer->getConnection()
14:42:30 #4 /workspace/src/extensions/CentralAuth/includes/CentralAuthHooks.php(1509): wfGetDB()
14:42:30 #5 /workspace/src/includes/Hooks.php(174): CentralAuthHooks::onUnitTestsBeforeDatabaseTeardown()
14:42:30 #6 /workspace/src/includes/Hooks.php(202): Hooks::callHook()
14:42:30 #7 /workspace/src/tests/phpunit/MediaWikiTestCase.php(1359): Hooks::run()
14:42:30 #8 /workspace/src/tests/phpunit/bootstrap.php(20): MediaWikiTestCase::teardownTestDB()
14:42:30 #9 (): MediaWikiPHPUnitBootstrap->__destruct()
14:42:30 #10 {main}
14:42:30 Traceback (most recent call last):
14:42:30   File "/usr/local/bin/quibble", line 11, in <module>
14:42:30     load_entry_point('quibble==0.0.0', 'console_scripts', 'quibble')()
14:42:30   File "/usr/local/lib/python3.5/dist-packages/quibble/cmd.py", line 558, in main
14:42:30     cmd.execute()
14:42:30   File "/usr/local/lib/python3.5/dist-packages/quibble/cmd.py", line 530, in execute
14:42:30     junit_file=junit_db_file)
14:42:30   File "/usr/local/lib/python3.5/dist-packages/quibble/test.py", line 196, in run_phpunit_database
14:42:30     run_phpunit(*args, **kwargs)
14:42:30   File "/usr/local/lib/python3.5/dist-packages/quibble/test.py", line 191, in run_phpunit
14:42:30     subprocess.check_call(cmd, cwd=mwdir, env=phpunit_env)
14:42:30   File "/usr/lib/python3.5/subprocess.py", line 271, in check_call
14:42:30     raise CalledProcessError(retcode, cmd)
14:42:30 subprocess.CalledProcessError: Command '['php', 'tests/phpunit/phpunit.php', '--debug-tests', '--testsuite', 'extensions', '--group', 'Database', '--exclude-group', 'Broken,ParserFuzz,Stub', '--log-junit', '/workspace/log/junit-db.xml']' returned non-zero exit status 1
14:42:30 INFO:backend.MySQL:Terminating MySQL

@santhosh that is unrelated to this task.

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/39894/console is a PHPUnit failure:

There was 1 failure:
1) EchoDiscussionParserTest::testAnnotation with data set #4

Got filed as T218388

https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/39896/console also fails due to PHPUnit tests, and also in Echo, so it might be related to the above:

There were 8 failures:
1) EchoDiscussionParserTest::testSigningDetection with data set #x
...

The trace you mention is a warning. There is some code in the test suite or in CentralAuth that could use a fix. That is worth filing another task, I guess ;-]

The job has been updated around 2019-03-14T21:21:20Z

The last failing build was #39845, on March 14th at 21:16:00Z.

The libc package update fixed it; the container was simply faulty.

  • rollback the Quibble HHVM jobs to the jessie based container?

This makes no sense.

The container change was the actual reason, so it totally made sense. I did try to roll back, but ended up hitting an entirely different problem due to the outdated Chromium.

Thanks @hashar and @Anomie for fixing this :)

hashar closed this task as Resolved. Mar 18 2019, 12:01 PM
hashar claimed this task.

The job has been updated around 2019-03-14T21:21:20Z

The last failing build was #39845, on March 14th at 21:16:00Z.
The libc package update fixed it; the container was simply faulty.

And this morning, build 39845 is still the last one to have exited with returned non-zero exit status -11. So the libc6 upgrade definitely fixed it. Thank you @Anomie

I have published a quick post about the debugging session I did last week. That might be helpful as a reference in the future:
J152 : Blog Post: Help my CI job fails with exit status -11.

Thank you everyone who has shown a token of appreciation on this task and elsewhere. Much appreciated.

Change 496392 abandoned by Hashar:
jjb: Quibble jobs to capture core files if any

Reason:
That was a one off effort to debug a segfault T216689

https://gerrit.wikimedia.org/r/496392

Change 496620 abandoned by Hashar:
jjb: update quibble HHVM container for libc update

Reason:
obsolete :)

https://gerrit.wikimedia.org/r/496620

mmodell changed the subtype of this task from "Task" to "Production Error". Wed, Aug 28, 11:07 PM