Description
Details
Event Timeline
The problem reported in T161698 blocks merging any of Wikibase patches, so this is becoming more pressing for Wikidata folk. Is there anything on the Wikibase side we could do to help fixing the issue?
We should try to make Scribunto's test suite more performant, or at least skip the tests that make the runs take so long.
As I said earlier, it seems this Scribunto failure is a symptom of a problem, not the cause. The high-level cause seems to be that the test suite as a whole is exceeding a memory limitation of some sort. What limitation that is, exactly, and whether there's a configuration setting that could be changed or a leak somewhere that could be fixed is a matter for further investigation.
Mentioned in SAL (#wikimedia-releng) [2017-04-03T20:39:03Z] <hashar> Nodepool: holding instance ci-trusty-wikimedia-597386 in an attempt debug Wikibase/Scribunto memory usage exploding T125050
I manually rebuilt mwext-testextension-php55-composer-trusty for Wikibase, instructed Nodepool not to delete the instance (ssh labnodepool1001.eqiad.wmnet nodepool hold <instance ID>), and connected to it while the test was running.
The vmstat at the end of the build. Memory metrics are in megabytes, lines one second apart:
procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
 r  b  swpd  free  buff cache   si   so    bi    bo    in    cs us sy id wa st
 2  0     0   602    47   957    0    0     0   340  2777  6007 49  3 48  1  0
 1  0     0   591    47   957    0    0     0   288  2806  6929 48  4 48  1  0
 1  0     0   571    47   957    0    0    48    32  4147  9573 48  4 47  1  0
 2  0     0   527    47   957    0    0     0    36  6498  8514 50  5 45  0  0
 1  0     0   562    47   959    0    0     0   372  7061 11920 46 10 43  1  0
 1  0     0   562    47   959    0    0     0    20  1248  2536 48  3 49  0  0
 1  0     0   562    47   959    0    0     0     0   365   116 48  2 50  0  0
 1  0     0   562    47   959    0    0     0  7296   483   954 49  3 48  1  0
 1  0     0   562    47   959    0    0     0  9748   488   889 50  2 48  1  0
 1  0     0  2405    47   959    0    0     4   404  1136  1913 36  8 55  0  0
 0  0     0  2396    47   962    0    0   152  2792  3631  6457 35  5 59  1  0
 1  0     0  2389    47   963    0    0  1332    60  1429  2521 42  4 54  1  0
 1  0     0  2387    47   964    0    0    32   956  1643  2917 73  3 24  0  0
 1  0     0  2379    47   966    0    0    52  1468  2949  5027 42  6 52  1  0
 0  0     0  2379    47   966    0    0     4     0   834  1181 51  1 48  0  0
 0  0     0  2379    47   966    0    0     0     0   119   134  0  0 100 0  0
 0  0     0  2379    47   966    0    0     0     0   126   343  1  0 99  0  0
 0  0     0  2380    47   966    0    0     0     0   105   116  0  0 100 0  0
 0  0     0  2380    47   966    0    0     0   368   101   118  0  0 100 0  0
 0  0     0  2380    47   966    0    0     0    56   271   364  0  1 99  0  0
The instance had, at worst, ~500MB of free memory and ~950MB of cache (i.e. file system cache, which can be reclaimed by the OS).
A snapshot of top:
top - 20:46:30 up 1:03, 3 users, load average: 2.03, 1.78, 0.90
Tasks: 142 total, 3 running, 139 sleeping, 0 stopped, 0 zombie
%Cpu(s): 39.8 us, 11.6 sy, 0.0 ni, 48.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem: 4048100 total, 3320404 used, 727696 free, 48352 buffers
KiB Swap: 0 total, 0 used, 0 free. 979012 cached Mem

  PID USER    PR NI    VIRT    RES   SHR S %CPU %MEM   TIME+ COMMAND
 3910 jenkins 20  0 2123364 1.619g 14476 R 81.1 41.9 6:25.79 php5 -dzend.enable_gc=0 phpunit.php ...
 1482 mysql   20  0  624248 169096 10064 S  1.0  4.2 0:20.91 /usr/sbin/mysqld ...
24324 jenkins 20  0    7868   1156   740 R  1.0  0.0 0:00.03 /home/jenkins/workspace/mwext-testextension-php55-composer-trusty/src/extensions/Scribunto/engines/LuaStandalone/binaries/l+
 2149 jenkins 20  0 2419876 212784 15152 S  0.7  5.3 0:25.05 java -jar slave.jar
23810 jenkins 20  0  105776   2044  1044 S  0.7  0.1 0:03.00 sshd: jenkins@pts/1
    7 root    20  0       0      0     0 S  0.3  0.0 0:03.43 [rcu_sched]
So I think that rules out a lack of memory on the instance, unless the Scribunto fork actually siphons up all the memory somehow, or is limited somehow.
Next thing. On Jenkins we have a monitoring system named JavaMelody which can be used to inspect an instance, especially the process list, e.g. via a link like https://integration.wikimedia.org/ci/monitoring/nodes/ci-trusty-wikimedia-597386?part=processes which is essentially the same as ps.
Attached are the PDF output sorted by pid and resident memory (rss):
What this shows is that at the point I took the capture, there were several Scribunto/engines/LuaStandalone/binaries/lua5_1_5_linux_64_generic/lua processes running. I guess they are on hold somehow. The lua_ulimit.sh script does not appear, though.
Last thing, I confirmed that the memory_usage() shown in the debug log is consistent with the value shown in top.
https://gerrit.wikimedia.org/r/#/c/345628/1
https://integration.wikimedia.org/ci/job/mwext-testextension-php55-composer-trusty/259/console
153) LuaStandalone: LuaWikibaseLibraryTests[38]: ... proc_open(): fork failed - Cannot allocate memory
T161698: LuaStandalone: LuaWikibaseLibraryTests - proc_open(): fork failed - Cannot allocate memory
This is failing 5x in a row on this and other commits. It has made merging in the WikibaseQualityConstraints repo impossible.
lua_ulimit.sh uses exec, so it's expected to appear in the process list just very briefly.
@hashar Could we try raising the limit and see whether that makes a difference?
Gave it a try directly on the instance, running solely the Scribunto tests, after the memory limit got raised to 70M, which ends up being ulimit -v 68359.
WORKSPACE=/home/jenkins/workspace/mwext-testextension-php55-composer-trusty \
    php tests/phpunit/phpunit.php extensions/Scribunto/
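As a side note on where the 68359 figure comes from: the 70M limit appears to be decimal megabytes, while `ulimit -v` takes KiB. A quick sanity check of that conversion (the rounding behavior shown is an assumption):

```shell
# 70 MB (decimal) expressed in KiB for `ulimit -v`:
# 70 * 1000 * 1000 bytes, divided by 1024 bytes per KiB, truncated.
echo $((70 * 1000 * 1000 / 1024))
# → 68359
```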
The debug log shows 131.0M and PHPUnit says Memory: 149.50MB. There is no spam of "Scribunto_LuaStandaloneInterpreter::terminate: terminating" at the end.
Ran the whole suite again with the ulimit in place but without --log-junit, which is memory heavy. The whole suite passed just fine and there were no fork() issues.
The debug log reports 912.2M and PHPUnit reports 924.50MB.
So my best guess is that:
- our code invokes proc_open()
- Zend invokes fork()
- the fork() requires the parent's memory to be accounted for a second time, which the system denies because it would overflow the available memory
@hashar If we can't allocate more resources, can we disable --log-junit for just this job for now?
--log-junit is what fails the build so no.
What fork() does is try to account for the caller's whole virtual memory. In the trace above the PHP process had 1.8G of virtual memory while the system had 562M free + 959M cached. That is not enough reclaimable memory to fit the whole 1.8G, even though it is never actually going to be used.
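Those numbers can be checked directly on an instance; a small sketch using the standard Linux /proc interfaces:

```shell
# Current overcommit mode (0 = heuristic, 1 = always, 2 = strict accounting).
cat /proc/sys/vm/overcommit_memory

# CommitLimit/Committed_AS show the strict-accounting view; under the default
# heuristic mode the kernel applies a looser check, but a fork() that doubles
# a 1.8G address space can still be refused when free + reclaimable memory
# cannot plausibly back it.
grep -E 'MemFree|Cached|CommitLimit|Committed_AS' /proc/meminfo
```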
The low level fixes would be:
- add a swap large enough, which would trick Linux into thinking there is enough memory. The CI instances have no swap enabled and I haven't found out how to enable it in the utility we use to generate the images (diskimage-builder)
- let the Linux memory management overcommit memory https://www.kernel.org/doc/Documentation/vm/overcommit-accounting . If anything really does need more memory, the OOM killer will strike, which is usually terrible
- raise the instances' RAM from 4GB to 6GB so that the MediaWiki fork() fits (lame)
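For the overcommit option, a sketch of how it could be enabled on an instance (the sysctl.d file name is an invented example, not necessarily how the CI images end up doing it). This is a configuration fragment requiring root:

```shell
# vm.overcommit_memory=1: always overcommit, so fork() succeeds even when a
# full copy of the parent's virtual address space could not be backed by RAM.
echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/60-overcommit.conf
sysctl -p /etc/sysctl.d/60-overcommit.conf   # apply without a reboot
```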
Note to self: diskimage-builder wipes out /etc/fstab:
cat << EOF | tee /etc/fstab > /dev/null
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=${DIB_ROOT_LABEL} / ${FS_TYPE} errors=remount-ro 0 1
EOF
So there is no swap defined at all. E.g. on Trusty:
LABEL=cloudimg-rootfs / ext4 defaults 0 0
On Jessie:
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=cloudimg-rootfs / ext4 errors=remount-ro 0 1
I have no idea what to mount as swap: a /dev device, a label, or whatever else. So we can probably mkswap a flat file.
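A sketch of that flat-file approach (the 2G size and /swapfile path are guesses; requires root, shown here as a configuration sketch):

```shell
# Create a 2G flat file, restrict permissions, and format it as swap.
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Persist it across reboots via fstab.
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```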
Change 346634 had a related patch set uploaded (by Hashar):
[integration/config@master] dib: allow Linux memory overcommit on Trusty
Change 346634 merged by jenkins-bot:
[integration/config@master] dib: allow Linux memory overcommit on Trusty
The Trusty instances that run the php5 jobs now allow Linux to overcommit memory, so a fork() should work now. I triggered a job and it passed just fine. I guess you can "check php5" or CR+2 again the patches that were failing.
Seems the Wikibase-related jobs that use Scribunto are passing just fine. That does not close this task though, which is about adding Scribunto to the extension-gate.
We tried to add ArticlePlaceholder to the gated extension list in T213815 but had to revert it due to this ticket not being solved :(
I'm going to mark this as Unbreak Now!, as this has wasted so many people-hours over the past weeks/months.
Change 497574 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] gate: Add Scribunto
So this task has been Unbreak now for almost two weeks. What could WMDE help with to get this unblocked/resolved?
Hi, sorry for the delay.
- Traditionally UBN! is reserved for direct breakages. I won't quibble (pun intended) with this too much here though.
- We (RelEng) have been very backlogged by current responses to security related issues. That is nearing an end soon, but we have a lot to catch up on.
- There is a legitimate question on the gerrit change regarding how long these additional tests take to complete and how that will slow down test runs for all developers. It is an open question how long we should aim for our test runs to take, and how we keep that time down even as we add new functionality (e.g. adding new extensions to the gate). I would appreciate input on this from others if there is any.
Thanks for the answers @greg.
I understand @Addshore had valid reasons to make the priority of this the highest possible, but given what you say, I'll take the liberty of bringing it back to High for now.
You folks being busy with recent events is completely understandable of course.
I'll see in the next few days if we at WMDE from our side would be able to provide some information related to the topic of slow down due to additional tests being added.
Since this is causing outages, I guess it is really time to tackle the issue.
For the context:
We have wmf-quibble-* jobs which clone several extensions and run all their tests. That is also known as the extension-gate.
Scribunto has a set of tests to make sure it works fine with some Lua interpreters. That is really nice but takes a few minutes IIRC. When one sends a patch for another extension, we are 100% sure it is not going to affect how Lua runs.
The wmf-quibble jobs do:
- trigger for extensions that have the extension-gate template in zuul/layout.yaml
- have a list of all gated extensions in the gated_extensions variable in zuul/parameter_functions.py
- a test in integration/config ensures that both lists match
Quibble roughly runs the MediaWiki core PHPUnit extensions suite, defined as:
<testsuite name="extensions">
    <directory>structure</directory>
    <file>suites/ExtensionsTestSuite.php</file>
    <file>suites/ExtensionsParserTestSuite.php</file>
    <file>suites/LessTestSuite.php</file>
</testsuite>
It does a first run without tests flagged with @group Database:
tests/phpunit/phpunit.php --testsuite extensions --exclude-group Broken,ParserFuzz,Stub,Database
Then a second run with those Database tests:
tests/phpunit/phpunit.php --testsuite extensions --group Database --exclude-group Broken,ParserFuzz,Stub
The group exclusions are hardcoded in Quibble.
The extensions test suite adds tests from any extension's tests/phpunit directory, as well as those registered via the UnitTestsList hook.
The behavior I would like is a way to exclude a subset of the Scribunto tests unless the patch was sent for that extension. What I thought of is flagging those specific tests with a new group, such as @group system-integration. Quibble would then exclude that group from both of the commands above and add a third run:
tests/phpunit/phpunit.php --group system-integration --exclude-group Broken,ParserFuzz,Stub -- /path/to/extension/that/triggered/the/patch
Which I think would address it?
- There are 6 tests that test PHP hooks (in tests/phpunit/common), with no Lua involved. Probably more should be written.
- These take 1–2 seconds on my laptop.
- There are 67 for LuaStandalone integration (in tests/phpunit/engines/LuaStandalone).
- These take 6–7 seconds on my laptop.
- There are 46 for LuaSandbox integration (in tests/phpunit/engines/LuaSandbox).
- These take about 5 seconds on my laptop.
- Then there are the 1000+ tests in tests/phpunit/engines/LuaCommon, which are run for both LuaSandbox and LuaStandalone to ensure things work under both environments.
- The run for LuaSandbox takes around 34 seconds on my laptop, and the run for LuaStandalone takes around 163 seconds.
The last category is testing the code against subtle differences between the two environments, rather than Scribunto's own integration with the environments, so calling them "system-integration" wouldn't really be accurate.
You could probably use the existing 'LuaStandalone' group the way you proposed using the 'system-integration' group. I don't think 'system-integration' would be an accurate group name, but if we can think of one that would be accurate for "tests not worth running when testing other extensions", I wouldn't be opposed to having Scribunto's suite tag the LuaStandalone tests with that too.
Similar is FileImporter which has some slow tests ( https://gerrit.wikimedia.org/r/#/c/514065/ ), and IIRC CirrusSearch has a few slow tests we would probably want to skip as well.
I guess we can come up with a PHPUnit group name, add a new testsuite to mediawiki/core that excludes the group, and then switch the wmf-quibble jobs to use that new suite?
The "come up with a PHPUnit group name" part is what I was concerned with, once we have the name I can easily add it to Scribunto's tests. I have no opinion on the core testsuite and quibble details.
If we want an integration test group whose objective is "integration-tests-we-want-to-run-only-in-this-extension", then the name should reflect that intent right from the start, to avoid people making assumptions we then have to support.
How about "extension-isolate-integration"?
I created a new task for the creation and implementation of that group; I think it might be better to move the discussion around that there.
This somehow happened last weekend for Wikibase: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/60806/consoleFull (on Wikibase patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/524688)
Does anyone know why? The test job run time has jumped drastically, almost doubled.
Please file another task, possibly tagged with ci-test-error (WMF-deployed Build Failure) :]
I think we can revisit this now. I moved 675 Scribunto tests from "integration" type to "unit" (T230701: Migrate Scribunto to stop using MediaWikiIntegrationTestCase on unit tests). That made the Scribunto tests 40% faster (from ~2 minutes to ~1). One big issue: if we add Scribunto to the gate, since Wikibase is already there, the Wikibase client Scribunto tests will also run, which is 329 more tests (1 more minute on my laptop).
@hashar do you think we can try re-adding this extension to the gate jobs and see how the performance is?
Change 587600 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/Scribunto@master] tests: Mark LuaStandalone tests as @group Standalone
Change 587600 merged by jenkins-bot:
[mediawiki/extensions/Scribunto@master] tests: Mark non-LuaSandbox tests as @group Standalone
Mentioned in SAL (#wikimedia-releng) [2020-04-17T21:11:53Z] <James_F> Zuul: Add Scribunto to the gate after only five years of talking about it T125050