Page MenuHomePhabricator

[Task] Add Scribunto to extension-gate in CI
Closed, ResolvedPublic

Description

Add Scribunto to extension-gate in CI.

In T96264 Wikidata was added which has tests that are skipped without Scribunto.

We have tried to add it but the test runtime skyrocket causing gate delay (T126274). Will need to tweak the Scribunto config before being able to add it again.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

The problem reported in T161698 blocks merging any of Wikibase patches, so this is becoming more pressing for Wikidata folk. Is there anything on the Wikibase side we could do to help fixing the issue?

hoo raised the priority of this task from Medium to High.Apr 3 2017, 2:12 PM
hoo added a subscriber: Lydia_Pintscher.

We should try and make scribuntu more performant i.e. performance wise. Or at least skip the tests that cause the tests to take so long.

We should try and make scribuntu more performant i.e. performance wise. Or at least skip the tests that cause the tests to take so long.

Not the performance is the problem but the fact that the tests are failing is.

We should try and make scribuntu more performant i.e. performance wise. Or at least skip the tests that cause the tests to take so long.

As I said earlier, it seems this Scribunto failure is a symptom of a problem, not the cause. The high-level cause seems to be that the test suite as a whole is exceeding a memory limitation of some sort. What limitation that is, exactly, and whether there's a configuration setting that could be changed or a leak somewhere that could be fixed is a matter for further investigation.

Mentioned in SAL (#wikimedia-releng) [2017-04-03T20:39:03Z] <hashar> Nodepool: holding instance ci-trusty-wikimedia-597386 in an attempt debug Wikibase/Scribunto memory usage exploding T125050

I manually rebuild mwext-testextension-php55-composer-trusty for Wikibase. Instructed nodepool to not delete the instance (ssh labnodepool1001.eqiad.wmnet nodepool hold <ID of instance) and connected to it while the test was running.

The vmstat at the end of the build. Memory metrics are in megabytes, lines one second apart:

procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd     free     buff    cache   si   so    bi    bo   in   cs us sy id wa st
 2  0        0      602       47      957    0    0     0   340 2777 6007 49  3 48  1  0
 1  0        0      591       47      957    0    0     0   288 2806 6929 48  4 48  1  0
 1  0        0      571       47      957    0    0    48    32 4147 9573 48  4 47  1  0
 2  0        0      527       47      957    0    0     0    36 6498 8514 50  5 45  0  0
 1  0        0      562       47      959    0    0     0   372 7061 11920 46 10 43  1  0
 1  0        0      562       47      959    0    0     0    20 1248 2536 48  3 49  0  0
 1  0        0      562       47      959    0    0     0     0  365  116 48  2 50  0  0
 1  0        0      562       47      959    0    0     0  7296  483  954 49  3 48  1  0
 1  0        0      562       47      959    0    0     0  9748  488  889 50  2 48  1  0
 1  0        0     2405       47      959    0    0     4   404 1136 1913 36  8 55  0  0
 0  0        0     2396       47      962    0    0   152  2792 3631 6457 35  5 59  1  0
 1  0        0     2389       47      963    0    0  1332    60 1429 2521 42  4 54  1  0
 1  0        0     2387       47      964    0    0    32   956 1643 2917 73  3 24  0  0
 1  0        0     2379       47      966    0    0    52  1468 2949 5027 42  6 52  1  0
 0  0        0     2379       47      966    0    0     4     0  834 1181 51  1 48  0  0
 0  0        0     2379       47      966    0    0     0     0  119  134  0  0 100  0  0
 0  0        0     2379       47      966    0    0     0     0  126  343  1  0 99  0  0
 0  0        0     2380       47      966    0    0     0     0  105  116  0  0 100  0  0
 0  0        0     2380       47      966    0    0     0   368  101  118  0  0 100  0  0
 0  0        0     2380       47      966    0    0     0    56  271  364  0  1 99  0  0

The instance had at worth 500MB of free memory and ~ 950MB of cache (eg file system cache which can be reclaimed by the OS).

A snapshot of top:

top - 20:46:30 up  1:03,  3 users,  load average: 2.03, 1.78, 0.90
Tasks: 142 total,   3 running, 139 sleeping,   0 stopped,   0 zombie
%Cpu(s): 39.8 us, 11.6 sy,  0.0 ni, 48.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem:   4048100 total,  3320404 used,   727696 free,    48352 buffers
KiB Swap:        0 total,        0 used,        0 free.   979012 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                      
 3910 jenkins   20   0 2123364 1.619g  14476 R  81.1 41.9   6:25.79 php5 -dzend.enable_gc=0 phpunit.php ...
 1482 mysql     20   0  624248 169096  10064 S   1.0  4.2   0:20.91 /usr/sbin/mysqld  ...
24324 jenkins   20   0    7868   1156    740 R   1.0  0.0   0:00.03 /home/jenkins/workspace/mwext-testextension-php55-composer-trusty/src/extensions/Scribunto/engines/LuaStandalone/binaries/l+ 
 2149 jenkins   20   0 2419876 212784  15152 S   0.7  5.3   0:25.05 java -jar slave.jar                                                                                                          
23810 jenkins   20   0  105776   2044   1044 S   0.7  0.1   0:03.00 sshd: jenkins@pts/1                                                                                                          
    7 root      20   0       0      0      0 S   0.3  0.0   0:03.43 [rcu_sched]

So I think that rules out the lack of memory on the instance. Unless the fork of Scribunto actually syphon up all memory somehow OR is limited somehow.

Next thing. On Jenkins we have a monitoring system named JavaMelody which can be used to inspect an instance, specially the process list. Eg via a link like https://integration.wikimedia.org/ci/monitoring/nodes/ci-trusty-wikimedia-597386?part=processes which is really the same as ps.

Attached are the PDF output sorted by pid and resident memory (rss):

What this show is that at the point I took the capture, there were several Scribunto/engines/LuaStandalone/binaries/lua5_1_5_linux_64_generic/lua process running. I guess they are on hold somehow. Though the lua_ulimit.sh does not appear.

Last thing, I confirmed that the memory_usage() shown in the debug log is consistent with the value shown in top.

https://gerrit.wikimedia.org/r/#/c/345628/1
https://integration.wikimedia.org/ci/job/mwext-testextension-php55-composer-trusty/259/console

153) LuaStandalone: LuaWikibaseLibraryTests[38]: ...
proc_open(): fork failed - Cannot allocate memory

T161698: LuaStandalone: LuaWikibaseLibraryTests - proc_open(): fork failed - Cannot allocate memory

Is failing 5x in a row in this and other commits. Has made merging in WikibaseQualityConstraints repo impossible.

What this show is that at the point I took the capture, there were several Scribunto/engines/LuaStandalone/binaries/lua5_1_5_linux_64_generic/lua process running. I guess they are on hold somehow. Though the lua_ulimit.sh does not appear.

lua_ulimit.sh uses exec, so it's expected to appear in the process list just very briefly.

@hashar Could we try raising the limit and see whether that makes a difference?

Gave it a try directly on the instance running solely the Scribunto tests after the memory limit got raised to 70M which ends up being ulimit -v 68359.

WORKSPACE=/home/jenkins/workspace/mwext-testextension-php55-composer-trusty \
  php tests/phpunit/phpunit.php extensions/Scribunto/

The debug log shows 131.0M and PHPUnit says Memory: 149.50MB. There is no spam of "Scribunto_LuaStandaloneInterpreter::terminate: terminating" at the end.

Ran the whole suite again with the ulimit in place but without --log-junit which is memory heavy. The whole suite pass just fine and there are no fork() issues.

debug log reports 912.2M and PHPUnit reports 924.50MB

So my best guess is that:

  • our code invokes proc_open()
  • Zend invokes fork()
  • the memory is copied somehow which is denied by the system due to overflow

@hashar If we can't allocate more resources, can we disable --log-junit for just this job for now?

--log-junit is what fails the build so no.

What fork() is doing is that it tries to allocate the whole virtual memory from the caller. In a trace above the PHP process had 1.8G and the system had 562M free + 959 cached. .That is not enough reclaimable memory to fit the whole 1.8G memory, even if it is never going to be used.

The low level fixes would be:

  • add a swap large enough which would trick Linux in thinking there is enough memory. The CI instances have no swap enabled and I haven't found out how to enable in the utility we use to generate the images (diskimage-builder)
  • raise instances RAM from 4GB to 6GB to allow for MediaWiki fork() to fit in (lame)

Note to self, diskimage-builder wipe out /etc/fstab:

elements/debootstrap/install.d/15-cleanup-debootstrap
cat << EOF | tee /etc/fstab > /dev/null
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=${DIB_ROOT_LABEL} / ${FS_TYPE} errors=remount-ro 0 1
EOF

Eg there is no swap defined at all, eg on Trusty:

/etc/fstab
LABEL=cloudimg-rootfs	/	 ext4	defaults	0 0

On Jessie:

/etc/fstab
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=cloudimg-rootfs / ext4 errors=remount-ro 0 1

I have no idea what to mount as swap. A /dev a label or whatever else. So we can probably mkswap to a flat file.

Change 346634 had a related patch set uploaded (by Hashar):
[integration/config@master] dib: allow Linux memory overcommit on Trusty

https://gerrit.wikimedia.org/r/346634

Change 346634 merged by jenkins-bot:
[integration/config@master] dib: allow Linux memory overcommit on Trusty

https://gerrit.wikimedia.org/r/346634

The Trusty instances that run the php5 jobs now allows Linux to over commit the memory. So a fork() should work now. I triggered a job and it managed to pass just fine. Guess you can check php5 or CR+2 again patches that were falling.

Seems the Wikibase related jobs that use Scribunto are passing just fine. That does not close that task though which is about adding Scribunto to extension-gate.

We tried to add ArticlePlaceholder to the gated extension list in T213815 but had to revert it due to this ticket not being solved :(

Addshore raised the priority of this task from High to Unbreak Now!.Mar 19 2019, 5:13 PM

I'm going to mark this as unbreak now now, as this has wasted so many people hours over the past weeks / months

Change 497574 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] gate: Add Scribunto

https://gerrit.wikimedia.org/r/497574

So this task has been Unbreak now for almost two weeks. What could WMDE help with to get this unblocked/resolved?

So this task has been Unbreak now for almost two weeks. What could WMDE help with to get this unblocked/resolved?

Hi, sorry for the delay.

  1. Traditionally UBN! is reserved for direct breakages. I won't quibble (pun intended) with this too much here though.
  2. We (RelEng) have been very backlogged by current responses to security related issues. That is nearing an end soon, but we have a lot to catch up on.
  3. There is a legitimate question on the gerrit change regarding the length of time these additional tests take to complete and how that will slow down test runs for all developers. It is an open question on how long we should aim to have our unit tests take and how we should keep the time down even when adding new functionality (eg: adding new extensions to the gate). I would appreciate input on this from others if there is any.

Thanks for the answers @greg.

I understand @Addshore had valid reasons to make the priority of this the highest possible, but given what you say, I'll take the liberty now to bring it back to High for now.
You folks being busy with recent events is completely understandable of course.
I'll see in the next few days if we at WMDE from our side would be able to provide some information related to the topic of slow down due to additional tests being added.

WMDE-leszek lowered the priority of this task from Unbreak Now! to High.Apr 3 2019, 7:32 AM

Since this is causing outages, I guess it is really time to tackle the issue.

For the context:

We have wmf-quibble-* jobs which clone several extensions and run all their tests. That is also known as the extension-gate.

Scribunto has a set of test to make sure it works fine with some LUA interpreters. That is really nice but takes a few minutes iirc. When one send a patch for another extension, we are 100% sure they are not going to affect how LUA runs.

The wmf-quibble jobs do:

  • trigger for extensions having in zuul/layout.yaml the template extension-gate
  • have a list of all extensions listed in zuul/parameter_functions.py in gated_extensions variable
  • a test in integration/config ensure that both lists match

Quibble roughly runs the MediaWiki core PHPUnit suite extensions defined as:

tests/phpunit/suite.xml
<testsuite name="extensions">
    <directory>structure</directory>
    <file>suites/ExtensionsTestSuite.php</file>
    <file>suites/ExtensionsParserTestSuite.php</file>
    <file>suites/LessTestSuite.php</file>
</testsuite>

It does a first run without tests flagged with @group Database:

tests/phpunit/phpunit.php --testsuite extensions --exclude-group Broken,ParserFuzz,Stub,Database

Then a second run with those Database tests:

tests/phpunit/phpunit.php  --testsuite extensions --group Database --exclude-group Broken,ParserFuzz,Stub

The group exclusions are hardcoded in Quibble.

The extension test suite would add tests from any extensions /tests/phpunit directories as well as those added via the UnitTestsList hook.


The behavior I would like is a way to exclude a subset of Scribunto test unless the patch got send for that extension. What I thought is to flag those specific tests with a new group such as @group system-integration. Then Quibble would exclude that group from both of the commands above and add a third run which would be:

tests/phpunit/phpunit.php --group system-integration  --exclude-group Broken,ParserFuzz,Stub -- /path/to/extension/that/triggered/the/patch

Which I think would address it?

Scribunto has a set of test to make sure it works fine with some LUA interpreters. That is really nice but takes a few minutes iirc. When one send a patch for another extension, we are 100% sure they are not going to affect how LUA runs.

  • There are 6 tests that test PHP hooks (in tests/phpunit/common), with no Lua involved. Probably more should be written.
    • These take 1–2 seconds on my laptop.
  • There are 67 for LuaStandalone integration (in tests/phpunit/engines/LuaStandalone).
    • These take 6–7 seconds on my laptop.
  • There are 46 for LuaSandbox integration (in tests/phpunit/engines/LuaSandbox).
    • These take about 5 seconds on my laptop.
  • Then there are the 1000+ tests in tests/phpunit/engines/LuaCommon, which are run for both LuaSandbox and LuaStandalone to ensure things work under both environments.
    • The run for LuaSandbox takes around 34 seconds on my laptop, and the run for LuaStandalone takes around 163 seconds.

The last category is testing the code in case of subtle differences between the two environments, rather than Scribunto itself's integration with the environments, so calling them "system-integration" wouldn't really be accurate.

You could probably use the existing 'LuaStandalone' group like you proposed the 'system-integration' group. I don't think 'system-integration' would be an accurate group name, but if we can think of one that would be accurate for "tests not worth running when testing other extensions" I wouldn't be opposed to having Scribunto's suite tag the LuaStandalone tests with that too.

Similar is FileImporter which has some slow tests ( https://gerrit.wikimedia.org/r/#/c/514065/ ), and IIRC CirrusSearch has a few slow tests we would probably want to skip as well.

I guess we can come up with a PHPUnit group name, in mediawiki/core add a new testsuite that would exclude the group and then switch the wmf-quibble jobs to use that new suite?

I guess we can come up with a PHPUnit group name, in mediawiki/core add a new testsuite that would exclude the group and then switch the wmf-quibble jobs to use that new suite?

The "come up with a PHPUnit group name" part is what I was concerned with, once we have the name I can easily add it to Scribunto's tests. I have no opinion on the core testsuite and quibble details.

If we want a integration test group whose objective is "integration-tests-we-want-to-run-only-in-this-extension", then the name should reflect the intent right from the start to avoid people making assumptions we then have to support.

How about "extension-isolate-integration"?

I created a new task for the creation and implementation of that group, I think it might be better to move the discussion and task around that there.

This somehow happened in the last weekend for wikibase: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/60806/consoleFull (on Wikibase patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/524688)
Does anyone know why? the test job run time has jumped drastically, almost doubled.

This somehow happened in the last weekend for wikibase: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/60806/consoleFull (on Wikibase patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/524688)
Does anyone know why? the test job run time has jumped drastically, almost doubled.

Please fill another task, eventually against with ci-test-error (WMF-deployed Build Failure) / ci-test-error :]

I think we can revisit this now. I moved 675 tests of Scribunto from "integration" type to "unit" (T230701: Migrate Scribunto to stop using MediaWikiIntegrationTestCase on unit tests) That made Scribunto tests 40% faster (from ~2 minutes to 1). One big issue is if we add Scribunto to gate, since Wikibase is already there, Wikibase client scribunto tests also will be ran which are 329 more tests (1 more minute on my laptop).

alaa_wmde moved this task from Incoming to Needs Work on the Wikidata-Campsite board.
alaa_wmde added a subscriber: alaa_wmde.

Need to figure out the current status of this and why/when we should do it

@hashar do you think we can try re adding this extension to the gate jobs and see how the performance is?

Change 587600 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/Scribunto@master] tests: Mark LuaStandalone tests as @group Standalone

https://gerrit.wikimedia.org/r/587600

Change 587600 merged by jenkins-bot:
[mediawiki/extensions/Scribunto@master] tests: Mark non-LuaSandbox tests as @group Standalone

https://gerrit.wikimedia.org/r/587600

Change 497574 merged by jenkins-bot:
[integration/config@master] gate: Add Scribunto

https://gerrit.wikimedia.org/r/497574

Mentioned in SAL (#wikimedia-releng) [2020-04-17T21:11:53Z] <James_F> Zuul: Add Scribunto to the gate after only five years of talking about it T125050