
Reduce runtime of MW shared gate Jenkins jobs to 5 min
Open, NormalPublic

Description

Objective

For the "typical" time it takes for a commit to be approved and landed in master to be 5 minutes or less.

Status quo

As of 11 June 2019, the gate usually takes around 20 minutes.

The two slowest jobs typically take 13-17 minutes each. The time for the gate overall is rarely under 15 minutes, because we run several of these jobs (increasing the chance that one of them is randomly slow), and while they can run in parallel, they don't always start immediately, given the limited number of CI execution slots.

Below is a sample from a MediaWiki commit (master branch):

Gate pipeline build succeeded.
  • wmf-quibble-core-vendor-mysql-php72-docker SUCCESS in 12m 03s
  • wmf-quibble-core-vendor-mysql-hhvm-docker SUCCESS in 14m 12s
  • mediawiki-quibble-vendor-mysql-php72-docker SUCCESS in 7m 34s
  • mediawiki-quibble-vendor-mysql-php71-docker SUCCESS in 7m 12s
  • mediawiki-quibble-vendor-mysql-php70-docker SUCCESS in 6m 48s
  • mediawiki-quibble-vendor-mysql-hhvm-docker SUCCESS in 8m 32s
  • mediawiki-quibble-vendor-postgres-php72-docker SUCCESS in 10m 05s
  • mediawiki-quibble-vendor-sqlite-php72-docker SUCCESS in 7m 04s
  • mediawiki-quibble-composer-mysql-php70-docker SUCCESS in 8m 14s

(+ jobs that take less than 3 minutes: composer-test, npm-test, and phan.)

These can be grouped into two kinds of jobs:

  • wmf-quibble: These install MW together with the gated extensions, and then run all PHPUnit, Selenium and QUnit tests.
  • mediawiki-quibble: These install MW with bundled extensions only, and then run PHPUnit, Selenium and QUnit tests.

Stats from wmf-quibble-core-vendor-mysql-php72-docker:

  • 9-15 minutes (mediawiki-core, master, gate-and-submit)
  • Sample:
    • PHPUnit (dbless): 1.91 minutes / 15,782 tests.
    • QUnit: 29 seconds / 1286 tests.
    • Selenium: 143 seconds / 43 tests.
    • PHPUnit (db): 3.85 minutes / 4377 tests.

Stats from wmf-quibble-core-vendor-mysql-hhvm-docker:

  • 13-17 minutes (mediawiki-core, master, gate-and-submit)
  • Sample:
    • PHPUnit (dbless): 2.43 minutes / 15,782 tests.
    • QUnit: 32 seconds / 1286 tests.
    • Selenium: 187 seconds / 43 tests.
    • PHPUnit (db): 4.78 minutes / 4377 tests.


Scope of task

This task represents the goal of reaching 5 minutes or less. The work tracked here includes researching ways to get there, trying them out, and putting one or more ideas into practice. The task can be closed once we have reached the goal or if we have concluded it isn't feasible or useful.

Feel free to add/remove subtasks as we go along and consider different things.

Stuff done

Ideas to explore and related work

  • Look at the PHPUnit "Test Report" for a commit and sort the root suites by duration. Find the slowest ones and examine their test suites for ways to improve them. Are they repeating expensive setups? Perhaps that work can be skipped or re-used. Are they running hundreds of variations of the same integration test? Perhaps reduce that to just one case for that story, and apply the remaining cases to a lighter unit test instead (see the sketch below).
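
As an illustration of that last point, here is a hypothetical sketch (the SlugFormatter class, its format() method, and the test data are all invented for this example): the bulk of the input variations run through a plain, cheap unit test, while a single representative case stays in the full integration test.

SlugFormatterTest.php - illustrative sketch
<?php
use PHPUnit\Framework\TestCase;

/**
 * Hypothetical subject under test: pure string logic with no database
 * or MediaWiki service needs.
 */
class SlugFormatter {
	public static function format( string $title ): string {
		$title = trim( preg_replace( '/\s+/', ' ', $title ) );
		return str_replace( ' ', '_', ucfirst( $title ) );
	}
}

/**
 * The many input variations live in this cheap unit test; the integration
 * suite keeps only one representative end-to-end case.
 */
class SlugFormatterTest extends TestCase {

	/**
	 * @dataProvider provideTitles
	 */
	public function testFormat( $input, $expected ) {
		$this->assertSame( $expected, SlugFormatter::format( $input ) );
	}

	public static function provideTitles() {
		return [
			[ 'Foo bar', 'Foo_bar' ],
			[ ' Foo  bar ', 'Foo_bar' ],
			// ...hundreds more cheap variations belong here, not in an
			// integration test that resets the database for each case.
		];
	}
}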

Related Objects

Event Timeline

Krinkle triaged this task as Normal priority. Jun 17 2019, 6:12 PM

Change 517202 merged by jenkins-bot:
[mediawiki/core@master] Add a report about slow PHPUnit tests

https://gerrit.wikimedia.org/r/517202

Change 517686 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Babel@master] Make BabelTest less slow (don't create a dozen wiki pages for every test)

https://gerrit.wikimedia.org/r/517686

Change 517696 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] AutoLoader: Make structure test less slow

https://gerrit.wikimedia.org/r/517696

Change 517686 merged by jenkins-bot:
[mediawiki/extensions/Babel@master] Make BabelTest less slow (don't create a dozen wiki pages for every test)

https://gerrit.wikimedia.org/r/517686

2487 ms to run AutoLoaderStructureTest::testAutoloadOrder

I've instrumented generateLocalAutoload.php locally with php72-tideways (see P8629):

pair | wall time | call count
main() | 9,300 ms | 1
main() ==> AutoloadGenerator::initMediaWikiDefault | 9,298 ms | 1
AutoloadGenerator::initMediaWikiDefault ==> AutoloadGenerator::readDir | 9,219 ms | 4
AutoloadGenerator::readDir ==> AutoloadGenerator::readFile | 9,144 ms | 2162
AutoloadGenerator::readFile ==> ClassCollector::getClasses | 9,111 ms | 2172
AutoloadGenerator::initMediaWikiDefault ==> AutoloadGenerator::readFile | 78 ms | 10
ClassCollector::getClasses ==> ClassCollector::tryBeginExpect | 56 ms | 3452565
ClassCollector::getClasses ==> ClassCollector::tryEndExpect | 8 ms | 34093
ClassCollector::tryEndExpect ==> ClassCollector::implodeTokens | 2 ms | 2078

I've filtered out most <1ms entries and other uninteresting pairs. (It's about 4x slower for me locally than in CI.)

The profile shows that the innermost method which still accounts for most of the time spent is ClassCollector::getClasses. Specifically, the time is:

  • not spent iterating directories (readDir),
  • not spent reading file contents from disk (readFile),
  • not spent in the lower-level ClassCollector methods (tryBeginExpect, tryEndExpect).

This is what that leaves us with:

ClassCollector.php - simplified
public function getClasses( $code ) {
	$this->startToken = null;
	foreach ( token_get_all( $code ) as $token ) { // tokenize the entire file
		if ( $this->startToken === null ) {
			$this->tryBeginExpect( $token ); // fast
		} else {
			$this->tryEndExpect( $token ); // fast
		}
	}
}

This means the time has to come from token_get_all(). This isn't surprising. Parsing PHP syntax is certainly non-trivial, and we do it for thousands of files. I had just figured that the I/O would be at least as expensive as that, but apparently not.

Of the 9298 ms in total, 9047 ms (9111-56-8) is spent in the PHP tokenizer.

We do this to extract namespace and class statements, and calls to class_alias(). The other 99.9% of the code is parsed for no reason other than that the only way to know what to skip is to tokenise it, so... fair enough. I've looked at what PHP_CodeSniffer and Phan do, to see if maybe they're using a faster tokenizer, but nope: they use the same built-in function.

Unlike Phan and PHPCS, though, we don't do any complex analysis, nor do we try to find inline operators that are hard to disambiguate. I tried patching it with a basic regular expression to skip the vast majority of lines in the PHP files that we don't need to tokenise (a rough sketch of the idea is below).
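
For illustration only, here is a standalone sketch of the prefilter idea. It is not the patch that was merged into ClassCollector; the regular expression and the script are simplified assumptions.

prefilter-sketch.php - illustrative only
<?php
// Standalone sketch: compare how many tokens token_get_all() has to
// produce for a whole file versus only the lines that could possibly
// declare a class-like symbol or call class_alias().

function looksInteresting( string $line ): bool {
	return (bool)preg_match(
		'/^\s*(?:namespace\s|(?:abstract\s+|final\s+)?class\s|interface\s|trait\s)|class_alias\s*\(/',
		$line
	);
}

$code = file_get_contents( $argv[1] );

// Baseline: tokenize the entire file, as ClassCollector::getClasses() did.
$allTokens = count( token_get_all( $code ) );

// Prefiltered: tokenize only the lines the cheap regex lets through.
$kept = array_filter( explode( "\n", $code ), 'looksInteresting' );
$keptTokens = count( token_get_all( "<?php\n" . implode( "\n", $kept ) . "\n" ) );

printf( "%d tokens for the full file, %d after the line prefilter\n",
	$allTokens, $keptTokens );

The change merged later in this task (517696, "Skip tokenizing of irrelevant lines in ClassCollector") applies the same principle inside ClassCollector itself.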

The result is a 35X reduction (or 97% off):

pair | wall time | call count
main() | 252 ms | 1
main() ==> AutoloadGenerator::initMediaWikiDefault | 250 ms | 1
AutoloadGenerator::initMediaWikiDefault ==> AutoloadGenerator::readDir | 249 ms | 4
AutoloadGenerator::readDir ==> AutoloadGenerator::readFile | 188 ms | 2162
AutoloadGenerator::readFile ==> ClassCollector::getClasses | 96 ms | 2172
ClassCollector::getClasses ==> ClassCollector::tryEndExpect | 6 ms | 5573
AutoloadGenerator::initMediaWikiDefault ==> AutoloadGenerator::readFile | 0 ms | 10
ClassCollector::tryEndExpect ==> ClassCollector::implodeTokens | 0 ms | 2078
ClassCollector::getClasses ==> ClassCollector::tryBeginExpect | 0 ms | 18343

Change 517951 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Document which FileModule methods use a DB

https://gerrit.wikimedia.org/r/517951

Change 517696 merged by jenkins-bot:
[mediawiki/core@master] AutoLoader: Skip tokenizing of irrelevant lines in ClassCollector

https://gerrit.wikimedia.org/r/517696

Change 517951 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Document which FileModule methods use a DB

https://gerrit.wikimedia.org/r/517951

hashar updated the task description. Jul 2 2019, 9:32 AM
Krinkle updated the task description. Jul 2 2019, 1:59 PM
Krinkle updated the task description. Jul 2 2019, 6:44 PM

Change 523012 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] structure: Speed up dependency checks in ResourcesTest

https://gerrit.wikimedia.org/r/523012

Change 523012 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Speed up dependency checks in structure/ResourcesTest

https://gerrit.wikimedia.org/r/523012

Change 524688 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Avoid using MediaWikiIntegrationTestCase on unit tests

https://gerrit.wikimedia.org/r/524688

Change 524688 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Avoid using MediaWikiIntegrationTestCase on unit tests

https://gerrit.wikimedia.org/r/524688

Krinkle updated the task description. Jul 22 2019, 4:50 PM

Change 525675 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] phpunit: Attempt to make MediaWikiCoversValidator a little faster

https://gerrit.wikimedia.org/r/525675

Change 525675 merged by jenkins-bot:
[mediawiki/core@master] phpunit: Avoid get_class() in MediaWikiCoversValidator

https://gerrit.wikimedia.org/r/525675

Change 526812 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Remove slow structure test for checking getVersionHash

https://gerrit.wikimedia.org/r/526812

Change 526812 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Remove slow structure test for checking getVersionHash

https://gerrit.wikimedia.org/r/526812

Change 528903 had a related patch set uploaded (by Krinkle; owner: Ladsgroup):
[mediawiki/core@master] localisation: Add process cache to LCStoreDB

https://gerrit.wikimedia.org/r/528903

Change 528933 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[integration/quibble@master] Set cache directory

https://gerrit.wikimedia.org/r/528933

Change 528933 merged by jenkins-bot:
[integration/quibble@master] Set cache directory

https://gerrit.wikimedia.org/r/528933

I think the best returns here are going to come not from pinpointing specific tests, but from looking for systemic issues. Even if you find a test that takes ten seconds, optimizing it will save at most ten seconds, and there aren't enough such tests to add up to the target savings of 10+ minutes.

If we have enough computational resources and the problem is just latency, it seems obvious that the best thing to do is to split up the long-running test suites into smaller pieces. I don't think it would be hard to find some breakup by subdirectory such that no piece takes more than three minutes to run. As long as the setup time is short, you can get the latency as low as you like that way.

If we want to save on computational resources, then it would be worth getting a profiler trace of an entire test run and just look for low-hanging fruit. It's entirely possible that the easiest optimizations will be in MediaWiki itself and not in the tests.

@Simetrical Agreed. The slow-test detection isn't meant to invite micro-optimisation of individual test cases. It's meant to expose systemic issues that can be addressed wholesale, by looking at those tests as examples and finding out why they are slow. The only slow test that I sped up purely for its own sake was AutoLoaderStructureTest, because it's a snowflake test that represented a big chunk of the total.

  • High-level ideas are things like switching MySQL's datadir to tmpfs (done a few years ago), which cut the time in half (T96230).
  • Not resetting the database for every single unit test, or making the resetting logic itself smarter/faster.
  • The new lightweight MediaWikiUnitTestCase base class, with which many tests can run faster without the needless overhead of initialising/resetting systems that aren't involved (see the sketch after this list).
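
As a minimal sketch of that last point (the test class and its assertion are invented for this example), a test with no database, service, or configuration needs can extend the lightweight base class directly:

ExampleSketchTest.php - illustrative only
<?php

/**
 * Hypothetical example: because nothing here touches the database,
 * services, or global configuration, the test can extend
 * MediaWikiUnitTestCase and skip the per-test database and service
 * resetting that MediaWikiIntegrationTestCase performs.
 */
class ExampleSketchTest extends MediaWikiUnitTestCase {

	public function testNormalisesSpaces() {
		// Pure in-memory logic; no MediaWiki setup required.
		$this->assertSame( 'Foo_bar', str_replace( ' ', '_', 'Foo bar' ) );
	}
}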

I hope to see the faster LCStore enabled in Quibble soon as well, which should help.

Another change we can reconsider is how we split up the jobs. We already run composer-test and phan separately, but npm-test, qunit and wdio are still part of the same repeated job as phpunit. Letting those run in a concurrent job on one PHP flavour only should speed things up (higher concurrency, faster turnaround, fewer CI resources needed, less CI stalling).

If you want to split things up, split up PHPUnit tests themselves into multiple chunks. That will allow reducing latency more or less arbitrarily, assuming we have adequate server resources and don't spend too much time setting up the test instance.

Do you have a profile of a full test run? That should be very helpful.

It might also be worth checking if tests are waiting at all on I/O. They shouldn't be; everything should be in memory. If any noticeable fraction of the CPU usage is spent in iowait, that should be easy to cut to zero.

Looking at the test reports, I don't think there's any single test that can really be considered a bottleneck. Sure, converting lots of tests into actual unit tests would help, but I'm uncertain about the possible runtime reduction. What I think would really make things faster is T225628 + T192166. Without PHP 7.0 and HHVM, that's 4 fewer jobs. Out of the 20 jobs reported in the task description, that'd be a huge boost. Those tasks (especially T225628) could turn out to be the easiest things we could do for the greatest improvement.

Again, there are two concerns: reducing latency, and reducing server load. The two are only related if the servers are actually overloaded and therefore individual jobs take longer. What exactly is our concern here? Latency, server load, or both?

Again, there are two concerns: reducing latency, and reducing server load. The two are only related if the servers are actually overloaded and therefore individual jobs take longer. What exactly is our concern here? Latency, server load, or both?

I can't say for sure, but I believe both. It happens pretty often that CI is overloaded by tons of patches, and that ends up slowing down all builds. And the more patches you send, the slower it gets. It could be just an impression, but I feel like this has become even worse than usual in the last few days.

For the overloading issue, is there any option of just getting more resources? Maybe using EC2 or similar? We're trying to add lots more tests, and one way or another, we're going to need more resources to run them.

Ladsgroup added a comment. Edited Fri, Aug 30, 2:06 PM

Looking at the test reports, I don't think there's any single test that can really be considered a bottleneck. Sure, converting lots of tests into actual unit tests would help, but I'm uncertain about the possible runtime reduction. What I think would really make things faster is T225628 + T192166. Without PHP 7.0 and HHVM, that's 4 fewer jobs. Out of the 20 jobs reported in the task description, that'd be a huge boost. Those tasks (especially T225628) could turn out to be the easiest things we could do for the greatest improvement.

Dropping PHP 7.0 wouldn't make much difference: overload and jobs staying in the queue barely happen (like when we had the incident last week); in general, jobs go straight to being run. OTOH, dropping HHVM actually makes things faster, because it's the slowest job and it's the job that determines the overall run time (Amdahl's Law). For example, the slowest Wikibase job is the HHVM one, which takes 20 minutes, while the second slowest takes 14 minutes. So if we drop HHVM, we save 6 minutes, which is crazy.

Dropping PHP 7.0 wouldn't make much difference: overload and jobs staying in the queue barely happen (like when we had the incident last week)

AFAICS it's happening right now, too, isn't it? I see it pretty often...

OTOH, dropping HHVM actually makes things faster, because it's the slowest job and it's the job that determines the overall run time (Amdahl's Law). For example, the slowest Wikibase job is the HHVM one, which takes 20 minutes, while the second slowest takes 14 minutes. So if we drop HHVM, we save 6 minutes, which is crazy.

Yes, I've also noticed that HHVM jobs are slower, and I've always wondered whether it's just a coincidence, whether they have fewer resources or run more tests, or whether it's just HHVM that's slow.

Dropping PHP 7.0 wouldn't make much difference: overload and jobs staying in the queue barely happen (like when we had the incident last week)

AFAICS it's happening right now, too, isn't it? I see it pretty often...

Maybe my timing is weird, but I barely see that, and when I do, it's a very small percentage of the total time.

OTOH, dropping HHVM actually makes things faster, because it's the slowest job and it's the job that determines the overall run time (Amdahl's Law). For example, the slowest Wikibase job is the HHVM one, which takes 20 minutes, while the second slowest takes 14 minutes. So if we drop HHVM, we save 6 minutes, which is crazy.

Yes, I've also noticed that HHVM jobs are slower, and I've always wondered whether it's just a coincidence, whether they have fewer resources or run more tests, or whether it's just HHVM that's slow.

HHVM is just slower than php7, at least in tests.

Maybe my timing is weird, but I barely see that, and when I do, it's a very small percentage of the total time.

It depends... Having something like 5 hours of runtime is very rare (fortunately!), but if you take a look at https://integration.wikimedia.org/zuul/ right now, there's e.g. a core patch that has been sitting there for 26 minutes with no jobs running. The task description talks about "13-17" minutes, so...

HHVM is just slower than php7, at least in tests.

Nice to know, thanks.

Krinkle added a comment. Edited Wed, Sep 4, 5:49 PM

If you want to split things up, split up PHPUnit tests themselves into multiple chunks.

See also T50217, which is on the list.

Task description

As of 11 June 2019, the gate usually takes around 20 minutes.
The two slowest jobs typically take 13-17 minutes each. The time for the gate overall is rarely under 15 minutes.

While some of this has gotten better, in the last few weeks things have very rapidly gotten worse. The PHPUnit and WDIO steps have more than doubled in run time. There is some hard data on this at T232626.

For Selenium/QUnit, we should be able to extract them into a standalone job (T232759).

Is it possible to have the default behavior be that if one job fails, all other jobs for the build are terminated? That way you get much quicker feedback and reduce resource usage. (I think this idea has been discussed somewhere but not sure if there's a phab task for it.)

There could maybe also be an option like "recheck dont-stop-on-failure", which would use the current behavior of letting all jobs run even if one or more have failed.

Is it possible to have the default behavior be that if one job fails, all other jobs for the build are terminated? That way you get much quicker feedback and reduce resource usage. (I think this idea has been discussed somewhere but not sure if there's a phab task for it.)

That logic is in the Zuul scheduler. It continuously processes items in the queues to check their status and whether jobs have completed for a change, but it only reports once all jobs have completed. The code is:

zuul/scheduler.py
class BasePipelineManager(object):
...
    def _processOneItem(self, item, nnfi):
...
        if self.pipeline.didAnyJobFail(item):  # <<< flags the item as failing as soon as a job has failed
            failing_reasons.append("at least one job failed")
...
        if ((not item_ahead) and self.pipeline.areAllJobsComplete(item)  # <<< but reporting is conditional on all jobs having completed
            and item.live):
            try:
                self.reportItem(item)  # and once all jobs have completed, report back to Gerrit
            except exceptions.MergeFailure:
                failing_reasons.append("it did not merge")
                for item_behind in item.items_behind:
                    self.log.info("Resetting builds for change %s because the "
                                  "item ahead, %s, failed to merge" %
                                  (item_behind.change, item))
                    self.cancelJobs(item_behind)
            self.dequeueItem(item)  # Finally, remove the item from the queue after it has been reported
            changed = True
I can't tell what kind of madness might occur if we changed the conditional from areAllJobsComplete to didAnyJobFail :-\

There could maybe also be an option like "recheck dont-stop-on-failure", which would use the current behavior of letting all jobs run even if one or more have failed.

I am pretty sure that cannot be implemented, at least not without some major code added on top of our outdated Zuul fork :-\

Change 537469 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: rebuild for quibble 0.0.35

https://gerrit.wikimedia.org/r/537469

Change 537469 merged by jenkins-bot:
[integration/config@master] docker: rebuild for quibble 0.0.35

https://gerrit.wikimedia.org/r/537469

Change 537485 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: Upgrade all quibble jobs to 0.0.35

https://gerrit.wikimedia.org/r/537485

Change 537485 merged by jenkins-bot:
[integration/config@master] jjb: Upgrade all quibble jobs to 0.0.35

https://gerrit.wikimedia.org/r/537485

Change 528933 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[integration/quibble@master] Set cache directory
https://gerrit.wikimedia.org/r/528933

This has been released and deployed on September 17th with Quibble 0.0.35, so we now have Quibble setting $wgCacheDirectory. I haven't verified the outcome of the change though. Eventually, when testing it, I found out that the SQLite object cache is not properly initialized and not used; that is T233117.

This has been released and deployed on September 17th with Quibble 0.0.35, so we now have Quibble setting $wgCacheDirectory. I haven't verified the outcome of the change though. Eventually, when testing it, I found out that the SQLite object cache is not properly initialized and not used; that is T233117.

This still needs this patch to be merged to be more effective, I think.
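
For context, a minimal sketch of what the Quibble change amounts to on the MediaWiki side (the path is illustrative, not the one Quibble actually uses):

LocalSettings.php - illustrative
<?php
// With a cache directory configured, the localisation cache can use
// fast local files instead of the database-backed LCStoreDB store,
// which is what should make test runs cheaper.
$wgCacheDirectory = '/workspace/data/cache';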

Change 537743 merged by jenkins-bot:
[integration/config@master] jjb: [quibble] Fix syntax for --skip command
https://gerrit.wikimedia.org/r/537743

This will reduce the quibble job length by skipping "npm test" step (which is already covered by the node10-test jobs). See also T233143 and T225248#5485061.

Krinkle updated the task description. Wed, Sep 18, 10:07 PM