
Reduce runtime of MW shared gate Jenkins jobs to 5 min
Open, MediumPublic

Description

Objective

For the "typical" time it takes for a commit to be approved and landed in master to be 5 minutes or less.

Status quo

As of 11 June 2019, the gate usually takes around 20 minutes.

The two slowest jobs typically take 13-17 minutes each. The gate overall is rarely under 15 minutes, because we run several of these jobs (increasing the chances of random slowness), and while they can run in parallel, they don't always start immediately, given limited CI execution slots.

Below is a sample from a MediaWiki commit (master branch):

Gate pipeline build succeeded.
  • wmf-quibble-core-vendor-mysql-php72-docker SUCCESS in 12m 03s
  • wmf-quibble-core-vendor-mysql-hhvm-docker SUCCESS in 14m 12s
  • mediawiki-quibble-vendor-mysql-php72-docker SUCCESS in 7m 34s
  • mediawiki-quibble-vendor-mysql-php71-docker SUCCESS in 7m 12s
  • mediawiki-quibble-vendor-mysql-php70-docker SUCCESS in 6m 48s
  • mediawiki-quibble-vendor-mysql-hhvm-docker SUCCESS in 8m 32s
  • mediawiki-quibble-vendor-postgres-php72-docker SUCCESS in 10m 05s
  • mediawiki-quibble-vendor-sqlite-php72-docker SUCCESS in 7m 04s
  • mediawiki-quibble-composer-mysql-php70-docker SUCCESS in 8m 14s

(+ jobs that take less than 3 minutes: composer-test, npm-test, and phan.)

These can be grouped into two kinds of jobs:

  • wmf-quibble: These install MW with the gated extensions, and then run all PHPUnit, Selenium and QUnit tests.
  • mediawiki-quibble: These install MW with bundled extensions only, and then run PHPUnit, Selenium and QUnit tests.

Stats from wmf-quibble-core-vendor-mysql-php72-docker:

  • 9-15 minutes (mediawiki-core, master, gate-and-submit)
  • Sample:
    • PHPUnit (dbless): 1.91 minutes / 15,782 tests.
    • QUnit: 29 seconds / 1286 tests.
    • Selenium: 143 seconds / 43 tests.
    • PHPUnit (db): 3.85 minutes / 4377 tests.

Stats from wmf-quibble-core-vendor-mysql-hhvm-docker:

  • 13-17 minutes (mediawiki-core, master, gate-and-submit)
  • Sample:
    • PHPUnit (dbless): 2.43 minutes / 15,782 tests.
    • QUnit: 32 seconds / 1286 tests.
    • Selenium: 187 seconds / 43 tests.
    • PHPUnit (db): 4.78 minutes / 4377 tests.


Scope of task

This task represents the goal of reaching 5 minutes or less. The work tracked here includes researching ways to get there, trying them out, and putting one or more ideas into practice. The task can be closed once we have reached the goal or if we have concluded it isn't feasible or useful.

Feel free to add/remove subtasks as we go along and consider different things.

Stuff done

Ideas to explore and related work

  • Look at the PHPUnit "Test Report" for a commit and sort the root-level suites by duration. Find the slowest ones and examine their test suites for ways to improve them. Are they repeating expensive setups? Perhaps that can be skipped or re-used. Are they running hundreds of variations of the same integration test? Perhaps reduce that to just one case for that story, and apply the remaining cases to a lighter unit test instead. (See the sketch below.)
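
As a rough illustration of the tooling this calls for, here is a minimal sketch that ranks PHPUnit suites by duration from a JUnit-style XML report. The report path ("junit.xml") and layout are assumptions; the file actually produced by Quibble/Jenkins may live elsewhere.

#!/usr/bin/env python3
# Sketch: list the slowest PHPUnit suites from a JUnit-style XML report.
# The default 'junit.xml' path is a placeholder, not the real Quibble output path.
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict

def slowest_suites(report_path, top=20):
    durations = defaultdict(float)
    counts = defaultdict(int)
    # JUnit reports nest <testsuite> elements; aggregate time per suite name.
    # Parent suites include their children's time, which is fine for a quick ranking.
    for suite in ET.parse(report_path).iter('testsuite'):
        name = suite.get('name') or '(unnamed)'
        durations[name] += float(suite.get('time') or 0)
        counts[name] += int(suite.get('tests') or 0)
    ranked = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, secs in ranked[:top]:
        print(f'{secs:8.2f}s  {counts[name]:5d} tests  {name}')

if __name__ == '__main__':
    slowest_suites(sys.argv[1] if len(sys.argv) > 1 else 'junit.xml')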

Related Objects

(Subtasks: a mix of open, resolved, and declined tasks, variously assigned to Ladsgroup, aaron, hashar, aborrero, Krinkle, and Jdforrester-WMF.)

Event Timeline


Change 524688 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Avoid using MediaWikiIntegrationTestCase on unit tests

https://gerrit.wikimedia.org/r/524688

Change 524688 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Avoid using MediaWikiIntegrationTestCase on unit tests

https://gerrit.wikimedia.org/r/524688

Krinkle updated the task description. Jul 22 2019, 4:50 PM

Change 525675 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] phpunit: Attempt to make MediaWikiCoversValidator a little faster

https://gerrit.wikimedia.org/r/525675

Change 525675 merged by jenkins-bot:
[mediawiki/core@master] phpunit: Avoid get_class() in MediaWikiCoversValidator

https://gerrit.wikimedia.org/r/525675

Change 526812 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Remove slow structure test for checking getVersionHash

https://gerrit.wikimedia.org/r/526812

Change 526812 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Remove slow structure test for checking getVersionHash

https://gerrit.wikimedia.org/r/526812

Change 528903 had a related patch set uploaded (by Krinkle; owner: Ladsgroup):
[mediawiki/core@master] localisation: Add process cache to LCStoreDB

https://gerrit.wikimedia.org/r/528903

Change 528933 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[integration/quibble@master] Set cache directory

https://gerrit.wikimedia.org/r/528933

Change 528933 merged by jenkins-bot:
[integration/quibble@master] Set cache directory

https://gerrit.wikimedia.org/r/528933

I think the best returns here are going to come not from pinpointing specific tests, but from looking for systemic issues. Even if you find a test that takes ten seconds, optimizing it will save at most ten seconds, and there aren't enough such tests to add up to the target savings of 10+ minutes.

If we have enough computational resources and the problem is just latency, it seems obvious that the best thing to do is just split up the long-running test suites into little pieces. I don't think it would be hard to find some breakup of subdirectories such that no piece takes more than three minutes to run. As long as the setup time is short, you can drive the latency as low as you like this way.
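
To make the splitting idea concrete, here is a sketch of greedy packing of suites into parallel chunks by measured duration. The suite names and durations are made up; real numbers would come from a previous run's report.

import heapq

def chunk_suites(durations, num_chunks):
    # Longest-processing-time greedy packing: assign each suite, slowest
    # first, to whichever chunk currently has the least total time.
    chunks = [(0.0, i, []) for i in range(num_chunks)]  # (seconds, chunk id, suites)
    heapq.heapify(chunks)
    for suite, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        total, i, members = heapq.heappop(chunks)
        members.append(suite)
        heapq.heappush(chunks, (total + secs, i, members))
    return sorted(chunks, key=lambda c: c[1])

# Made-up example: split these suites across 4 parallel chunks.
example = {
    'includes/api': 180.0, 'includes/parser': 150.0, 'includes/db': 120.0,
    'includes/resourceloader': 90.0, 'tests/structure': 60.0,
}
for total, i, members in chunk_suites(example, 4):
    print(f'chunk {i}: {total:.0f}s  {members}')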

If we want to save on computational resources, then it would be worth getting a profiler trace of an entire test run and just look for low-hanging fruit. It's entirely possible that the easiest optimizations will be in MediaWiki itself and not in the tests.

@Simetrical Agreed. The slow-test detection isn't meant to invite micro-optimisation of individual test cases. It's meant to expose systemic issues that can be addressed wholesale, by looking at those tests as examples and finding out why they are slow. The only slow test I sped up purely for its own sake was AutoLoaderStructureTest, because it's a snowflake test that represented a big chunk of the total.

  • High-level ideas are things like switching MySQL's datadir to tmpfs (a few years ago), which cut the time in half (T96230).
  • Not resetting the database for every single unit test or making the resetting logic itself smarter/faster.
  • With the new lightweight MediaWikiUnitTestCase base class, many tests can run faster without the needless overhead of initialising/resetting systems that aren't involved.

I hope to see the faster LCStore enabled in Quibble soon as well, which should help.

Another change we can reconsider is how we split up the jobs. We already run composer-test and phan separately. But npm-test, qunit and wdio are still bundled into the same repeated job as phpunit. Letting them run in a concurrent job on one PHP flavour only should speed things up (higher concurrency, faster turnaround, fewer CI resources needed / less CI stalling).

If you want to split things up, split up PHPUnit tests themselves into multiple chunks. That will allow reducing latencies more or less arbitrarily, assuming we have adequate server resources and we don't spend too much time in setting up the test instance.

Do you have a profile of a full test run? That should be very helpful.

It might also be worth checking if tests are waiting at all on I/O. They shouldn't be, everything should be in-memory. If any noticeable fraction of the CPU usage is spent in iowait, that should be easy to cut to zero.
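
For a quick sanity check, something along these lines would show whether iowait is noticeable during a run (assuming psutil is available on the host or container; purely illustrative):

import psutil

def iowait_share(sample_seconds=5):
    # cpu_times_percent() measures over the interval; 'iowait' is the share of
    # time CPUs sat idle waiting for disk I/O (Linux only).
    times = psutil.cpu_times_percent(interval=sample_seconds)
    return getattr(times, 'iowait', 0.0)

if __name__ == '__main__':
    pct = iowait_share()
    print(f'iowait over sample: {pct:.1f}%')
    if pct > 1.0:
        print('Noticeable iowait: the run may be waiting on disk '
              '(e.g. a MySQL datadir or cache directory not on tmpfs).')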

Looking at the test reports, I don't think there's any single test that can really be considered a bottleneck. Sure, converting lots of tests to actual unit tests would help, but I'm uncertain about the possible runtime reduction. What I think would really make things faster is T225628 + T192166. Without PHP7.0 and HHVM, that's 4 fewer jobs. Out of the total of 20 jobs reported in the task description, that'd be a huge boost. Those tasks (especially T225628) could turn out to be the easiest thing we could do for the greatest improvement.

Again, there are two concerns: reducing latency, and reducing server load. The two are only related if the servers are actually overloaded and therefore individual jobs take longer. What exactly is our concern here? Latency, server load, or both?

Again, there are two concerns: reducing latency, and reducing server load. The two are only related if the servers are actually overloaded and therefore individual jobs take longer. What exactly is our concern here? Latency, server load, or both?

I can't say for sure, but I believe both. It happens pretty often that CI is overloaded by tons of patches, and that ends up slowing down all builds. And the more patches you send, the slower it gets. It could be just an impression, but I feel like this has become even worse than usual in the last few days.

For the overloading issue, is there any option of just getting more resources? Maybe using EC2 or similar? We're trying to add lots more tests, and one way or another, we're going to need more resources to run them.

Ladsgroup added a comment (edited). Aug 30 2019, 2:06 PM

Looking at the test reports, I don't think there's any single test that can really be considered a bottleneck. Sure, converting lots of tests to actual unit tests would help, but I'm uncertain about the possible runtime reduction. What I think would really make things faster is T225628 + T192166. Without PHP7.0 and HHVM, that's 4 fewer jobs. Out of the total of 20 jobs reported in the task description, that'd be a huge boost. Those tasks (especially T225628) could turn out to be the easiest thing we could do for the greatest improvement.

Dropping php7.0 wouldn't make much difference; overload and jobs staying in the queue barely happen (like when we had the incident last week), and in general jobs go straight to being run. OTOH dropping hhvm actually makes things faster, because it's the slowest job and it's the job that determines the overall run time (Amdahl's Law). For example, the slowest wikibase job is hhvm and takes 20 minutes; the second slowest takes 14 minutes. So if we drop hhvm, we can save 6 minutes, which is crazy.

Dropping php7.0 wouldn't make much difference; overload and jobs staying in the queue barely happen (like when we had the incident last week)

AFAICS it's happening right now, too, isn't it? I see it pretty often...

OTOH dropping hhvm actually makes things faster, because it's the slowest job and it's the job that determines the overall run time (Amdahl's Law). For example, the slowest wikibase job is hhvm and takes 20 minutes; the second slowest takes 14 minutes. So if we drop hhvm, we can save 6 minutes, which is crazy.

Yes, I've also noticed that HHVM jobs are slower, and always wondered whether it's just a coincidence, whether they have fewer resources or run more tests, or whether it's just HHVM that's slow.

Dropping php7.0 wouldn't make much difference; overload and jobs staying in the queue barely happen (like when we had the incident last week)

AFAICS it's happening right now, too, isn't it? I see it pretty often...

Maybe my timing is weird, but I barely see that, and when I do, it's a very small percentage of the total time.

OTOH dropping hhvm actually makes things faster, because it's the slowest job and it's the job that determines the overall run time (Amdahl's Law). For example, the slowest wikibase job is hhvm and takes 20 minutes; the second slowest takes 14 minutes. So if we drop hhvm, we can save 6 minutes, which is crazy.

Yes, I've also noticed that HHVM jobs are slower, and always wondered whether it's just a coincidence, whether they have fewer resources or run more tests, or whether it's just HHVM that's slow.

HHVM is just slower than php7, at least in tests.

Maybe my timing is weird, but I barely see that, and when I do, it's a very small percentage of the total time.

It depends... Having something like 5 hours of runtime is very rare (fortunately!), but if you take a look at https://integration.wikimedia.org/zuul/ right now, there's e.g. a core patch that has been sitting there for 26 minutes with no jobs running. The task description talks about "13-17" minutes, so...

HHVM is just slower than php7, at least in tests.

Nice to know, thanks.

Krinkle added a comment (edited). Sep 4 2019, 5:49 PM

If you want to split things up, split up PHPUnit tests themselves into multiple chunks.

See also T50217, which is on the list.

Task description

As of 11 June 2019, the gate usually takes around 20 minutes.
The two slowest jobs typically take 13-17 minutes each. The time for the gate overall is rarely under 15 minutes.

While some of it has gotten better, in the last few weeks things have very rapidly gotten worse. The PHPUnit and WDIO steps have more than doubled in run-time. There is some hard data on this at T232626.

For Selenium/QUnit, we should be able to extract them into a standalone job: T232759.

Is it possible to have the default behavior be that if one job fails, all other jobs for the build are terminated? That way you get much quicker feedback and reduce resource usage. (I think this idea has been discussed somewhere but not sure if there's a phab task for it.)

There could maybe also be an option like "recheck dont-stop-on-failure" which would use the current behavior, of letting all jobs run even if one or more have failed.

Is it possible to have the default behavior be that if one job fails, all other jobs for the build are terminated? That way you get much quicker feedback and reduce resource usage. (I think this idea has been discussed somewhere but not sure if there's a phab task for it.)

That logic is in the Zuul scheduler. It continuously processes items in the queues to check their status and whether jobs have completed for a change, but it only reports once all jobs have completed. The code is:

zuul/scheduler.py
1167 class BasePipelineManager(object):
...
1557     def _processOneItem(self, item, nnfi):
...
1606         if self.pipeline.didAnyJobFail(item):  # <<< flags the item as failing as soon as a job fails
1607             failing_reasons.append("at least one job failed")
...
1612         if ((not item_ahead) and self.pipeline.areAllJobsComplete(item)   # <<< but reporting is conditional on all jobs having completed
1613             and item.live):
1614             try:
1615                 self.reportItem(item)    # and once all jobs have completed, report back to Gerrit.
1616             except exceptions.MergeFailure:
1617                 failing_reasons.append("it did not merge")
1618                 for item_behind in item.items_behind:
1619                     self.log.info("Resetting builds for change %s because the "
1620                                   "item ahead, %s, failed to merge" %
1621                                   (item_behind.change, item))
1622                     self.cancelJobs(item_behind)
1623             self.dequeueItem(item)  # Finally remove the item from the queue after it has been reported
1624             changed = True

I can't tell what kind of madness might occur if we changed the conditional from areAllJobsComplete to didAnyJobFail :-\
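
Purely as a thought experiment, a fail-fast variant might look roughly like the sketch below. It uses only the calls visible in the excerpt above and is untested against our Zuul fork.

# Hypothetical fail-fast reporting, a sketch only, not real Zuul code.
if self.pipeline.didAnyJobFail(item) and item.live:
    failing_reasons.append("at least one job failed")
    self.cancelJobs(item)    # stop the builds still running for this change
    self.reportItem(item)    # report the failure to Gerrit right away
    self.dequeueItem(item)
    changed = True
elif (not item_ahead) and self.pipeline.areAllJobsComplete(item) and item.live:
    # success path unchanged: report once everything has completed
    self.reportItem(item)
    self.dequeueItem(item)
    changed = True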

There could maybe also be an option like "recheck dont-stop-on-failure" which would use the current behavior, of letting all jobs run even if one or more have failed.

I am pretty sure that cannot be implemented, at least not without some major code added on top of our outdated Zuul fork :-\

Change 537469 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: rebuild for quibble 0.0.35

https://gerrit.wikimedia.org/r/537469

Change 537469 merged by jenkins-bot:
[integration/config@master] docker: rebuild for quibble 0.0.35

https://gerrit.wikimedia.org/r/537469

Change 537485 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: Upgrade all quibble jobs to 0.0.35

https://gerrit.wikimedia.org/r/537485

Change 537485 merged by jenkins-bot:
[integration/config@master] jjb: Upgrade all quibble jobs to 0.0.35

https://gerrit.wikimedia.org/r/537485

Change 528933 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[integration/quibble@master] Set cache directory
https://gerrit.wikimedia.org/r/528933

This has been released and deployed on September 17th with Quibble 0.0.35, so we now have Quibble setting $wgCacheDirectory. I haven't verified the outcome of the change though. When testing it, I eventually found out that the Sqlite object cache is not properly initialized and not used; that is T233117.

This has been released and deployed on September 17th with Quibble 0.0.35, so we now have Quibble setting $wgCacheDirectory. I haven't verified the outcome of the change though. When testing it, I eventually found out that the Sqlite object cache is not properly initialized and not used; that is T233117.

This still needs this patch to be merged to be more effective, I think.

Change 537743 merged by jenkins-bot:
[integration/config@master] jjb: [quibble] Fix syntax for --skip command
https://gerrit.wikimedia.org/r/537743

This will reduce the quibble job length by skipping the "npm test" step (which is already covered by the node10-test jobs). See also T233143 and T225248#5485061.

Krinkle updated the task description. Sep 18 2019, 10:07 PM

Change 528903 abandoned by Ladsgroup:
localisation: Add process cache to LCStoreDB

Reason:
Not needed anymore

https://gerrit.wikimedia.org/r/528903

Krinkle updated the task description. Sep 26 2019, 3:35 AM

Change 539691 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] selenium: Remove "RunJobs" wait from specialrecentchanges test

https://gerrit.wikimedia.org/r/539691

Change 539691 merged by jenkins-bot:
[mediawiki/core@master] selenium: Remove "RunJobs" wait from specialrecentchanges test

https://gerrit.wikimedia.org/r/539691

Krinkle added a comment (edited). Oct 12 2019, 10:16 PM

The recent doubling and other regressions in test runtime appear to have settled.

Progress report on the wmf-quibble-core-vendor-mysql-php72-docker job. (As run on mediawiki-core/master/gate-and-submit.)

|  | Total | PHPUnit (pure unit) | PHPUnit (dbless) | QUnit | Selenium | PHPUnit (db) |
| --- | --- | --- | --- | --- | --- | --- |
| June 2019 | 9-15 min | 0 s (N/A) | ~ 2 min (15,782 tests) | ~ 30 s (1286 tests) | ~ 2.5 min (43 tests) | ~ 3.5 min (4377 tests) |
| October 2019 | 9-12 min | 9 s (5,788 tests) | ~ 2 min (17,703 tests) | ~ 30 s (1338 tests) | 0 s (N/A) | ~ 5.5 min (4397 tests) |

Note:

  • The "pure unit" run for PHPUnit was newly introduced.
  • The "Selenium" sub task was moved into a parallel job.
  • The time spent waiting for a Jenkins executor is not included.
  • The time spent setting up the repos (Zuul cloner), loading dependencies (Castor, npm-install), etc, is included in the total but not in the subsequent columns.

Zuul does emit timing per pipeline/job/status. For the view over 3 months:

https://graphite.wikimedia.org/render/?width=930&height=464&_salt=1571078477.436&from=-3months&connectedLimit=&target=zuul.pipeline.gate-and-submit.job.wmf-quibble-core-vendor-mysql-php72-docker.SUCCESS.mean&lineMode=connected&yUnitSystem=msec&yMin=0
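
For scripting rather than eyeballing, the same render endpoint can also return JSON. A sketch (assuming the requests library is available; the target string is copied from the URL above):

# Sketch: pull the same Zuul job timing series as JSON instead of a PNG.
import requests

TARGET = ('zuul.pipeline.gate-and-submit.job.'
          'wmf-quibble-core-vendor-mysql-php72-docker.SUCCESS.mean')

resp = requests.get('https://graphite.wikimedia.org/render/', params={
    'target': TARGET,
    'from': '-3months',
    'format': 'json',
})
resp.raise_for_status()
series = resp.json()[0]['datapoints']  # list of [value_in_ms, unix_timestamp]
values = [v for v, _ in series if v is not None]
if values:
    print(f'mean of means over 3 months: {sum(values) / len(values) / 60000:.1f} min')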

The rise on 7/26 or 7/27 is when the job started running Selenium tests for each repository.

I think the big spike around 08/22 - 08/27 was the introduction of the LocalisationCache as a service (T231200).

The last drop, on 10/07, is the LocalisationCache service patch being introduced again (043d88f680cf66c90e2bdf423187ff8b994b1d02); it comes with an in-memory cache for l10n when running tests. That is a noticeable improvement.

The drop on 10/01 is from Selenium tests having been moved out of this job into a standalone one: wmf-quibble-selenium-php72-docker.

One should be able to look at those per-job metrics via https://grafana.wikimedia.org/d/000000283/zuul-job

A possible improvement is T235390. While the seccheck job is pretty fast, it consumes roughly 1.5GB of memory for the average extension (way more for core, where it still has to be enabled), and this usage will increase with future phan versions. Merging the two jobs would mean saving that amount of memory and hence freeing up CI resources. I don't know whether that would have a noticeable impact on the runtime, though.