Page MenuHomePhabricator

Investigate if we can speed up PHPUnit ResourcesTest
Open, Needs TriagePublicSpike

Description

Background

Test platform main goal this quarter is to reduce the developer feedback time for the core CI below 10 minutes T420590.
We are almost there, We now have all the jobs running below <10 minutes except quibble-with-gated-extensions-vendor-mysql-php83.

Problem

In quibble-with-gated-extensions-vendor-mysql-php83 job, we have noticed that the ResourcesTest class takes
the most time to run by far, compared to other classes. Checking the cache results of a sample run of the gated extensions job, the top 3 slowest test classes are:

ClassDuration
ResourcesTest~ 420 sec
ParserTest_Cite_1_citeParserTests_Test~98 sec
ParserIntegrationTest~50 sec

When the PHPUnit test splitter distributes the classes to the available split groups, ResourcesTest gets assigned
to one split group. In this sample case case, the database section timings are like this:

GroupDuration
split_group_5505.297s
split_group_1215.208s
split_group_0211.558s
split_group_2200.622s
split_group_3193.049s
split_group_4141.554s
split_group_617.441s
split_group_71.395s

The ResourcesTest class is assigned to split group 5, together with 6 other classes. The below timings indicate how long the classes took inside that split group #5:

Test classDuration
ResourcesTest470.568s
ApiStructureTest20.383s
ContentHandlerFunctionalTest4.149s
PerformanceBudgetTest2.336s
ApiPrefixUniquenessTest0.231s
DatabaseIntegrationTest0.046s
DumpableObjectsTest0.032s

Because the split groups run in parallel, the time taken by the database PHPUnit stage is decided by the slowest split group.
The job has to wait for all split groups to finish before it can move on. This means the group that gets ResourcesTest is
disproportionally slower than the other groups
. Even though the other split groups are done, the job still has to wait for
the ResourcesTest group to finish.

Exploration

We have tried a few ideas to make this class run faster, but with our limited knowledge of the PHPunit tests, MediaWikiIntegrationTestCase and ResourceLoader, we
have not been able to land a solution that does not introduce a maintenance burden to allow continuity.

Examples of what we
have tried:

1279380, tests: reuse cache in ResourcesTest::testRespond

  • The class crawls through each of the resource loader modules fed as a data provider to a test case. Profiling show a lot of time is spent in Wikimedia\Minify\JavaScriptMinifier::minifyInternal. That is for JavaScript minification which is a CPU intensive, the result is cached in LocalServerObjectCache. The setup/teardown wipes the minimification cache.
  • The change proposes to keep a copy of the Cache and restore it via setService(). Peter pointed that also reset all services which might be a problem? See https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1279380/comments/3624ca64_af2e7a22

1284007 tests: reuse integration fixture in ResourcesTest::testRespond

  • This is an attempt to flag a service to prevent it from being reset by the MediaWikiIntegrationTestCase setup/teardown. Antoine feels it is a lot of added complexity.

1296081) WIP: Split ResourceLoader response structure tests by module

  • That split the large test to smaller batches. By balancing the tests to different split groups, the total feedback time is lowered. It is done by generating tests class rather than splitting them.
  • The cache is still wiped, thus the aggregated runtime is more or the same.

Goal

If ResourcesTest can run faster, it would help us make the jobs run faster. We are requesting input from anyone familiar
with ResourcesTest, MediaWikiIntegrationTestcase setup/teardown of services and the ResourceLoader module. We want to reduce ResourcesTest runtime while keeping its guarantees.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptApr 27 2026, 3:33 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change #1254960 had a related patch set uploaded (by Pwangai; author: Phedenskog):

[mediawiki/core@master] tests: batch testRespond

https://gerrit.wikimedia.org/r/1254960

Change #1279380 had a related patch set uploaded (by Hashar; author: Hashar):

[mediawiki/core@master] tests: reuse cache in ResourcesTest::testRespond

https://gerrit.wikimedia.org/r/1279380

Change #1254960 abandoned by Phedenskog:

[mediawiki/core@master] tests: batch testRespond

Reason:

Lets go with https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1284007 or https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1284585

https://gerrit.wikimedia.org/r/1254960

Change #1284007 had a related patch set uploaded (by Pwangai; author: Pwangai):

[mediawiki/core@master] tests: reuse integration fixture for ResourcesTest::testRespond

https://gerrit.wikimedia.org/r/1284007

Change #1286949 had a related patch set uploaded (by Pwangai; author: Pwangai):

[mediawiki/core@master] tests: reuse local object cache in ResourcesTest

https://gerrit.wikimedia.org/r/1286949

Change #1287037 had a related patch set uploaded (by Pwangai; author: Pwangai):

[mediawiki/core@master] WIP: Shard heavy class testRespond

https://gerrit.wikimedia.org/r/1287037

Change #1287435 had a related patch set uploaded (by Pwangai; author: Pwangai):

[mediawiki/core@master] WIP: testRespond save cache in memory

https://gerrit.wikimedia.org/r/1287435

Change #1287435 abandoned by Pwangai:

[mediawiki/core@master] WIP: testRespond save cache in memory

https://gerrit.wikimedia.org/r/1287435

When tests run in parallel, the ResourcesTest class is assigned to a single split group, creating a bottleneck. One working approach is to split its heavy testRespond cases evenly across the 7 available split groups. The parallelization algorithm then treats each split as an independent test class, which lets us distribute them across the 7 split groups weighted by execution time. This balances the load more effectively and produces an immediate performance gain from the initial >370 sec run time of the class:

ResourcesTest: 113.77377754114723,
ResourcesTestRespondShard0Test: 69.58335423999999,
ResourcesTestRespondShard1Test: 62.61069632000001,
ResourcesTestRespondShard2Test: 66.62046847999999,
ResourcesTestRespondShard3Test: 66.05744383999999,
ResourcesTestRespondShard4Test: 67.68996288000001,
ResourcesTestRespondShard5Test: 85.78878912,
ResourcesTestRespondShard6Test: 59.08599999999999,

It's good to note that these speed benefits only apply when tests run in parallel. Since CI always runs in parallel, this is fine in practice. No improvement is expected when running serially. Depending on how upstream paratests runs tests in parallel, when we eventually switch to upstream, we may do away with sharding or decide to keep it.
This approach saves ~3 min of run time from the target job (quibble-with-gated-extensions-vendor-mysql-php83 ) pushing it below the 10 min mark.

Another approach is reusing the integration fixture in ResourcesTest::testRespond. ResourcesTest generates thousands of data-provider rows covering various test cases, and each row currently performs an expensive rebuild and teardown of the full integration fixture. The solution is to reuse the MediaWikiServices/DB fixture as described in the patchset. This saves time on every test scenario and cumulatively trims roughly 3 minutes off total runtime, bringing the gated extensions job below the 10 minute threshold. The implementation remains valid after we migrate to PHPUnit 10's upstream paratest. The savings are visible in the split group timings.

Sample CI split group timings of this approach:

{
PHPUnit database split_group 6: 15.864475011825562,
PHPUnit database split_group 3: 110.49932599067688,
PHPUnit database split_group 7: 114.84924292564392,
PHPUnit database split_group 1: 160.41772389411926,
PHPUnit database split_group 2: 161.13036608695984,
PHPUnit database split_group 0: 181.93444919586182,
PHPUnit database split_group 4: 185.5525860786438,
PHPUnit database split_group 5: 191.25670099258423
}

Compared to sample normal run without this solution:

{
PHPUnit database split_group 6: 16.930068016052246,
PHPUnit database split_group 7: 129.78758096694946,
PHPUnit database split_group 2: 135.40559792518616,
PHPUnit database split_group 1: 162.2151710987091,
PHPUnit database split_group 0: 162.8203480243683,
PHPUnit database split_group 3: 180.08711004257202,
PHPUnit database split_group 4: 208.72650790214539,
PHPUnit database split_group 5: 492.2596230506897
}

The approach in T424462#11923115 is definitely attractive because of the impact on the split group durations. Besides that I don't have a strong preference between the two approaches (T424462#11923115 vs. T424462#11923247) - I think the first approach is easier to read and understand if you know that test suites are going to be run in parallel and split by class, but the second approach doesn't require any knowledge of the test-splitting system to make sense of the optimisations (though it does require pretty deep insights into the MediaWikiIntegrationTestCase and PHPUnit test suite lifecycle). The important thing for me is that we assume that this code will live in the repo for a while and need to be maintained - it might be another year or two before we switch to paratest.

We talked about the ResourcesTest class being rather slow some weeks ago.

One large contender is ResourcesTest::testRespond. That test has a data provider which feeds it every ResourceLoader modules defined by MediaWiki and its extensions. I did ran it with profiling and most of the time was spent minifying JavaScript, and the second contender was compiling .less stylesheets.

Because minifying is slow and the JavaScript files are referred to by multiple stylesheets/modules, we have the output cached in a LocalServerObjectCache instance (php shared memory?).

The test uses a data provider, PHPUnit thus generate a test per module. Each of those tests individually invokes MediaWikiIntegrationTestCase setup/teardown cycle which is:

  • slow (that is itself an Epic)
  • resets the LocalServerObjectCache

We thus loose the previously minified output and each test triggers redo what was previously done when they could have fetched it from the cache.

We need to preserve the cache between tests. I gave it a go at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1279380:

On my local machine Wikimedia\Minify\JavaScriptMinifier::minifyInternal went from 20,86 seconds down to 3.8 seconds thanks to the cache being restored. With core + Vector, testRespond went from 1m25s to 1m01s (reference).

@pwangai commented about the use of MediaWikiIntegrationTestCase::setService() https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1279380/comments/3624ca64_af2e7a22 . I am not familiar with those interfaces, hence we should ask people from MediaWiki engineering teams.

Change #1287037 abandoned by Pwangai:

[mediawiki/core@master] tests: Shard heavy class testRespond

Reason:

Introduces complexities

https://gerrit.wikimedia.org/r/1287037

Another idea we had tried was making ResourcesTest run its own split group. I took a sample gated-extensions job run, and went digging into the saved artifacts. I do not think we have a lot to gain by putting the ResourcesTest class in its own split group. For this particular run in the database section, the split group timings are like this:

split_group_5  505.297s
split_group_1  215.208s
split_group_0  211.558s
split_group_2  200.622s
split_group_3  193.049s
split_group_4  141.554s
split_group_6   17.441s
split_group_7    1.395s

The ResourcesTest class is assigned to split group 5, together with 6 other classes. The below timings indicate how long the classes took:

ResourcesTest                 470.568s
ApiStructureTest               20.383s
ContentHandlerFunctionalTest    4.149s
PerformanceBudgetTest           2.336s
ApiPrefixUniquenessTest         0.231s
DatabaseIntegrationTest         0.046s
DumpableObjectsTest             0.032s

ResourcesTest share is:

ResourcesTest time:  470.568 / 497.745s summed timed
Other class time:     27.177s

And the time ratios are consistent with other runs. Putting the class in it's own split group will save us ~30 seconds at most. This would not be a big diff, but 30 seconds is not enough to pursue this idea IMO, considering the job would still remain >10min mark.

I dug further into ResourcesTest::provideRespond() from the above data. Right now testRespond runs every ResourceLoader module against every installed skin. In the above job, the breakdown of the total 445 seconds taken by the ResourcesTest::testRespond is as follows:

vector-2022             80.654s
vector                  79.389s
minerva                 77.315s
contenttranslation      46.133s
json                    42.407s
authentication-popup    40.885s
apioutput               39.521s
fallback                39.206s

I guess my question would be if test coverage for all of these is important, or could we get away with limiting coverage to [ 'minerva', 'vector', 'vector-2022' ], or maybe what we could exempt?

I also want to add that this job is the job that uses the most % of CI time of all jobs (see https://releng-data.wmcloud.org/-/dashboards/ci-by-repo-and-job). Last month the CI spent 872.8 hours in that job, that is 12% of the total CI runtime per month. It would be great if we can find a way to make it use less CI time/faster.

Summary

Short summary from earlier comment at T424462#11959958

ResourcesTest:testRespond() is fed by a data provider yielding all the resource module that are defined by MediaWiki and extensions loaded in the CI job. That is several hundred of them.

The JavaScript snippets are minified, which is CPU heavy. The result is cached in LocalServerObjectCache which is wiped between each tests by MediaWikiIntegrationTestCase setup/teardown.

Fix

The fix I found in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1279380/3 is to save and restore that cache inside the test using setService():

static $srvCache;
if ( $svCache === null ) {
  MediaWikiServices::getInstance()->getLocalServerObjectCache();
}
$this->setService( 'LocalServerObjectCache', $srvCache );

@pwangai pointed setService() does reset all services and could add some overhead. I thus went with a second pattern replacing setService() by resetServiceForTesting() and redefineService():

if ( $srvCache === null ) {
    $srvCache = $services->getLocalServerObjectCache();
}
// using setService() resets all services which adds some overhead
$services->resetServiceForTesting( 'LocalServerObjectCache' );
$services->redefineService(
    'LocalServerObjectCache',
    static function ( MediaWikiServices $services ) use ( $srvCache ) {
        return $srvCache;
    }
);

Benchmarking

I ran it three time using mediawiki/core + Vector with and that second pattern is faster ( from https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1279380/comments/3624ca64_af2e7a22 )

Using time ./vendor/bin/phpunit tests/phpunit/structure/ResourcesTest.php --filter testRespond

Results

Baseline (HEAD^)

1m7,436s
1m10,087s

  • With setService()**

56,276s
55,000s
56,450s

With resetServiceForTesting() / redefineService()

54,506s
55,481s
54,867s

Conclusion

Thus resetting all services (as a side effect of using setService()) does cause some overhead compared to redefining a single service. It is not much but that piles up the more tests we run and testRespond is passed hundred of test cases so it matters a bit.

The question for MediaWiki devs is: is the pattern of using resetServiceForTesting() / redefineService() instead of setService() a valid in this case?

Another approach is reusing the integration fixture in ResourcesTest::testRespond. ResourcesTest generates thousands of data-provider rows covering various test cases, and each row currently performs an expensive rebuild and teardown of the full integration fixture. The solution is to reuse the MediaWikiServices/DB fixture as described in the patchset. This saves time on every test scenario and cumulatively trims roughly 3 minutes off total runtime, bringing the gated extensions job below the 10 minute threshold. The implementation remains valid after we migrate to PHPUnit 10's upstream paratest. The savings are visible in the split group timings.

This sounds like a good idea, but I'm trying to think of a simpler way to do it. If I understand correctly, we're paying the setup cost for each of the hundreds of test cases generated by the data provider. Could we also improve it by basically moving the nested loop over module names from provideRespond() to testRespond()? Or generating test cases that cover groups of (let's say) 50 modules, instead of just one module each?

I think that wouldn't interfere with the code under test (ResourceLoader normally handles multiple modules per request), it wouldn't make debugging failing tests much worse (we could indicate which module failed in the assertion message instead), and it would be easier to maintain in the long term than introducing special code to the test runner just for this one test.

I dug further into ResourcesTest::provideRespond() from the above data. Right now testRespond runs every ResourceLoader module against every installed skin. In the above job, the breakdown of the total 445 seconds taken by the ResourcesTest::testRespond is as follows:

vector-2022             80.654s
vector                  79.389s
minerva                 77.315s
contenttranslation      46.133s
json                    42.407s
authentication-popup    40.885s
apioutput               39.521s
fallback                39.206s

I guess my question would be if test coverage for all of these is important, or could we get away with limiting coverage to [ 'minerva', 'vector', 'vector-2022' ], or maybe what we could exempt?

Many of these don't need to be tested IMO. contenttranslation, json, authentication-popup, apioutput, fallback (the bottom 5) are not general-purpose skins; 99% of modules will in practice never be used with them. (For example contenttranslation is only used on Special:ContentTranslation, and apioutput is only used on human-readable api.php responses).

I think we can drop testing against all skins marked as "skippable" (see SkinFactory.php#73), which would include all 5 of those; except that we should keep testing against the fallback skin, since I'd like us to run the tests against a "basic" skin too.

By the way, it's surprising to me that running this test for Vector or Minerva takes almost 2x the time that the more "basic" skins use. Maybe this is something that could also be investigated.

Change #1279380 abandoned by Hashar:

[mediawiki/core@master] tests: reuse minify cache in ResourcesTest::testRespond

Reason:

Abandoning in favor of using a for loop I0fcdc3c5a8543feb1b029eb1361a70a18b89b920

https://gerrit.wikimedia.org/r/1279380