
In CI, only run tests for code that is affected by a given change
Open, Needs Triage, Public

Description

Currently, we run all (core) tests for every iteration of every change on Gerrit. This takes on the order of 30 minutes. If we had a way to run only the tests for the classes that are affected by a given change, that would save us a lot of time during development. (We'd still want to run *all* the tests at some point, of course, just to be sure - at least when cutting a deployment branch.)

For PHPUnit, this could be achieved by tracking the dependencies between classes (T351336): when class X is changed, we need to run only the tests for the classes that depend on X (directly or indirectly).
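
A minimal sketch of that selection step, assuming a class-level dependency map is already available (the helper below is hypothetical, not an existing MediaWiki API): invert the edges, then walk the graph from the changed class to collect everything that depends on it.

```php
<?php
// Hypothetical helper: $dependsOn maps each class to the classes it
// depends on; return the changed class plus everything that depends
// on it, directly or transitively.
function affectedClasses( array $dependsOn, string $changed ): array {
	// Invert the edges: class => classes that depend on it.
	$dependants = [];
	foreach ( $dependsOn as $class => $deps ) {
		foreach ( $deps as $dep ) {
			$dependants[$dep][] = $class;
		}
	}
	// Breadth-first walk starting from the changed class.
	$affected = [ $changed => true ];
	$queue = [ $changed ];
	while ( $queue ) {
		$current = array_shift( $queue );
		foreach ( $dependants[$current] ?? [] as $dependant ) {
			if ( !isset( $affected[$dependant] ) ) {
				$affected[$dependant] = true;
				$queue[] = $dependant;
			}
		}
	}
	return array_keys( $affected );
}

// Example: if Title depends on TitleParser, changing TitleParser
// selects the tests for both classes.
$graph = [ 'Title' => [ 'TitleParser' ], 'TitleParser' => [] ];
var_dump( affectedClasses( $graph, 'TitleParser' ) );
```

The test runner would then execute only the test classes covering the returned names.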

Event Timeline

Currently, we run all (core) tests for every iteration of every change on Gerrit. This takes on the order of 30 minutes.

Sorry to nitpick, but I think it's closer to 17 minutes nowadays (e.g. https://gerrit.wikimedia.org/r/c/mediawiki/core/+/972379/32#message-4c2aa7a596758a591510e1efbf28b88dcb46a57f). (See also T225730: Reduce runtime of MW shared gate Jenkins jobs to 5 min.)

I am unsure about doing this in CI; I'd worry that we end up missing running a test and we find out about it when something breaks in production.

However, I would personally find the ability to run a subset of tests locally, before pushing a patch, to be quite beneficial.

Oh, nice! Thank you for pointing that out!

But we are not running tests for all extensions, partially because they take too long. If we had a smart way to pick which tests to run, we could enable more.

I am unsure about doing this in CI; I'd worry that we end up missing running a test and we find out about it when something breaks in production.

We should definitely still run all the tests when making the deployment branch. We could also still run all tests as a gate-and-submit job. But we could get feedback more quickly for each new patch set.

This has been discussed extensively many times, with lots of discussion for the benefits and costs; this task should probably be merged into one of the existing (generally Declined) tasks and that re-opened. T42008 is the oldest one I can find quickly – should we merge to that?

This has been discussed extensively many times, with lots of discussion for the benefits and costs; this task should probably be merged into one of the existing (generally Declined) tasks and that re-opened. T42008 is the oldest one I can find quickly – should we merge to that?

That task doesn't seem to be about the same thing - we should indeed run core tests for changes made to extensions. The question of "dependencies" through hooks is interesting in this context, though. For now, I'd be happy with just having this ability for core, so we don't have to worry about hooks.

I remember that we have discussed this before, but I don't remember when and where exactly. In the past, it would have been very difficult to determine what code is affected by a change. But with type hints everywhere, this has become much more feasible.

This has been discussed extensively many times, with lots of discussion for the benefits and costs; this task should probably be merged into one of the existing (generally Declined) tasks and that re-opened. T42008 is the oldest one I can find quickly – should we merge to that?

That task doesn't seem to be about the same thing - we should indeed run core tests for changes made to extensions. The question of "dependencies" through hooks is interesting in this context, though. For now, I'd be happy with just having this ability for core, so we don't have to worry about hooks.

I remember that we have discussed this before, but I don't remember when and where exactly. In the past, it would have been very difficult to determine what code is affected by a change. But with type hints everywhere, this has become much more feasible.

OK, will give a quick answer here.

Ignoring setup time (which doesn't vary based on which subset of the tests we run), the tests that take up the largest part of CI wall-time (as experienced by devs) are Wikibase's database and browser tests. For example, on your latest patch:

From the main test job:

  • PHPUnit unit tests (core + gated extensions) took 11s
  • PHPUnit integration tests (core + gated extensions) took 2m 6s (Wikibase is most of the warned-about 'slow' tests, but no split-out)
  • PHPUnit database tests (core + gated extensions) took 10m 53s (Wikibase is most of the warned-about 'slow' tests, but no split-out)
  • QUnit JavaScript (core + gated extensions) tests took 26s

From the browser test job:

  • Browser tests (core) took 35s
  • Browser tests (gated extensions) took 7m 41s (Wikibase is ~1m 43s of this)

Given the monolithic nature of MediaWiki, I don't think it's reasonable to think we can determine as an expert human (much less through a heuristic) what will and won't affect Wikibase (or any other sub-system).

As an alternative to this proposal, I'd suggest splitting the PHPUnit DB tests out to their own job, so they run in parallel. (This is what we did when the browser tests were ~60% of our main test job.)

Given the monolithic nature of MediaWiki, I don't think it's reasonable to think we can determine as an expert human (much less through a heuristic) what will and won't affect Wikibase (or any other sub-system).

That's the point of T351336: With good dependency analysis, we can indeed figure this out. Not with 100% certainty, but hopefully well enough to allow us to run per-patch-set tests more quickly.

Given the monolithic nature of MediaWiki, I don't think it's reasonable to think we can determine as an expert human (much less through a heuristic) what will and won't affect Wikibase (or any other sub-system).

I think this is true, but a heuristic could make a guess that's mostly right. Even with a less sophisticated system than a full dependency graph.

As an example, a change to CSS could skip database tests.
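
As a minimal sketch of that kind of heuristic (hypothetical, not an existing CI hook), assuming the job has the list of files touched by the patch:

```php
<?php
// Hypothetical heuristic: skip the database suite only when every
// changed file is a pure style/image asset; when in doubt, run it.
function needsDatabaseTests( array $changedFiles ): bool {
	$assetExtensions = [ 'css', 'less', 'svg', 'png', 'gif' ];
	foreach ( $changedFiles as $file ) {
		$ext = strtolower( pathinfo( $file, PATHINFO_EXTENSION ) );
		if ( !in_array( $ext, $assetExtensions, true ) ) {
			// A non-asset file changed; play it safe and run everything.
			return true;
		}
	}
	return false;
}
```

Defaulting to the full suite whenever the heuristic is unsure keeps the failure mode cheap: the worst case is a slow build, not a missed regression.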

But the key question is: would that speed up overall test times (for a subset of patches)? And, if so, would it result in broken builds when we run the full test suite?

Given we have a low tolerance for broken builds, is limiting the number of tests that run per patch worth investigating as a way to speed up build times?

My reading of your reply is: not really :)

I think @hashar has also given this some thought (in his investigations with bazel).

As an alternative to this proposal, I'd suggest splitting the PHPUnit DB tests out to their own job, so they run in parallel. (This is what we did when the browser tests were ~60% of our main test job.)

I like that. One thing that I think about is the setup costs of each job. Would lowering setup cost (e.g., cloning, fetching dependencies, setting up resources) make splitting jobs easier?

Given the monolithic nature of MediaWiki, I don't think it's reasonable to think we can determine as an expert human (much less through a heuristic) what will and won't affect Wikibase (or any other sub-system).

I think this is true, but a heuristic could make a guess that's mostly right. Even with a less sophisticated system than a full dependency graph.

Yes, certainly for local testing it'd be great for e.g. the PHPUnit test runner to have a command that speculatively only runs a few dozen tests rather than thousands, even if it's imperfect.

As an alternative to this proposal, I'd suggest splitting the PHPUnit DB tests out to their own job, so they run in parallel. (This is what we did when the browser tests were ~60% of our main test job.)

I like that.

Filed as T351426.

One thing that I think about is the setup costs of each job. Would lowering setup cost (e.g., cloning, fetching dependencies, setting up resources) make splitting jobs easier?

Yes, speeding up the clone step would be brilliant. For instance, on the selenium test above:

  • Clone the relevant code: 1m 36s
  • Install composer dependencies: 7s
  • Install npm dependencies: 6s

I think your work on making dependency installation fast has been brilliant, but the actual fetching from gerrit is our significant road bump now.

I think your work on making dependency installation fast has been brilliant, but the actual fetching from gerrit is our significant road bump now.

Filed as T351466.

But the key question is: would that speed up overall test times (for a subset of patches)? And, if so, would it result in broken builds when we run the full test suite?

That could be prevented by running all tests as a gateway job before merging.

I don't think the purpose of this task can be achieved in our current state, even though T351336 is appealing. You would need to build a graph of the code base and then identify which areas of the code are affected, which can prove challenging (and I am not even mentioning MediaWiki hooks or integration tests with extensions).

There are build systems that do that; one I have been contemplating is https://bazel.build/, the open-source version of Google's build system. I have been using it for Gerrit deployment and, as an end user, it has a steep learning curve but eventually ends up being pleasant to use. If I go hack a single Java file or a JS file, I just run `bazel test //...` and it will rebuild what is affected and only run the tests affected by the code change, so in most cases I have feedback in a few seconds.

Of course, Bazel has no support for PHP, and it requires the dependencies to be defined manually, or via an extra tool which analyses your code to generate them. I played a bit with it, but that is definitely a serious engineering effort which needs to be set on the right track from day 1 and would definitely require cross-collaboration with upstream.

Overall I think that is related to the behemoth task T225730. In an ideal world, I would gather a few people face to face for, say, a week to write down the problem statement and start establishing a rough plan, then get it funded via the annual planning and dedicate more than a couple of people to work solely on those problems. Then, I don't know; I have trouble staying focused on a given topic, short of, say, taking a 6-month sabbatical to work just on that topic ;)


Side track to fail faster:

  • if one sends a patch affecting some test files, we should run those first (i.e. anything touched under tests/phpunit/*)
  • we could then run the test files associated with the code files that got changed (see the sketch after this list); e.g. includes/Permissions/SimpleAuthority.php has:
    • unit tests at tests/phpunit/unit/includes/Permissions/SimpleAuthorityTest.php
    • possibly integration tests at tests/phpunit/integration/includes/Permissions/SimpleAuthorityIntegrationTest.php

That way, if something fails, it at least fails fast; then run everything else.
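
A minimal sketch of that path mapping (the helper name is hypothetical):

```php
<?php
// Map a changed core source file to the conventional unit and
// integration test locations, keeping only those that actually exist.
function candidateTestFiles( string $sourceFile ): array {
	if ( !preg_match( '!^includes/(.+)\.php$!', $sourceFile, $m ) ) {
		return [];
	}
	return array_values( array_filter( [
		"tests/phpunit/unit/includes/{$m[1]}Test.php",
		"tests/phpunit/integration/includes/{$m[1]}IntegrationTest.php",
	], 'file_exists' ) );
}

// candidateTestFiles( 'includes/Permissions/SimpleAuthority.php' )
// would return whichever of the SimpleAuthority test paths exist on disk.
```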

Another issue is that extensions have dependencies upon other extensions, and once all of them are cloned, we blindly run every single test.

James added @group Standalone in T225068, which if I recall correctly was to prevent running the 3-minute-long Scribunto integration tests when one was sending a patch to e.g. Wikibase (which depends on Scribunto, but those standalone tests are not affected by Wikibase). So it is merely an opt-in solution to this task's proposal.
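
For reference, a minimal illustration of how that opt-in looks (the class name below is made up): tests tagged with PHPUnit's @group annotation can be excluded from a run with `phpunit --exclude-group Standalone`.

```php
<?php
/**
 * Slow cross-extension tests that should only run for this
 * extension's own patches.
 *
 * @group Standalone
 */
class ScribuntoSlowIntegrationTest extends MediaWikiIntegrationTestCase {
	// ...
}
```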

For integration tests between extensions, I don't have any idea. We need a way to express an integration test between A and B, have it triggered when patches are sent for both, and skip running unrelated tests. Which, I guess, requires a DAG - turns out that is what Bazel does.

Basic parallelisation has already been shown to reduce MW CI jobs to 2-5 min instead of 15-20 min (see subtasks of T225730). That might be worth investing in first, or instead, before a significantly more complex system to build, maintain, and understand is put in place (with the barriers that inevitably creates, plus the inherent risks it adds).

One of my questions would be how fast it is to update this analysis when working on a commit stack. If stale analysis is applied, or if a fresh analysis takes longer than 1 min, it might not be worthwhile.

But maybe the aim is local development? Personally, I tend to run tests for only the component being worked on (i.e. one core directory, or one extension directory) and then rely on CI for the rest. After T225730, I'd probably run at least core fully locally if it only takes ~2 min (bare metal, not a VM).

Another question would be how it deals with runtime variation. For example, some dependencies may be based on runtime configuration (selecting which implementation of a service to use) or hooks (involving significantly more code than might appear from a static analysis perspective). This abstracting and hiding of dependencies is intentional and desirable from the perspective of reducing coupling, but would be counter-productive from a test coverage perspective.