
[Infra] Investigate and explain benefits / drawbacks of parallelisation of browser / E2E tests at Quibble vs. extension level
Closed, Resolved (Public)

Description

The current Quibble CI jobs for "selenium" run the end-to-end / browser tests for each component in sequence:

	wmf-quibble-selenium-php74 (17927): 811
		Setup: 15
		Versions: 1
		Ensure dir: '/workspace/log': 0
		Zuul clone : 124
		Submodule update: /workspace/src: 8
		Install composer dev-requires for vendor.git: 9
		Start backends: <MySQL (no socket)>: 3
		Run Post-dependency install: 0
		Install MediaWiki: 4
		npm install in /workspace/src: 5
		Start backends: <ExternalWebserver http://127.0.0.1:9413 /workspace/src> <Xvfb :94> <ChromeWebDriver :94>: 0
		Browser tests: mediawiki/extensions/Wikibase: 0
		Selenium extensions/Wikibase: 205
		Selenium extensions/AbuseFilter: 37
		Selenium extensions/CheckUser: 26
		Selenium extensions/Cite: 80
		Selenium extensions/Echo: 17
		Selenium extensions/FileImporter: 17
		Selenium extensions/GrowthExperiments: 78
		Selenium extensions/Math: 13
		Selenium extensions/PageTriage: 38
		Selenium extensions/ProofreadPage: 37
		Selenium extensions/VisualEditor: 40
		Selenium skins/MinervaNeue: 23
		PostBuildScript: 31

To reduce the time taken for the job, we either need to reduce the time taken to test each extension, or we need to run the tests for the different extensions in parallel.
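To put numbers on that trade-off, the timings from the log above can be summed. This is only a rough sketch: the step names are abbreviated, the units are whatever Quibble reports (presumably seconds), and the "best case" assumes every extension suite runs fully in parallel with no extra setup cost, which is optimistic.

```python
# Step timings copied from the job log above (units as reported by Quibble,
# presumably seconds). Step names are abbreviated for readability.
STEPS = [
    ("Setup", 15),
    ("Versions", 1),
    ("Ensure dir", 0),
    ("Zuul clone", 124),
    ("Submodule update", 8),
    ("Install composer dev-requires", 9),
    ("Start backends: MySQL", 3),
    ("Run Post-dependency install", 0),
    ("Install MediaWiki", 4),
    ("npm install", 5),
    ("Start backends: webserver, Xvfb, ChromeWebDriver", 0),
    ("Browser tests: mediawiki/extensions/Wikibase", 0),
    ("Selenium extensions/Wikibase", 205),
    ("Selenium extensions/AbuseFilter", 37),
    ("Selenium extensions/CheckUser", 26),
    ("Selenium extensions/Cite", 80),
    ("Selenium extensions/Echo", 17),
    ("Selenium extensions/FileImporter", 17),
    ("Selenium extensions/GrowthExperiments", 78),
    ("Selenium extensions/Math", 13),
    ("Selenium extensions/PageTriage", 38),
    ("Selenium extensions/ProofreadPage", 37),
    ("Selenium extensions/VisualEditor", 40),
    ("Selenium skins/MinervaNeue", 23),
    ("PostBuildScript", 31),
]


def summarise(steps):
    """Split the job time into Selenium suites and everything else."""
    selenium = [t for name, t in steps if name.startswith("Selenium ")]
    total = sum(t for _, t in steps)
    selenium_total = sum(selenium)
    overhead = total - selenium_total
    return {
        "total": total,
        "selenium_total": selenium_total,
        "overhead": overhead,
        "longest_suite": max(selenium),
        # Idealised: all suites run at once, so only the longest one counts.
        "best_case_extension_parallel": overhead + max(selenium),
    }


if __name__ == "__main__":
    for key, value in summarise(STEPS).items():
        print(f"{key}: {value}")
```

The Selenium suites account for roughly three quarters of the job, and the Wikibase suite alone sets the floor for any approach that keeps each extension's suite on a single agent.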

Explain the benefits and drawbacks of the different approaches.

Acceptance Criteria

  • A well-justified evaluation of the benefits and drawbacks to the different approaches to parallelising the browser tests in this Quibble job

Event Timeline

Prio Notes:

Impact Area              Affected
production / end users   no
monitoring               no
development efforts      yes
onboarding efforts       no
additional stakeholders  hopefully
ItamarWMDE renamed this task from Investigate and explain benefits / drawbacks of parallelisation of browser / E2E tests at Quibble vs. extension level to [Infra] Investigate and explain benefits / drawbacks of parallelisation of browser / E2E tests at Quibble vs. extension level. (Sep 26 2024, 1:04 PM)
ItamarWMDE moved this task from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.

When pulled into board, let's discuss what timebox should be given for this

There have already been some investigations into why the E2E tests themselves are slow (T234002) and suggestions for how to optimise the test jobs (T225730). Any work we do to parallelise the execution of tests should be considered as in addition to (rather than instead of) work that we do to keep the suites themselves slim and fast-running. Addressing slow CI jobs through parallelism runs the risk of providing an excuse not to do the hard work of optimising the test suites. That said, the optimisation work doesn't seem to be happening, and if the choice is between using more CI resources or waiting for very-low-priority test suite optimisation tasks to make it to the top of all our backlogs, my feeling is that the organisation benefits much more from the immediate reduction in test-cycle times than it suffers from the cost of the additional resources. (Whether, in fact, this means buying more computers at the end of the day, or whether this is "just" a load optimisation for our existing cluster is also not completely clear, though of course approaches that involve multiple parallel agents will duplicate setup steps.)

After discussion with @kostajh, and following the work in T374001, we could think of at least five possible approaches:

Enabling parallelism in npm at the test-runner level (wdio.conf.js's maxInstances, the agent-local cypress-parallel implementation)
Pros:

  • easy to implement and in the hands of the extension developers
  • already enabled for some extensions (see T373999)
  • doesn't use any additional CI resources

Cons:

  • Is constrained by the CPU / IO of the CI agent - only limited parallelism possible
  • Tests for the same extension may not parallelise cleanly because of race conditions or isolation issues between them

Enabling parallelism via Quibble at the extension-level, but still running on a single agent
This abandoned Quibble patch demonstrates what is meant by running the tests for different extensions in parallel.
Pros:

  • Might help avoid race conditions between tests of the same extension - those continue to run serially
  • Can be implemented entirely in Quibble - requires no support from extension developers
  • doesn't use any additional CI resources

Cons:

  • Extensions may have individual wiki configuration which would conflict if run in parallel
  • Extension test suites may put the shared test wiki instance in a state that causes tests of other extensions to fail
  • Is constrained by the CPU / IO of the CI agent - only limited parallelism possible
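As a rough illustration of this single-agent approach (not the abandoned Quibble patch itself), the sketch below runs each component's suite in a bounded thread pool. The `run_suite` helper and the `selenium-test` npm script name are assumptions made for the example, and nothing here deals with the shared-wiki conflicts listed in the cons.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_suite(path):
    """Hypothetical runner: run one component's browser tests, return the exit code.

    Assumption: each component exposes an npm script named "selenium-test".
    """
    return subprocess.run(["npm", "run", "selenium-test"], cwd=path).returncode


def run_suites_in_parallel(paths, runner=run_suite, max_workers=2):
    """Run the suites for several components concurrently on one agent.

    Tests inside each suite still run serially; only whole suites overlap.
    max_workers caps the concurrency, because the agent's CPU / IO is shared.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        exit_codes = list(pool.map(runner, paths))
    return dict(zip(paths, exit_codes))
```

Passing a different `runner` makes it possible to try the scheduling without a browser, for example `run_suites_in_parallel(["extensions/Cite", "extensions/Echo"], runner=lambda path: 0)`.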

Enabling multi-agent parallelism / sharding at a test-suite level
See wdio sharding docs and/or the @badeball/cypress-parallel approach from T374001 (patch). Sharding would be enabled for each extension - the extension test suites would still be set up and executed linearly on each shard agent and each agent would launch the suite for all extensions.
Pros:

  • Scaling is no longer constrained by the maximum resources of a single CI agent
  • Provides the best out-of-the-box balancing of agent runtimes.

Cons:

  • Uses additional CI resources
  • For extensions with very few tests, running npm install might take as long as running the test suites themselves
  • Requires changes to Quibble and to each extension
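The idea behind test-suite-level sharding can be sketched as a deterministic split of spec files across agents. This is a simplified illustration, not how wdio or cypress-parallel actually choose specs; `select_shard` and the file names are hypothetical.

```python
def select_shard(spec_files, shard_index, shard_count):
    """Return the spec files that shard `shard_index` (1-based) should run.

    Every agent sorts the same list and takes every shard_count-th file,
    so the shards are disjoint and together cover all specs without any
    coordination between agents.
    """
    if not 1 <= shard_index <= shard_count:
        raise ValueError("shard_index must be between 1 and shard_count")
    ordered = sorted(spec_files)  # stable order so that all agents agree
    return ordered[shard_index - 1::shard_count]


if __name__ == "__main__":
    specs = ["e.js", "a.js", "c.js", "b.js", "d.js"]  # hypothetical spec files
    for shard in (1, 2):
        print(shard, select_shard(specs, shard, 2))
```

Each agent would still run this for every extension in the job, which is why the per-extension setup cost (such as npm install) is repeated on every shard.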

Enabling multi-agent parallelism / sharding at an extension level
Modify Quibble to distribute different extensions' test suites to different parallel agents.
Pros:

  • Scaling is no longer constrained by the maximum resources of a single CI agent
  • Can be entirely implemented in Quibble - requires no change to the test suites
  • Each extension's test suite is run linearly and on a single agent - very easy to reproduce failures locally

Cons:

  • Uses additional CI resources
  • Extensions have very different test execution times. A naïve implementation may yield unbalanced shards
  • For extensions with very few tests, CI agent / MediaWiki setup is the dominating cost, which can be quite wasteful.
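The balancing concern can be made concrete with the suite timings from the task description. The sketch below compares a naive round-robin assignment with a longest-first greedy assignment; both functions are illustrative and neither is something Quibble is known to implement.

```python
import heapq

# Selenium suite timings from the job log in the task description.
SUITE_TIMES = {
    "Wikibase": 205,
    "AbuseFilter": 37,
    "CheckUser": 26,
    "Cite": 80,
    "Echo": 17,
    "FileImporter": 17,
    "GrowthExperiments": 78,
    "Math": 13,
    "PageTriage": 38,
    "ProofreadPage": 37,
    "VisualEditor": 40,
    "MinervaNeue": 23,
}


def round_robin(times, agents):
    """Naive split: hand suites to agents in listed order, ignoring runtime."""
    loads = [0] * agents
    for i, t in enumerate(times.values()):
        loads[i % agents] += t
    return loads


def longest_first(times, agents):
    """Greedy split: give the longest remaining suite to the least-loaded agent."""
    heap = [(0, i) for i in range(agents)]  # (current load, agent index)
    for t in sorted(times.values(), reverse=True):
        load, i = heapq.heappop(heap)
        heapq.heappush(heap, (load + t, i))
    return [load for load, _ in sorted(heap, key=lambda entry: entry[1])]


if __name__ == "__main__":
    for agents in (3, 4):
        print(agents, "agents, round-robin:  ", round_robin(SUITE_TIMES, agents))
        print(agents, "agents, longest-first:", longest_first(SUITE_TIMES, agents))
```

With these timings, round-robin over three agents leaves one agent with 400 of the 611 total, while longest-first brings the slowest agent down to 208. From four agents upwards the slowest agent cannot go below 205, because the Wikibase suite is never split.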

Move tests for specific extensions to their own CI jobs
Suggested in T287582, we could create separate jobs for running the test suites of specific extensions.
Pros:

  • Simple to understand and implement
  • Scaling is no longer constrained by the maximum resources of a single CI agent
  • Each extension's test suite is run linearly and on a single agent - very easy to reproduce failures locally
  • Can be entirely implemented in Quibble - requires no change to the test suites

Cons:

  • Uses additional CI resources
  • Requires manual, more-complicated job setup in Quibble / Jenkins

Conclusion
I find myself strongly drawn to the sharding approaches. These seem to provide most flexibility and room for future horizontal expansion without the complexity of running multiple tests in parallel on the same host / wiki. Sharding at the extension level is appealing for being relatively simple to implement, but sharding at the test-suite level provides better runtime balancing between jobs - the maximum runtime of any single job will be the critical path for the whole CI run, so we should pick an approach which has the best chance of minimising this.
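The critical-path argument can be estimated with a small model. In the job log in the description, the non-Selenium steps sum to 200 and the Selenium suites to 611. The sketch assumes perfectly balanced shards and that every agent repeats the full non-Selenium overhead, which real sharding will only approximate.

```python
OVERHEAD = 200        # non-Selenium steps from the log (setup + PostBuildScript)
SELENIUM_TOTAL = 611  # sum of all "Selenium ..." steps from the log


def estimate(shards):
    """Idealised model: perfectly balanced shards, full overhead on every agent.

    Returns (wall_clock, agent_time): the length of the critical path and the
    total agent time consumed across all shards.
    """
    wall_clock = OVERHEAD + SELENIUM_TOTAL / shards
    agent_time = OVERHEAD * shards + SELENIUM_TOTAL
    return wall_clock, agent_time


if __name__ == "__main__":
    for shards in (1, 2, 4, 8):
        wall_clock, agent_time = estimate(shards)
        print(f"{shards} shard(s): wall clock {wall_clock:.0f}, agent time {agent_time}")
```

Under these assumptions the returns diminish quickly: each additional shard adds a full 200 of duplicated setup while shaving less and less off the critical path.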

Additional input based on conversation with @hashar in #wikimedia-releng (logs - 27.11.2024). @hashar expressed a wish not to add additional complexity to the current CI implementation by introducing new Jenkins jobs (which speaks against the sharding approach), and also concerns about increasing the amount of concurrency, i.e. the load on the test cluster. Similar concerns were shared about the PHPUnit parallel test implementation (T50217).

As an alternative, @hashar suggested strategies for speeding up the existing jobs:

Run npm install in parallel
There is a feature to run npm install in parallel for the extensions in the Quibble job. It was deactivated because of a thread management bug (T303270) in the Quibble implementation. Investigating / resolving that issue and reactivating the feature would save some time.

Remove some tests / test suites from the wmf-quibble-selenium-php74 job
The purpose of the wmf-quibble-selenium job is to catch UX issues arising as a result of changes to core (and some selected extensions). In most cases, UX changes should be detected by the test suites of the extensions themselves before code is merged. @hashar suggested moving long-running test suites (e.g. Wikibase or Cite) to dedicated jobs that run on merge for the extensions.

> Conclusion
> I find myself strongly drawn to the sharding approaches. These seem to provide most flexibility and room for future horizontal expansion without the complexity of running multiple tests in parallel on the same host / wiki. Sharding at the extension level is appealing for being relatively simple to implement, but sharding at the test-suite level provides better runtime balancing between jobs - the maximum runtime of any single job will be the critical path for the whole CI run, so we should pick an approach which has the best chance of minimising this.

FWIW, I’m inclined to agree with the preference for sharding. I think I would lean towards the extension level rather than the test-suite level – to me it seems conceptually simpler, matches more closely what you might run locally, and I’m not as concerned about not getting the optimal runtime due to imperfect balancing.

> Remove some tests / test suites from the wmf-quibble-selenium-php74 job
> The purpose of the wmf-quibble-selenium job is to catch UX issues arising as a result of changes to core (and some selected extensions). In most cases, UX changes should be detected by the test suites of the extensions themselves before code is merged. @hashar suggested moving long-running test suites (e.g. Wikibase or Cite) to dedicated jobs that run on merge for the extensions.

That’s a fair point… this could perhaps be investigated more, but off the top of my head I don’t remember a case where a failing browser test from a different extension highlighted a real problem with a core or extension patch. (A quick look at git log */tests/selenium/ shows T378581, where a Wikibase browser test was broken by DiscussionTools, but IIUC in that case DiscussionTools’ CI was not affected.)

Super interesting stuff! Thanks for writing it up, @ArthurTaylor.

I was a bit surprised about the first two cons mentioned in the "parallelism via Quibble at the extension-level, but still running on a single agent" section in T374003#10356845. I haven't seen any browser test suites with custom wiki configuration, and can't think of any tests that would put the wiki in a state that would cause tests of other extensions to fail, but maybe I'm just lacking imagination.

Other than that I pretty much agree with the conclusion in T374003#10356845. Sharding at the test suite level seems simpler to me, since we don't have to do any clever balancing of extensions with longer-running test suites. I first thought that we'd lose too much time by running npm install in sequence for each suite, but that's probably not the biggest concern, and T374003#10360988 sounds like that could also be parallelized if need be.

Hi @Jakob_WMDE,

Thanks for the feedback. With troublesome tests, I'm thinking for example of the block user tests that used to cause trouble when running in parallel inside the same extension, because the same user being blocked was also being used for other tests. I can't think of a specific example right now, but I was imagining some test or extension configuration that manipulated some state associated with the wiki admin user - for example, tests that have expectations about the state of Log / History pages. But yes - I don't actually have concrete examples.

> Super interesting stuff! Thanks for writing it up, @ArthurTaylor.
>
> I was a bit surprised about the first two cons mentioned in the "parallelism via Quibble at the extension-level, but still running on a single agent" section in T374003#10356845. I haven't seen any browser test suites with custom wiki configuration, and can't think of any tests that would put the wiki in a state that would cause tests of other extensions to fail, but maybe I'm just lacking imagination.

IPInfo and GrowthExperiments both modify LocalSettings.php at runtime, to enable testing more complex scenarios.

Thanks for also asking my opinion @karapayneWMDE

I'm not sure I have vastly wiser things to say about sharding than have already been said. I'd happily also stick my 2c in for the sharding approach as something we could try.
I think I lean towards test-suite-level sharding, for the simplicity of its out-of-the-box nature and for future scaling.

It seems like testing this as an experimental job might allay fears for infrastructure overload? Seems to me like going further with T374001 and turning on the experimental jobs or a similar approach for wdio would teach us something fairly quickly.

If I'm understanding the numbers above, I also see some appeal in "Move tests for specific extensions to their own CI jobs" as a possible quick win. With Wikibase, Cite and GrowthExperiments being such outliers in runtime, chopping out just these might provide the best speed-up-to-effort ratio, while acknowledging it won't auto-balance and might require more manual config / babysitting in the future.

karapayneWMDE removed karapayneWMDE as the assignee of this task.