
Jenkins: In the integration project in WMCS, test out larger workers for CI
Open, MediumPublic3 Estimated Story Points

Description

An idea from the RelEng team day at the Dec 2024 offsite: would larger workers make concurrent PHPUnit runs faster?

Acceptance criteria:

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani set the point value for this task to 3.

Some notes/questions from discussion:

brennen moved this task from Backlog to Doing/Involved on the User-brennen board.
brennen updated the task description.

@kostajh @ArthurTaylor: We noted this in core/includes/composer/ComposerLaunchParallel.php:

	private const SPLIT_GROUP_COUNT = 8;

Since one of our goals is to see if this is faster with more CPUs, is there a way we could adjust this from an environment variable? That would presumably let us pass a value from Quibble (or other testing environments, manual experiments, etc.).
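The pattern being asked for (an environment-variable override with a safe fallback to the hard-coded default) could be sketched like this. This is an illustrative Python sketch, not the actual PHP change; the variable name `PHPUNIT_SPLIT_GROUP_COUNT` is a hypothetical placeholder, not necessarily what the merged patch uses.

```python
import os

# Default mirrors SPLIT_GROUP_COUNT in ComposerLaunchParallel.php.
DEFAULT_SPLIT_GROUP_COUNT = 8

def split_group_count() -> int:
    """Return the PHPUnit split group count, overridable via an
    environment variable (variable name here is hypothetical)."""
    raw = os.environ.get("PHPUNIT_SPLIT_GROUP_COUNT", "")
    try:
        count = int(raw)
    except ValueError:
        # Unset or non-numeric: fall back to the compiled-in default.
        return DEFAULT_SPLIT_GROUP_COUNT
    # Guard against zero or negative values from the environment.
    return count if count > 0 else DEFAULT_SPLIT_GROUP_COUNT
```

Quibble (or a manual experiment) would then only need to export the variable before invoking the test runner.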

@brennen yeah - that should be pretty easy. I've created T383389 to track this. In case someone else gets to it sooner, I don't think it's especially complicated. Note that per T378797, you might get another 50% speed increase for PHPUnit just by balancing the split groups. That's on our (Wikidata) backlog, but we won't get around to that until we've taken a look at the Selenium job runtimes.

Mentioned in SAL (#wikimedia-releng) [2025-01-14T16:50:40Z] <brennen> integration project: create integration-agent-docker-1058 for T383254

Mentioned in SAL (#wikimedia-releng) [2025-01-14T17:14:23Z] <brennen> integration project: create integration-agent-docker-1059 for T383254

Wound up burning -1058 during the setup process. Started over with -1059, which is currently running. Will check https://tools-static.wmflabs.org/jenkins-build-stats/2025/01/ once we have some stats for the new box there.

Change #1112099 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[mediawiki/core@master] ComposerLaunchParallel: Configure split group count from environment

https://gerrit.wikimedia.org/r/1112099

Change #1112099 merged by jenkins-bot:

[mediawiki/core@master] ComposerLaunchParallel: Configure split group count from environment

https://gerrit.wikimedia.org/r/1112099

Change #1200415 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[integration/quibble@master] WIP: Parallel runs: Add a group count

https://gerrit.wikimedia.org/r/1200415

Change #1200415 abandoned by Brennen Bearnes:

[integration/quibble@master] WIP: Parallel runs: Add a group count

Reason:

Seems like this can just be done in Jenkins directly in web UI.

https://gerrit.wikimedia.org/r/1200415

Brennen and I discussed this yesterday. We currently have instances with 8 virtual CPUs; I based this on having four executors / parallel builds per Jenkins agent, which is 2 vCPUs per build. The idea was that one CPU would be used for PHPUnit and the other for MySQL, and that for Selenium tests two CPUs would cover the browser + web service + MediaWiki.

Some effort was made to add parallelism to the Selenium tests and, more recently, the PHPUnit tests. In MediaWiki the PHPUnit tests are split over 8 processes, so with 3 concurrent builds the demand is 3 executors × 8 processes = 24 processes, and that is not counting the pressure from the 3 instances of MySQL. The figure is similar or worse for the Selenium tests.
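The back-of-the-envelope demand per agent works out as follows (a trivial sketch using the figures above; the constants just restate the numbers from this comment):

```python
# Demand per Jenkins agent, using the figures discussed above.
EXECUTORS = 3             # concurrent builds per agent (tuned down from 4)
PHPUNIT_SPLIT_GROUPS = 8  # SPLIT_GROUP_COUNT in MediaWiki core

phpunit_processes = EXECUTORS * PHPUNIT_SPLIT_GROUPS
mysql_instances = EXECUTORS  # one MySQL per concurrent build

print(f"{phpunit_processes} PHPUnit processes "
      f"+ {mysql_instances} MySQL instances per agent")
```

On an 8 vCPU instance that is roughly 24 test processes competing for 8 cores before counting MySQL, which is why the PSI graphs below are worth watching.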

A way to monitor this is to look at Linux Pressure Stall Information (PSI), which is similar to the 1970s load average but split by CPU/memory/IO. For each resource it differentiates between the time during which only some tasks are waiting for the resource (some) and the share of time during which all active tasks are waiting (full).

For the integration project, the Pressure Stall Information can be seen at https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?var-project=integration

One can compare:

| Instance | vCPUs | Flavor | PSI Graph |
| --- | --- | --- | --- |
| integration-agent-docker-1057 | 8 | g4.cores8.ram24.disk20.ephemeral90.4xiops | Graph |
| integration-agent-docker-1059 | 16 | g4.cores16.ram48.disk20.ephemeral90.4xiops | Graph |

1057, 8vCPU 3 executors:

1057_PSI_12hours.png (388×917 px, 99 KB)

1059, 16vCPU 1 executor:

1059_PSI_12hours.png (388×917 px, 86 KB)

I think at some point I set the agents to have a single executor to ease scaling. We had them at 4 executors, but there were contention/stalling/disk space issues and we tuned them down to 3 executors.

I don't have the necessary mathematical background in queueing theory to prove what the right numbers would be, but surely we can experiment with different combinations, or just mass-scale the instances from 8 to 16 or 24 vCPUs, though that might add a lot of pressure on WMCS.

As a side note, I think it is possible to change the flavor of an instance, which saves the hassle of having to provision one from scratch (note that as of July 2023 the ephemeral disk space was not resized by OpenStack: T340825, so the target flavor should have the same ephemeral space as the original flavor).

Most instances have g4.cores8.ram24.disk20.ephemeral90.4xiops and 1059 has g4.cores16.ram48.disk20.ephemeral90.4xiops. That is double the memory usage (but IIRC WMCS has a lot of free memory) and double the vCPU usage (I don't know WMCS's free capacity in that regard).