
Add some integration executors to spread the load
Closed, Resolved (Public)

Description

CPU load on integration machines is high. It has been high during periods of high use for a long time.

Let's lower the number of executors per node and add new nodes to see if that helps. We already have the needed quota.

Current:

  • 18 nodes
    • 8vCPU
    • 24GB Ram
    • 110 Disk
  • 4 executors per node

Remaining quota:

  • 104 vCPU
  • 286 GB Ram

What I'd like to do:

  • 24 nodes
    • 8vCPU
    • 24GB Ram
    • 110 Disk
  • 3 executors per node

Needed (additional) capacity:

  • 48 vCPU ✅
  • 144GB Ram ✅
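
The arithmetic behind the two checkmarks can be sketched as follows. This is only an illustration using the figures from the description above; the function and variable names are made up for the example and are not part of any tooling.

```python
# Figures copied from the task description; names are illustrative only.
NODE_VCPU = 8
NODE_RAM_GB = 24
CURRENT_NODES = 18
TARGET_NODES = 24
QUOTA_VCPU = 104      # remaining quota
QUOTA_RAM_GB = 286    # remaining quota


def additional_capacity(current_nodes, target_nodes, vcpu_per_node, ram_gb_per_node):
    """Return the (vCPU, RAM in GB) needed to grow from current to target node count."""
    extra_nodes = target_nodes - current_nodes
    return extra_nodes * vcpu_per_node, extra_nodes * ram_gb_per_node


need_vcpu, need_ram_gb = additional_capacity(
    CURRENT_NODES, TARGET_NODES, NODE_VCPU, NODE_RAM_GB
)
fits_quota = need_vcpu <= QUOTA_VCPU and need_ram_gb <= QUOTA_RAM_GB

print(f"need {need_vcpu} vCPU and {need_ram_gb} GB RAM; fits quota: {fits_quota}")
```

Six extra nodes at 8 vCPU and 24 GB each give the 48 vCPU and 144 GB listed above, both under the remaining quota.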

Details

Other Assignee
brennen

Event Timeline

I'm making integration-agent-docker-(1060-1062)
Brennen is making integration-agent-docker-(1063–1065)

Mentioned in SAL (#wikimedia-releng) [2025-03-20T21:41:39Z] <brennen> integration: launched integration-agent-docker-106{3,4,5} (T389554)

Mentioned in SAL (#wikimedia-releng) [2025-03-20T22:50:38Z] <brennen> integration: added jenkins nodes for integration-agent-docker-106{3,4,5} with 3 executors per each (T389554)

How long is a long time?

At least 30 days, which is as far back as our data goes. Looking at the pressure stall metrics, though, it is getting worse.

It looks like these are picking up jobs. Will have to monitor and make sure they don't blow up in some unusual way.

thcipriani claimed this task.

Whilst I failed to log mine (unlike @brennen :D), mine are also launched.

I also lowered the number of executors on all other instances to 3.

We have the same number of executors, but spread across more machines now. Updated our instance creation runbook to say Docker agents get 3 executors.

Calling this complete.

Mentioned in SAL (#wikimedia-releng) [2025-03-20T23:31:41Z] <bd808> integration: thcipriani added integration-agent-docker-106{0,1,2} earlier today (T389554)

I have checked Jenkins' built-in graph for the Docker label and I can barely see a rise in the number of executors. The graph shows we had 68 and now have 73. Brennen and Tyler each created three instances, for a total of six, so we should have a total of 24 more executors.

Am I confused, or were those new instances created as replacements for old instances that have been deleted?

See https://integration.wikimedia.org/ci/label/Docker/load-statistics : the green line is the number of online executors.

jenkins_executors_label_docker.png (519×627 px, 174 KB)

> Am I confused, or were those new instances created as replacements for old instances that have been deleted?

> I also lowered the number of executors on all other instances to 3.

At the same time as the 6 new nodes were added, the per-node concurrency was lowered from 4 to 3. The intent was to lower per-core CPU utilization across the group by adding more cores and reducing the number of parallel jobs that could hit each core.
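
As a rough illustration of that trade-off, the sketch below compares the configured totals before and after the change. It assumes every node has 8 vCPUs, as in the task description; the function name is hypothetical, and the numbers are configured capacity, not what Jenkins reports as online at any given moment.

```python
def executor_summary(nodes, executors_per_node, vcpu_per_node=8):
    """Summarize configured executors and vCPUs for a homogeneous pool of nodes."""
    total_executors = nodes * executors_per_node
    total_vcpus = nodes * vcpu_per_node
    return {
        "executors": total_executors,
        "vcpus": total_vcpus,
        "executors_per_vcpu": total_executors / total_vcpus,
    }


before = executor_summary(nodes=18, executors_per_node=4)
after = executor_summary(nodes=24, executors_per_node=3)

print("before:", before)
print("after: ", after)
```

Under these assumptions the total number of executors is unchanged, while the number of executors per vCPU drops, which is why the executor graph barely moves even though six nodes were added.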

> At the same time as the 6 new nodes were added, the per-node concurrency was lowered from 4 to 3.

Somehow I missed that :)

> The intent was to lower per-core CPU utilization across the group by adding more cores and reducing the number of parallel jobs that could hit each core.

I think that should be reflected in less pressure being put on the nodes. When we worked on the parallelization of PHPUnit tests, CPU pressure on the nodes was my concern. I discovered [ https://docs.kernel.org/accounting/psi.html | Linux Pressure Stall Information ], which is a finer-grained version of the old load average measurement. I added a graph for it a while back, and yesterday I added another one so we can easily find the aggregated CPU pressure.
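
For anyone who wants to look at the raw numbers on a single host, the kernel exposes PSI as text files under /proc/pressure/ (cpu, memory, io). The snippet below is a minimal sketch of parsing that format; it is not how the graphs mentioned above are produced, the helper names are made up, and the sample values are invented for illustration.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Read and parse CPU pressure on a Linux host with PSI enabled."""
    return parse_psi(Path(path).read_text())


# Invented sample in the kernel's documented format, so the example
# also runs on machines without PSI.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.25 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)

parsed = parse_psi(SAMPLE)
print(parsed["some"]["avg10"])
```

The avg10, avg60 and avg300 fields are percentages of time stalled over the last 10, 60 and 300 seconds, and total is the cumulative stall time in microseconds.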

Tentatively:

If spreading the load works, we should see less pressure. Below is a view over 8 days. Taking Monday to Thursday as a reference baseline, Monday and Tuesday of this week seem to have less pressure:

integration_agent_cpu_pressure_8days.png (586×921 px, 110 KB)

There is also less IO and memory stalling (= all non-idle tasks stalled on either resource)