
Raise quota for integration project
Closed, Resolved · Public

Description

Project Name: integration
Type of quota increase requested: cpu/ram/instances
Amount of quota increase: more! (to be discussed)

Reason:

The integration WMCS project hosts instances that run CI jobs. It has a large resource consumption and typically spikes in CPU usage during busy hours (European evening / US west coast morning). But you all know about that already.

We could use a slight quota increase to have more executors available and thus be able to run more jobs in parallel. I had been holding off on this request until:

  • we dropped php7.0/php7.1 support for the mediawiki master branches, since those jobs consumed a lot of resources;
  • I started rebuilding the fleet to use instances with less RAM (from 36G to 24G), T226233.

We also had some small dedicated instances added which count against the quota, though they are not always busy:

  • integration-trigger is a workaround for Zuul limitations and just runs idling jobs that trigger the actual jobs on other instances. It is only 1 vCPU / 2G RAM.
  • integration-agent-puppet-docker-1001 is 8 vCPUs / 24G RAM and only runs jobs for the operations/puppet repository.

The bulk of the instances are mediumram flavor:

| vCPUs | RAM | Disk |
| --- | --- | --- |
| 8 | 24G | 80G |

They can each run up to four jobs concurrently, and each job could potentially use more than 1 CPU (e.g. a mediawiki run needs CPU for chrome / mysql / mediawiki).
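To put rough numbers on that, here is a minimal back-of-the-envelope sketch of how a single mediumram agent gets loaded; the ~2 vCPUs per job figure is an assumption for illustration, not a measured value:

```python
# Rough per-agent load model for the mediumram flavor (illustrative only;
# CPUS_PER_JOB_PEAK is an assumed figure, not a measurement).
MEDIUMRAM_VCPUS = 8
MEDIUMRAM_RAM_GB = 24
JOBS_PER_INSTANCE = 4   # concurrent executor slots per agent
CPUS_PER_JOB_PEAK = 2   # assumption: chrome + mysql + mediawiki can keep ~2 vCPUs busy

peak_cpu_demand = JOBS_PER_INSTANCE * CPUS_PER_JOB_PEAK
print(f"peak CPU demand per agent: ~{peak_cpu_demand} of {MEDIUMRAM_VCPUS} vCPUs")
print(f"RAM available per job slot: {MEDIUMRAM_RAM_GB / JOBS_PER_INSTANCE:.0f}G")
```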

If there is any capacity available on WMCS, it would be great to raise the quota to allow some more mediumram instances. They will be heavily used at certain times of the day, though, and might end up putting too much stress on the WMCS infrastructure. I am willing to receive as large a quota increase as possible, but I do not want WMCS infra to die as a result.

So I guess the easiest is to chat about it?

Event Timeline

To add some more data/justification for this task: the number of Gerrit repositories has grown:

gerrit-project-count.png (480×640 px, 5 KB)

The number of commits has grown:

gerrit-new-patchsets-year.png (437×1 px, 21 KB)

We got a spike of mediawiki/core patches right after the hackathon due to ¯\_(ツ)_/¯:

jenkins-builds-mw-core.png (480×640 px, 7 KB)

This spike coincided with some long runtimes for a few repos, which slowed down everyone working on code, since every test job we run uses the same limited pool of workers:

jenkins-p95-mw.png (480×640 px, 8 KB)

^ this graph is *just jenkins*; Zuul's wait for a worker plus the wait for dependent changes to merge was even longer during that time period:

gate-and-submit-resident-time-2019-08-22_2019-08-24.png (331×700 px, 39 KB)

The combination of more repos + more patches + more jobs, all using the same pool of workers we had a few months ago, led to a lot of backup in the system.

We're running close to capacity; the late-August slowdowns would have been mitigated (somewhat, although there were/are other issues) by additional capacity.

Current quota/utilization:
22 / 37 instances. 125 / 130 VCPUs. 434.0 GB / 458.984375 GB RAM.

Similarly large projects:

  • deployment-prep: 77 / 100 instances. 191 / 200 VCPUs. 366.0 GB / 415.203125 GB RAM.
  • tools: 152 / 158 instances. 603 / 625 VCPUs. 1206.0 GB / 1281.73828125 GB RAM.
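As a quick back-of-the-envelope check (using only the integration numbers above, and the 8 vCPU / 24G mediumram flavor described in this task), here is how much headroom the current quota leaves:

```python
# Headroom under the current integration quota (numbers copied from above).
quota = {"instances": 37, "vcpus": 130, "ram_gb": 458.984375}
used = {"instances": 22, "vcpus": 125, "ram_gb": 434.0}
mediumram = {"instances": 1, "vcpus": 8, "ram_gb": 24}

headroom = {k: quota[k] - used[k] for k in quota}
fit = min(int(headroom[k] // mediumram[k]) for k in mediumram)
print(headroom)  # {'instances': 15, 'vcpus': 5, 'ram_gb': ~25}
print(f"additional mediumram instances that fit: {fit}")  # 0 -- the vCPU quota is the blocker
```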

The relatively large RAM usage by the mediumram and bigram instances in the integration project changes how things can fit into our cloudvirt hosts (hypervisors). RAM is the only resource that we do not oversubscribe, so these instances "consume" more of the underlying host. The dashboard at https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?refresh=30s&orgId=1 gives an overview of the current OpenStack cluster. At the moment I'm typing this, our global CPU usage is 138% and our RAM usage is 44%.


@hashar / @thcipriani can you make a more concrete ask than "whatever you can spare"? How many more mediumram instances would you ideally like to have? How many existing instances (if any) can you get rid of? Do you have other wish list items for an ideal setup like "must be on SSD" or "must have 10G network"?

> The relatively large RAM usage by the mediumram and bigram instances in the integration project changes how things can fit into our cloudvirt hosts (hypervisors). RAM is the only resource that we do not oversubscribe, so these instances "consume" more of the underlying host.

I remember the conversations that followed WMCS overflowing a few years ago. The root cause is that Nova never got taught how to use memory ballooning with libvirt/kvm (though it is available for other hypervisors). I blame OpenStack for defaulting ram_allocation_ratio to 1.5 (a 50% overcommit). But I digress.

> The dashboard at https://grafana.wikimedia.org/d/000000579/wmcs-openstack-eqiad1?refresh=30s&orgId=1 gives an overview of the current OpenStack cluster. At the moment I'm typing this, our global CPU usage is 138% and our RAM usage is 44%.

That is a nice dashboard, very helpful. Discounting cloudvirt1014 and cloudvirt1028, which have more than 90% memory usage, the next four at ~70% memory usage still have ~100G free each and could potentially fit four 24G instances apiece. Overall there is 6.5 TB of free memory, which sounds like the resource is abundant.

The CPU oversubscription could be a concern though. If I got it right, we oversubscribe by a ratio of 4 (cpu_allocation_ratio=4.0). That is usually fine, assuming most VMs are idling with an occasional CPU spike. With CI, the CPU demand kind of defeats the oversubscription: an instance with 8 vCPUs might well have 6 of them busy, and that single instance could represent 25% of the real CPUs on the machine. But I guess that can be dealt with by spreading the CI instances across the cloudvirts.
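A small sketch of that oversubscription arithmetic; the physical core count below is an assumption chosen so the numbers line up with the 25% estimate, not the actual cloudvirt hardware:

```python
# Illustrative oversubscription arithmetic (PHYSICAL_CORES is an assumption,
# chosen to match the "25%" estimate above, not the real cloudvirt spec).
CPU_ALLOCATION_RATIO = 4.0   # nova cpu_allocation_ratio
PHYSICAL_CORES = 24          # assumed core count for the example host

schedulable_vcpus = PHYSICAL_CORES * CPU_ALLOCATION_RATIO  # what Nova may place on the host
busy_vcpus = 6               # one CI agent keeping 6 of its 8 vCPUs busy

print(f"vCPUs Nova can schedule on this host: {schedulable_vcpus:.0f}")            # 96
print(f"one busy CI agent = {busy_vcpus / PHYSICAL_CORES:.0%} of the real cores")  # 25%
```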

We still have 7 bigram instances, for a consumption of 7 * 36 GB = 252 GB. Since they were created we have lowered the number of concurrent builds from 6 to 4 per instance, since the jobs are typically CPU bound. That in turn means we need less memory, and since I am well aware that the whole memory of an instance is actually reserved on the cloudvirt, I started migrating them to mediumram (24G). That will save 7 * (36G - 24G) = 84G (T226233), bringing the project down from 434GB to 350GB of memory usage.

The whole list:

| Instance Name | VCPUs | RAM (GB) | Disk (GB) | Description |
| --- | --- | --- | --- | --- |
| integration-agent-docker-1001 | 8 | 24 | 80 | New, Stretch based T226233 |
| integration-agent-docker-1002 | 8 | 24 | 80 | " |
| integration-agent-docker-1003 | 8 | 24 | 80 | " |
| integration-agent-docker-1004 | 8 | 24 | 80 | " |
| integration-agent-docker-1005 | 8 | 24 | 80 | " |
| integration-agent-puppet-docker-1001 | 8 | 24 | 80 | Keep: for operations/puppet |
| integration-castor03 | 8 | 16 | 160 | Keep: Central cache, to be migrated to a faster cloudvirt (T232646) |
| integration-cumin | 1 | 2 | 20 | Keep: Cumin |
| integration-puppetmaster01 | 1 | 2 | 20 | Keep: Puppet master, or merge with cumin |
| integration-slave-docker-1048 | 8 | 36 | 80 | Legacy, to be migrated to mediumram T226233 |
| integration-slave-docker-1050 | 8 | 36 | 80 | " |
| integration-slave-docker-1051 | 8 | 36 | 80 | " |
| integration-slave-docker-1052 | 8 | 36 | 80 | " |
| integration-slave-docker-1054 | 8 | 36 | 80 | " |
| integration-slave-docker-1058 | 8 | 36 | 80 | " |
| integration-slave-docker-1059 | 8 | 36 | 80 | " |
| integration-slave-jessie-1001 | 2 | 4 | 40 | Legacy for Debian packages |
| integration-slave-jessie-1002 | 2 | 4 | 40 | " |
| integration-slave-jessie-1004 | 2 | 4 | 40 | " |
| integration-trigger-01 | 1 | 2 | 20 | Might be moved to contint1001, unsure about security though |
| saucelabs-01 | 2 | 2 | 40 | For ruby daily browser tests |
| saucelabs-02 | 2 | 2 | 40 | Idem, can probably be deleted |

We can probably drop integration-slave-jessie-1004 and saucelabs-02, though they are rather small and offer some redundancy for the jobs being built on them.

The bulk of the integration-slave-docker instances are being rebuilt to save 84G of RAM.


> @hashar / @thcipriani can you make a more concrete ask than "whatever you can spare"? How many more mediumram instances would you ideally like to have? How many existing instances (if any) can you get rid of? Do you have other wish list items for an ideal setup like "must be on SSD" or "must have 10G network"?

Eventually we will have 12 mediumram instances able to run up to 48 jobs concurrently for a total usage of:

| RAM | vCPU | Disk |
| --- | --- | --- |
| 288G | 96 | 960G |

I would like to aim at progressively doubling that pool, but stop earlier once the level of service is deemed good enough (i.e. jobs start fast enough / the queue stays reasonably low).
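To double-check the pool totals above and see what progressively doubling the pool would amount to, a quick sketch (the flavor figures come from this task; the doubled size is only the stated aspiration, not a commitment):

```python
# Pool totals for the target of 12 mediumram agents, and for a doubled pool.
MEDIUMRAM = {"vCPU": 8, "RAM_GB": 24, "Disk_GB": 80}
JOBS_PER_INSTANCE = 4

def pool_totals(n: int) -> dict:
    """Aggregate job slots and resource usage for n mediumram agents."""
    totals = {k: n * v for k, v in MEDIUMRAM.items()}
    totals["concurrent_jobs"] = n * JOBS_PER_INSTANCE
    return totals

print(pool_totals(12))  # 96 vCPU, 288G RAM, 960G disk, 48 jobs -- matches the table above
print(pool_totals(24))  # the doubled end state, if service levels still call for it
```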

Then, reviewing the dashboards:

There is less pressure on CI right now due to other actions (dropping some old php versions, running slightly fewer tests, etc.). So I guess we should just add 4 mediumram instances (from 12 to 16), or an increase of:

| vCPU | RAM (GB) |
| --- | --- |
| 32 | 96 |

Given that 84G of RAM is going to be saved by rebuilding the bigram machines, adding four instances would only need 32 vCPUs and a net 12G of RAM added to the quota.
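A quick sketch of that net-increase arithmetic, using the figures from this thread:

```python
# Net quota increase for adding four mediumram instances, offset by the RAM
# freed when the seven bigram agents are rebuilt as mediumram (T226233).
added_vcpus = 4 * 8                      # 32 vCPUs for four more mediumram agents
added_ram_gb = 4 * 24                    # 96G gross
ram_saved_by_rebuild_gb = 7 * (36 - 24)  # 84G freed by the rebuild

net_ram_gb = added_ram_gb - ram_saved_by_rebuild_gb
print(f"vCPU quota increase needed: {added_vcpus}")     # 32
print(f"net RAM quota increase needed: {net_ram_gb}G")  # 12G
```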

"must be on SSD"

I think the jobs are typically CPU bound rather than disk IO bound, but I could check. Maybe some jobs could benefit from using SSD.

or "must have 10G network"?

I had that concern for the central cache, but found out its egress traffic was being shaped. I have raised the traffic shaping value and there is no more issue for now :)

Andrew claimed this task.
Andrew subscribed.

I put a new cloudvirt online yesterday and boosted your quotas. If things get scheduled on HDD systems and you need them moved, just let me know.

I have added four more mediumram instances (integration-agent-docker-[1013-1016]).