
Bump quota of Nodepool instances (contintcloud tenant)
Closed, Resolved · Public

Description

Nodepool is currently limited to 12 instances. I would like to get it raised to 20 instances.

That will let us migrate the Zend 5.5 / HHVM jobs that are currently running on Ubuntu Trusty. An example load is F4708299 (live link), which seems to indicate that 5 instances will cover it.

I am adding a couple more to help with the contention we have observed during peak hours (SF morning / Europe evening) and to reach a round number of 20 instances.

We have already deleted 9 m1.large instances from the pool of permanent slaves (T148183) and will be able to delete a couple more once the HHVM/PHP jobs are moved.

We spawn m1.medium instances, which have:

RAM:   4 GB
vCPU:  2
Disk: 40 GB

The Nodepool limit (max-server) would be bumped from 12 to 20. On the OpenStack side, the instance quota has to take into account the automatic refresh of snapshot images, i.e. two more instances.

Metric                 Current   New
Nodepool max-server         12    20
OpenStack quota:
  Instances                 15    22
  RAM                      100G  100G
  VCPU                       40    44
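
To make the arithmetic behind those numbers explicit, here is a minimal sketch (purely illustrative, not any existing tooling; the flavor figures and the two snapshot-refresh slots are taken from the description above):

```
# Quota arithmetic for the contintcloud tenant, using the figures above.
# Illustrative only; this is not part of Nodepool or OpenStack tooling.

FLAVOR = {"ram_gb": 4, "vcpu": 2, "disk_gb": 40}  # m1.medium

nodepool_max_servers = 20   # new max-server value in Nodepool
snapshot_refresh_slots = 2  # instances busy refreshing snapshot images

instances_quota = nodepool_max_servers + snapshot_refresh_slots  # 22
vcpu_quota = instances_quota * FLAVOR["vcpu"]                    # 44
ram_needed_gb = instances_quota * FLAVOR["ram_gb"]               # 88, fits the 100G quota

print(f"instances={instances_quota} vcpu={vcpu_quota} ram={ram_needed_gb}GB")
```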

There might be a concern about disk space consumption, though as I understand it the disks are copy-on-write and space is not actually consumed until an instance fills its disk.

Looking at the instances:

Trusty

Filesystem   Size  Used  Avail  Use%  Mounted on
/dev/vda1     38G  2.4G    34G    7%  /

Jessie

Filesystem   Size  Used  Avail  Use%  Mounted on
/dev/vda1     38G  3.6G    33G   11%  /

Event Timeline

hashar created this task. Apr 28 2016, 2:29 PM

Change 285957 had a related patch set uploaded (by Hashar):
nodepool: bump # of instances

https://gerrit.wikimedia.org/r/285957

Note: as we migrate jobs to run on contintcloud, we will delete some instances from the integration tenant :-}

hashar added a subscriber: Andrew. Apr 29 2016, 3:52 PM

We have 14 Trusty instances on the historical CI https://integration.wikimedia.org/ci/label/UbuntuTrusty/ . With the quota bump on Nodepool we will migrate jobs to those disposable instances and hence be able to reduce the number of historical slaves.

So the quota bump is a short-term spike in capacity usage; as the migration goes on, the legacy instances will be removed.

Note: @Andrew is attending a conference.

hashar triaged this task as Normal priority. May 2 2016, 4:50 PM

There is no urgency for now. Nodepool keeps 2 Trusty instances as a base pool and will spawn more based on demand.

The limit of 20 instances is barely reached currently, so the quota bump can wait till @Andrew is back around :-)

Andrew claimed this task. May 17 2016, 9:13 PM

Jenkins is getting really slow now.

15-20mins for testing and merges.

Could you up the priority please? And possibly raise the number of instances too, please?

Since we converted many jobs to Nodepool and don't have lots of instances, Jenkins is slowing down.

Zuul (since 2.1.0-95) now measures the time for a build to actually start on a node. That represents how long it took for Jenkins to assign the build to an executor, i.e. for an instance to become available.

For the last 24 hours, with Jessie in yellow and Trusty in blue:

Seems to show we need more nodes to accommodate the ongoing demand.

The graph is at https://grafana.wikimedia.org/dashboard/db/releng-zuul?panelId=18&fullscreen

Mentioned in SAL [2016-06-23T13:36:22Z] <hashar> CI is slowed down due to surge of jobs and lack of instances to build them on ( T133911 ). Queue is 50 for Jessie and 25 for Trusty.

hashar reopened this task as Open. Aug 14 2016, 9:35 PM

The quota was lowered from 20 to 10 instances as an emergency measure on July 4th because the labs infrastructure had RAM issues. Bumping the quota back to what it used to be is tracked in T142877.

This task (T133911) is to get the quota bumped past 20 instances so we can migrate the rest of the jobs from the integration project, which hosts permanent slaves, to the contintcloud project.

The original proposal was to bump from 20 to 40 instances. After discussion with Cloud-Services, we would like to have better facts to help scale the pool. Metrics would help: T139771: Identify metric (or metrics) that gives a useful indication of user-perceived (Wikimedia developer) service of CI.

One thing is sure: we need more than 20 instances to accommodate the migration and bursts of jobs during peak hours.

Change 285957 abandoned by Hashar:
nodepool: bump # of instances

Reason:
That is being discussed on Phabricator task.

https://gerrit.wikimedia.org/r/285957

hashar updated the task description.

Self note to check the quota:

ssh labnodepool1001.eqiad.wmnet sudo -iH -u nodepool  nova absolute-limits
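
As a small convenience, here is a sketch that simply wraps the command above with Python's subprocess module (nothing Nodepool-specific; it assumes whatever SSH access the note implies):

```
#!/usr/bin/env python3
"""Illustrative wrapper around the self-note command: run
`nova absolute-limits` as the nodepool user on labnodepool1001
and print the output to eyeball the current quota."""
import subprocess

# Exact command from the note above; assumes working SSH access to the host.
CMD = [
    "ssh", "labnodepool1001.eqiad.wmnet",
    "sudo", "-iH", "-u", "nodepool", "nova", "absolute-limits",
]

result = subprocess.run(CMD, capture_output=True, text=True, check=True)
print(result.stdout)
```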

Change 322270 had a related patch set uploaded (by Hashar):
nodepool: bump max server from 12 to 20

https://gerrit.wikimedia.org/r/322270

Crafted the puppet patch and poked about it:


Hello,

I have created the puppet patch to bump the number of Nodepool instances from 12 to 20, leaving the rate unchanged.
https://gerrit.wikimedia.org/r/#/c/322270/

From https://phabricator.wikimedia.org/T133911 the quotas to be bumped for the 'contintcloud' tenant are:

Instances: 12 -> 22 (20 instances + 2 snapshot)
vCPU:      40 -> 44 (22 from above x 2 CPU)

RAM is at 100G, which is enough (we need 88 GB).

Not sure if Friday is a good option, would be nice to have that done early next week :-]

Talked with Chase about it. This week was not possible since it is Thanksgiving in the USA, leaving little opportunity to monitor the impact on the wmflabs infrastructure. We will do it next Tuesday, Nov 29th.

Change 322270 merged by Andrew Bogott:
nodepool: bump max server from 12 to 20

https://gerrit.wikimedia.org/r/322270

hashar closed this task as Resolved. Nov 29 2016, 3:13 PM
hashar claimed this task.

Nodepool loaded the new configuration and the OpenStack quotas have been bumped to match the new reality. If there is trouble, the easiest remedy is to lower max-server in the nodepool.yaml configuration file. Nodepool reads the file automatically, so there is no need to restart it.
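
As a rough illustration of that rollback knob, here is a minimal sketch (assuming the usual Nodepool layout where a top-level providers list carries a per-provider max-servers key, and a hypothetical /etc/nodepool/nodepool.yaml path) that reads the file and reports the value one would lower:

```
#!/usr/bin/env python3
"""Illustrative only: print each provider's max-servers from
nodepool.yaml, the knob to lower in an emergency."""
import yaml  # PyYAML

CONFIG_PATH = "/etc/nodepool/nodepool.yaml"  # assumed location

with open(CONFIG_PATH) as fh:
    config = yaml.safe_load(fh)

# Each provider entry typically carries its own max-servers cap.
for provider in config.get("providers", []):
    print(f"{provider.get('name')}: max-servers = {provider.get('max-servers')}")
```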