
For contintcloud either add RAM or Swap to the instances
Closed, Resolved · Public

Description

We have an issue with the MediaWiki tests when running them on the 4GB RAM instances. We would need either RAM or Swap to be added to the Nodepool image.

Project Name: contintcloud
Private Flavor: 6G RAM / 2 vCPU / 40G disk


Context

When running the long PHPUnit tests on Trusty / Zend PHP 5.5, the tests fail to execute external commands with: fork() - Cannot allocate memory. That happens when PHP memory usage reaches roughly 1900 MBytes (possibly earlier).

PHP invokes fork(), which on Linux is implemented via clone(). It uses copy-on-write to avoid duplicating memory pages, but the Linux virtual memory manager still verifies whether the forked process would be able to consume all the RAM it is allocated. The host already has 2500+ MBytes in use, and asking for 1900 MBytes more results in ENOMEM.
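
For reference, the accounting that triggers this can be inspected directly on an instance. A minimal check using the standard sysctl / procfs interfaces (nothing CI-specific assumed):

    # Current overcommit policy and ratio (0 = heuristic, 1 = always allow,
    # 2 = strict accounting)
    sysctl vm.overcommit_memory vm.overcommit_ratio

    # Committed address space versus the limit enforced under strict accounting
    # (CommitLimit = SwapTotal + overcommit_ratio% of MemTotal). A fork() that
    # would push Committed_AS past the limit fails with ENOMEM.
    grep -E 'MemTotal|SwapTotal|CommitLimit|Committed_AS' /proc/meminfo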

Nodepool uses an m1.medium flavor: 2 vCPUs, 40G disk and 4G RAM.

We have no straightforward way to add a swap partition to the image that is created. I have been digging into the diskimage-builder utility but have not found any easy way to do so.

Request

The low-hanging fruit would thus be to create a new flavor with 6GB RAM. Since the issue only occurs on Zend 5.5 / Trusty, in Nodepool we would only change snapshot-ci-trusty and then rebuild the snapshot. E.g. 6G RAM / 2 vCPU / 40G disk.
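
For what it is worth, creating such a flavor is a single command for an OpenStack admin. A rough sketch, with a made-up flavor name and the sizes from the request above:

    # Sketch only: the flavor name is a placeholder
    openstack flavor create --private --ram 6144 --vcpus 2 --disk 40 ci.m1.medium.6g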

Alternatively, instead of adding more RAM, we could create a flavor with less disk (40G -> 35G) and a 5G swap partition. I have no idea how the boot would manage to find it, though; it could potentially be labelled disk.swap. That can be experimented with later on.
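
If we ever experiment with that, the usual trick would be to label the partition as swap when the image is built and reference the label from fstab. A minimal sketch, assuming the swap partition shows up as /dev/vdb (the device path and the label are assumptions):

    # Label the spare partition as swap (device path is an assumption)
    mkswap -L disk.swap /dev/vdb
    # Let the instance pick it up at boot via its label
    echo 'LABEL=disk.swap none swap sw 0 0' >> /etc/fstab
    swapon -a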

Event Timeline

hashar created this task. Apr 4 2017, 5:03 PM
hashar updated the task description.
hashar updated the task description. Apr 4 2017, 5:08 PM
Addshore removed a subscriber: Addshore. Apr 4 2017, 6:46 PM
Krinkle removed a subscriber: Krinkle. Apr 5 2017, 2:27 AM
chasemp added a subscriber: chasemp. Apr 5 2017, 6:42 PM

Using swap seems like a non-starter unless we are totally devoid of RAM capacity, in which case we have a lot more problems. I don't want to start conflating IO load with RAM issues on CI, and it will only hurt the tests themselves. For the record, we do not overallocate RAM, as it has caused nothing but issues in the past.

A few questions:

  • Why did this start happening? What changed? (Best I can find is https://phabricator.wikimedia.org/T125050#3140483 which seems recursive)
  • Do we need 40G of / for these instances and why?
  • Do we have the ability to target certain flavors at certain tests? (as it seems a bit naive on our part to say that the only available instance size is the one big enough to run our biggest test which will also be used to run our smallest)
hashar added a comment. Apr 5 2017, 7:59 PM

Using swap seems like a non-starter unless we are totally devoid of RAM capacity, in which case we have a lot more problems. I don't want to start conflating IO load with RAM issues on CI, and it will only hurt the tests themselves. For the record, we do not overallocate RAM, as it has caused nothing but issues in the past.

Why did this start happening? What changed? (Best I can find is https://phabricator.wikimedia.org/T125050#3140483 which seems recursive)

As code and tests are added, we eventually reached a limit where there is sometimes not enough memory to allow the fork() to happen. Adding swap would solve that issue, since Linux would then have a guarantee that the forked process would eventually be able to allocate the whole copy-on-write memory.

I definitely agree that if swap ends up being used (and surely it would be), the IO would be concerning. So I guess we can dismiss the idea of adding swap entirely.

Do we need 40G of / for these instances and why?

That has been set arbitrarily; usage on idling instances is 3G / 41G for Trusty and 4.8G / 41G for Jessie. So I guess we can go from 40G down to 10G and we will be fine.

Do we have the ability to target certain flavors at certain tests? (as it seems a bit naive on our part to say that the only available instance size is the one big enough to run our biggest test which will also be used to run our smallest)

Yes, that is definitely doable in Nodepool.

We would create a new snapshot image that uses the same disk image (image-ci-trusty) but requires a different flavor with more RAM. Then, in Nodepool, define a new Jenkins label attached to that new instance type with some min-ready, and migrate the Jenkins job to that new label.
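
Operationally that boils down to a small nodepool.yaml change plus a snapshot rebuild. A sketch with the old-style Nodepool CLI; the provider name below is a placeholder, and the exact commands depend on the Nodepool version we run:

    # Rebuild the snapshot once the new label/flavor is in nodepool.yaml
    # (provider name is a placeholder)
    nodepool image-update wmflabs-eqiad snapshot-ci-trusty
    # Check that instances for the new label come up and reach min-ready
    nodepool list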

Given that the problem seems to be just the overcommit accounting check rather than actually needing that extra memory, this has a really simple solution:

echo 1 > /proc/sys/vm/overcommit_memory
hashar added a comment (Edited). Apr 5 2017, 8:42 PM

Oops, I forgot about overcommit_memory, which I mentioned on the parent task. From T125050#3153574:

That is probably the easiest path for that fork issue indeed, and it keeps the labs instance RAM at a sane level. The points above about lowering the disk size still stand, though. I would give it a try and follow up on T125050.

Oops, I forgot about overcommit_memory, which I mentioned on the parent task. From T125050#3153574:

That is probably the easiest path for that fork issue indeed, and it keeps the labs instance RAM at a sane level. The points above about lowering the disk size still stand, though. I would give it a try and follow up on T125050.

FWIW this currently defaults to 2, I believe.

2	-	Don't overcommit. The total address space commit
		for the system is not permitted to exceed swap + a
		configurable amount (default is 50%) of physical RAM.
		Depending on the amount you use, in most situations
		this means a process will not be killed while accessing
		pages but will receive errors on memory allocation as
		appropriate.

		Useful for applications that want to guarantee their
		memory allocations will be available in the future
		without having to initialize every page.
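
For reference, a quick way to check the current value on an instance and make the change persistent. A sketch only: the sysctl.d file name is just an example, and in practice the setting would be handled through configuration management:

    # Check what the instances currently run with
    sysctl vm.overcommit_memory
    # Apply immediately
    sysctl -w vm.overcommit_memory=1
    # Persist across reboots (file name is an example)
    echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/60-overcommit.conf
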
hashar added a comment. Apr 6 2017, 2:18 PM

Setting vm.overcommit_memory=1 fixed the fork issue (T125050). Hence there is no need to add swap or an extra 2GB of RAM to the instances.

I guess we can repurpose this task toward reducing the disk size? Otherwise we can close this.

chasemp closed this task as Resolved. Apr 6 2017, 2:20 PM
chasemp claimed this task.

Let's revisit the disk usage issue later if needed; I don't want to over-optimize this and create more problems without a reason. It would only make sense if we were already heading in that direction.