- https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55/339/console
- https://integration.wikimedia.org/ci/job/composer-hhvm/187/console
- https://integration.wikimedia.org/ci/job/mwext-Echo-testextension-php55/18/console
- https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm/50030/console
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| contint: lower tmpfs from 512MB to 256MB | operations/puppet | production | +2 -2 |
| Status | Assigned | Task |
|---|---|---|
| Duplicate | None | T126615 MySQL down on integration-slave-trusty-(1020\|1021) |
| Resolved | hashar | T126545 CI trusty slaves running out of memory |
| Resolved | Andrew | T126557 Bump labs quota for 'integration' project |
| Resolved | hashar | T126594 Disable HHVM fcgi server on CI slaves |
| Resolved | ori | T126658 /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so no such file or directory |
Event Timeline
+1 thank you (and then we can create more nodes if that reduction in slots is too much)
The new Trusty slaves have been created with the ci.medium flavor, which has 2 CPUs and 2GB of RAM. They have been pooled with 2 executors each. So we have 2GB of RAM shared by the system and up to two jobs. Turns out it is not enough.
From a quick chat with Timo, it seems we will want to spawn way more instances of that type and only have one executor per instance. One executor = one instance is how Nodepool instances are set up. That will also get rid of an issue we have where jobs are somehow allocated to an instance by Gearman but cannot run because they are throttled to one per node.
So in short:
- add six more slaves
- change the six we have created recently to have one executor instead of two
The setup doc @Krinkle wrote is https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup#integration-slave-.7Btype.7D-XXXX
No mistake @Krinkle. I really thought 2GB would be enough to run two of our jobs in parallel .... :} salt/puppet etc. randomly kicking in must consume what is left and end up exhausting the memory :(
I have changed the six slaves we have created to have only one executor.
We are out of quota though:
| Resource | Used / Quota | Status |
|---|---|---|
| Cores | 81 / 85 | <--- |
| RAM (MB) | 151552 / 204800 | ok |
| Instances | 29 / 29 | <--- |
For 8 more executor slots we would need 8 ci.medium instances, i.e.:
| Resource | 1 ci.medium | 8 ci.medium |
|---|---|---|
| CPU | 2 | 16 |
| RAM | 2GB | 16GB |
| Disk | 40GB | 320GB |
We would need:
- CPU quota raised to 97 (81 cores in use + 16 new, i.e. the current limit of 85 + 12)
- instance quota raised to 37 (29 + 8)
Not sure whether labs infra can handle it.
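For reference, a minimal sketch of that quota arithmetic (the numbers are simply copied from the quota output and the flavor table above; nothing here queries the actual labs API):

```python
# Back-of-the-envelope check of the quota request, using the
# numbers from the quota output and the ci.medium flavor above.
CORES_USED, CORES_QUOTA = 81, 85
INSTANCES_USED, INSTANCES_QUOTA = 29, 29

NEW_INSTANCES = 8        # 8 more ci.medium slaves
CPU_PER_INSTANCE = 2     # ci.medium: 2 CPU, 2GB RAM

cores_needed = CORES_USED + NEW_INSTANCES * CPU_PER_INSTANCE  # 81 + 16 = 97
instances_needed = INSTANCES_USED + NEW_INSTANCES             # 29 + 8 = 37

print(f"CPU quota: {CORES_QUOTA} -> {cores_needed} (+{cores_needed - CORES_QUOTA})")
print(f"Instance quota: {INSTANCES_QUOTA} -> {instances_needed} (+{instances_needed - INSTANCES_QUOTA})")
```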
Additionally we have a couple of tmpfs mounts:
| Mount point | Size |
|---|---|
| /var/lib/mysql | 256 MB |
| /mnt/home/jenkins-deploy/tmpfs | 512 MB |
That does not help on the ci.medium instances, which only have 2GB ...
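As a rough sketch of the budget, assuming the worst case where both tmpfs mounts fill up completely (tmpfs only consumes RAM for pages actually written, so this is an upper bound):

```python
# Worst-case RAM budget on a ci.medium slave if both tmpfs
# mounts fill up completely (sizes from the table above).
TOTAL_RAM_MB = 2048
TMPFS_MB = {
    "/var/lib/mysql": 256,
    "/mnt/home/jenkins-deploy/tmpfs": 512,
}

left = TOTAL_RAM_MB - sum(TMPFS_MB.values())
print(f"RAM left for the OS and up to two jobs: {left} MB")  # 1280 MB
```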
Depooling them from Jenkins
Change 269880 had a related patch set uploaded (by Hashar):
contint: lower tmpfs from 512MB to 200MB
I have disabled puppet, cherry-picked the patch lowering tmpfs to 128MB, and pooled back integration-slave-trusty-1009 and integration-slave-trusty-1010 with a single executor.
Now have to monitor them and see what happens.
All slaves now have 128MB tmpfs instead of 512MB. I pooled back the various ci.medium slaves we created yesterday.
@hashar since you're getting rid of those instances, does that mean the load will get high again, or will they be replaced with ones that have more memory?
The CI slaves we added yesterday do not have enough memory. An example of Linux triggering the OOM killer:
[Thu Feb 11 16:59:51 2016] Killed process 27671 (php5) total-vm:1184088kB, anon-rss:765928kB, file-rss:920kB
That doesn't fit.
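To put that log line in perspective, a small sketch converting the kernel's kB figures to MB (the values are copied from the line above):

```python
# Sizes from the OOM killer line above, converted to MB.
total_vm_kb = 1184088  # virtual size of the killed php5 process
anon_rss_kb = 765928   # anonymous resident memory

print(f"php5 virtual size: {total_vm_kb / 1024:.0f} MB")  # ~1156 MB
print(f"php5 resident set: {anon_rss_kb / 1024:.0f} MB")  # ~748 MB
# A single such process needs ~750MB resident; on a 2GB instance
# that must also fit the OS, MySQL, and the tmpfs mounts.
```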
I am thus depooling and deleting ALL the ci.medium instances we created yesterday. 2GB is not enough for MediaWiki-related tests.
Fixed by deleting all the ci.medium instances. There are still blockers, but they are not really blocking anymore since the instances are gone :-}
This is definitely solved. Here is the summary:
On Feb 10th we pooled ci.medium instances that only have 2GB of RAM. That was to accommodate the large shift of jobs from Precise to Trusty for php55 (see T126423).
We noticed a bunch of issues, and eventually I depooled them at midnight and went to bed.
The day after, Feb 11th, during European business hours, I kept the slaves around to investigate/monitor/take traces/swear. Eventually, around 17:30 UTC, I depooled and deleted them all.
I then created a bunch of m1.large slaves (8GB RAM, 4 CPUs) and finished provisioning them after dinner. I encountered a minor issue, but by 21:30 UTC all six new m1.large slaves were operational.
UbuntuTrusty label in Jenkins for the last 24 hours:
We went from 32 executors to 56 :-}
