CI trusty slaves running out of memory
Closed, ResolvedPublic

Restricted Application added a subscriber: Aklapper.Feb 10 2016, 10:21 PM
Paladox set Security to None.Feb 10 2016, 10:21 PM
Paladox added a subscriber: Paladox.
greg added a subscriber: greg.Feb 10 2016, 11:03 PM

So... do we need to recreate those 6 new executors that Timo and Antoine made today?

> So... do we need to recreate those 6 new executors that Timo and Antoine made today?

Just talked on IRC with @hashar. I made the mistake of assigning two executors to these slaves. You can lower them to 1 executor per node in the Jenkins config. That should resolve this. The ci.medium labs instance flavor was proposed in T96706 with 1 executor in mind, not 2.

greg added a comment.Feb 10 2016, 11:07 PM

+1 thank you (and then we can create more nodes if that reduction in slots is too much)

The new Trusty slaves have been created with the ci.medium flavor, which has 2 CPUs and 2GB of RAM. They have been pooled with 2 executors, so 2GB of RAM is shared by the system and up to two concurrent jobs. Turns out that is not enough.

From a quick chat with Timo, it seems we will want to spawn many more instances of that type and have only one executor per instance. One executor per instance is how Nodepool instances are set up. That will also get rid of an issue where Gearman allocates a job to an instance but the job cannot run because it is throttled to one per node.

So in short:

  • add six more slaves
  • change the six we have created recently to have one executor instead of two

The setup doc @Krinkle wrote is https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup#integration-slave-.7Btype.7D-XXXX
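For the executor change itself, here is a minimal sketch of what lowering a node to one executor through the Jenkins HTTP API could look like (the base URL, credentials and this exact approach are assumptions for illustration, not what was actually run):

```python
import requests
import xml.etree.ElementTree as ET

JENKINS = "https://integration.wikimedia.org/ci"   # assumed Jenkins base URL
AUTH = ("jenkins-admin", "api-token")               # hypothetical credentials
NODES = ["integration-slave-trusty-1009", "integration-slave-trusty-1010"]

for node in NODES:
    url = f"{JENKINS}/computer/{node}/config.xml"
    # Fetch the node's config.xml, drop the executor count to 1, post it back.
    config = ET.fromstring(requests.get(url, auth=AUTH).text)
    config.find("numExecutors").text = "1"          # was 2
    requests.post(url, data=ET.tostring(config), auth=AUTH).raise_for_status()
```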

No mistake @Krinkle. I really thought 2GB would be enough to run two of our jobs in parallel ... :} salt/puppet etc. randomly kicking in must consume what is left and end up exhausting the memory :(

greg added a comment.Feb 10 2016, 11:10 PM

> No mistake @Krinkle. I really thought 2GB would be enough to run two of our jobs in parallel ... :} salt/puppet etc. randomly kicking in must consume what is left and end up exhausting the memory :(

https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3AIntegration%2FSetup&type=revision&diff=299023&oldid=298137

I have changed the six slaves we have created to have only one executor.

We are out of quota though:

Cores       81/85            <---
RAM         151552/204800    ok
Instances   29/29            <---

For 8 more executor slots we would need 8 more ci.medium instances:

Stuff   ci.medium   8 of them
CPU     2           16
RAM     2GB         16GB
Disk    40GB        320GB

We would need:

  • CPU quota raised to 97 (85 + 12)
  • instance quota raised to 37 (29 + 8)

Not sure whether labs infra can handle it.
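As a sanity check, the quota arithmetic above can be reproduced in a few lines (a sketch only, using the usage and flavor figures quoted in this comment):

```python
# Current usage / quota, taken from the table above.
cores_used, cores_quota = 81, 85
instances_used, instances_quota = 29, 29

new_instances = 8
new_cores = new_instances * 2   # ci.medium has 2 CPUs each

print("cores needed:    ", cores_used + new_cores)           # 97 -> raise quota by 12
print("instances needed:", instances_used + new_instances)   # 37 -> raise quota by 8
```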

Additionally we have a couple of tmpfs file systems:

/var/lib/mysql                    256 MB
/mnt/home/jenkins-deploy/tmpfs    512 MB

That does not help on the ci.medium instances that only have 2GB ...
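To illustrate the budget: tmpfs pages live in RAM (and swap), so if the two mounts listed above were fully used they would compete directly with the test processes. A minimal sketch of the resulting headroom on a 2GB ci.medium (assuming both mounts filled up):

```python
total_ram_mb = 2048
tmpfs_mb = 256 + 512   # /var/lib/mysql + /mnt/home/jenkins-deploy/tmpfs, fully used
left_mb = total_ram_mb - tmpfs_mb
print(left_mb, "MB left for the OS, the Jenkins agent and the job")   # 1280 MB
```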

Depooling them from Jenkins

Change 269880 had a related patch set uploaded (by Hashar):
contint: lower tmpfs from 512MB to 200MB

https://gerrit.wikimedia.org/r/269880

I have disabled puppet, cherry-picked the patch with tmpfs lowered to 128MB, and pooled back integration-slave-trusty-1009 and integration-slave-trusty-1010 with a single executor.

Now have to monitor them and see what happens.

All slaves now have 128MB tmpfs instead of 512MB. I pooled back the various ci.medium slaves we created yesterday.

@hashar since you're getting rid of those instances, does that mean the load will get high again, or will they be replaced with ones that have more memory?

The CI slaves we added yesterday do not have enough memory. An example of Linux triggering the OOM killer:

[Thu Feb 11 16:59:51 2016] Killed process 27671 (php5) total-vm:1184088kB, anon-rss:765928kB, file-rss:920kB

That doesn't fit.

[Thu Feb 11 16:59:51 2016] php5 invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[Thu Feb 11 16:59:51 2016] php5 cpuset=/ mems_allowed=0
[Thu Feb 11 16:59:51 2016] CPU: 1 PID: 27671 Comm: php5 Not tainted 3.13.0-76-generic #120-Ubuntu
[Thu Feb 11 16:59:51 2016] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[Thu Feb 11 16:59:51 2016] 0000000000000000 ffff880024685968 ffffffff81724b70 ffff880000171800
[Thu Feb 11 16:59:51 2016] ffff8800246859f0 ffffffff8171f177 0000000000000000 0000000000000000
[Thu Feb 11 16:59:51 2016] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[Thu Feb 11 16:59:51 2016] Call Trace:
[Thu Feb 11 16:59:51 2016] [<ffffffff81724b70>] dump_stack+0x45/0x56
[Thu Feb 11 16:59:51 2016] [<ffffffff8171f177>] dump_header+0x7f/0x1f1
[Thu Feb 11 16:59:51 2016] [<ffffffff8115308e>] oom_kill_process+0x1ce/0x330
[Thu Feb 11 16:59:51 2016] [<ffffffff812d85a5>] ? security_capable_noaudit+0x15/0x20
[Thu Feb 11 16:59:51 2016] [<ffffffff811537c4>] out_of_memory+0x414/0x450
[Thu Feb 11 16:59:51 2016] [<ffffffff81159b00>] __alloc_pages_nodemask+0xa60/0xb80
[Thu Feb 11 16:59:51 2016] [<ffffffff81198073>] alloc_pages_current+0xa3/0x160
[Thu Feb 11 16:59:51 2016] [<ffffffff8114fc47>] __page_cache_alloc+0x97/0xc0
[Thu Feb 11 16:59:51 2016] [<ffffffff81151655>] filemap_fault+0x185/0x410
[Thu Feb 11 16:59:51 2016] [<ffffffff8117652f>] __do_fault+0x6f/0x530
[Thu Feb 11 16:59:51 2016] [<ffffffff8117a3b2>] handle_mm_fault+0x482/0xf10
[Thu Feb 11 16:59:51 2016] [<ffffffff8117e418>] ? find_vma+0x28/0x60
[Thu Feb 11 16:59:51 2016] [<ffffffff81730cb4>] __do_page_fault+0x184/0x570
[Thu Feb 11 16:59:51 2016] [<ffffffff81200361>] ? fsnotify+0x241/0x320
[Thu Feb 11 16:59:51 2016] [<ffffffff810a0605>] ? set_next_entity+0x95/0xb0
[Thu Feb 11 16:59:51 2016] [<ffffffff8101260b>] ? __switch_to+0x16b/0x4d0
[Thu Feb 11 16:59:51 2016] [<ffffffff817310ba>] do_page_fault+0x1a/0x70
[Thu Feb 11 16:59:51 2016] [<ffffffff810cde5e>] ? getnstimeofday+0xe/0x30
[Thu Feb 11 16:59:51 2016] [<ffffffff81730729>] do_async_page_fault+0x29/0xe0
[Thu Feb 11 16:59:51 2016] [<ffffffff8172d418>] async_page_fault+0x28/0x30
[Thu Feb 11 16:59:51 2016] Mem-Info:
[Thu Feb 11 16:59:51 2016] Node 0 DMA per-cpu:
[Thu Feb 11 16:59:51 2016] CPU 0: hi: 0, btch: 1 usd: 0
[Thu Feb 11 16:59:51 2016] CPU 1: hi: 0, btch: 1 usd: 0
[Thu Feb 11 16:59:51 2016] Node 0 DMA32 per-cpu:
[Thu Feb 11 16:59:51 2016] CPU 0: hi: 186, btch: 31 usd: 41
[Thu Feb 11 16:59:51 2016] CPU 1: hi: 186, btch: 31 usd: 181
[Thu Feb 11 16:59:51 2016] active_anon:356742 inactive_anon:119641 isolated_anon:0
[Thu Feb 11 16:59:51 2016] active_file:54 inactive_file:45 isolated_file:0
[Thu Feb 11 16:59:51 2016] unevictable:1350 dirty:0 writeback:0 unstable:0
[Thu Feb 11 16:59:51 2016] free:14392 slab_reclaimable:4119 slab_unreclaimable:4893
[Thu Feb 11 16:59:51 2016] mapped:944 shmem:7812 pagetables:5734 bounce:0
[Thu Feb 11 16:59:51 2016] free_cma:0
[Thu Feb 11 16:59:51 2016] Node 0 DMA free:8280kB min:348kB low:432kB high:520kB active_anon:3228kB inactive_anon:3948kB active_file:0kB inactive_file:4kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:32kB dirty:0kB writeback:0kB mapped:36kB shmem:592kB slab_reclaimable:56kB slab_unreclaimable:284kB kernel_stack:16kB pagetables:28kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:6 all_unreclaimable? yes
[Thu Feb 11 16:59:51 2016] lowmem_reserve[]: 0 1983 1983 1983
[Thu Feb 11 16:59:51 2016] Node 0 DMA32 free:49288kB min:44704kB low:55880kB high:67056kB active_anon:1423740kB inactive_anon:474616kB active_file:216kB inactive_file:176kB unevictable:5368kB isolated(anon):0kB isolated(file):0kB present:2080760kB managed:2034028kB mlocked:5368kB dirty:0kB writeback:0kB mapped:3740kB shmem:30656kB slab_reclaimable:16420kB slab_unreclaimable:19288kB kernel_stack:2168kB pagetables:22908kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:680 all_unreclaimable? yes
[Thu Feb 11 16:59:51 2016] lowmem_reserve[]: 0 0 0 0
[Thu Feb 11 16:59:51 2016] Node 0 DMA: 10*4kB (UEM) 8*8kB (UEM) 5*16kB (UEM) 11*32kB (UM) 7*64kB (UEM) 5*128kB (UEM) 0*256kB 3*512kB (UEM) 1*1024kB (E) 2*2048kB (UR) 0*4096kB = 8280kB
[Thu Feb 11 16:59:51 2016] Node 0 DMA32: 1523*4kB (UEM) 755*8kB (UEM) 597*16kB (UEM) 320*32kB (UEM) 140*64kB (UEM) 24*128kB (UE) 5*256kB (EM) 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 49332kB
[Thu Feb 11 16:59:51 2016] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Thu Feb 11 16:59:51 2016] 8933 total pagecache pages
[Thu Feb 11 16:59:51 2016] 129 pages in swap cache
[Thu Feb 11 16:59:51 2016] Swap cache stats: add 1089082, delete 1088953, find 582934/648236
[Thu Feb 11 16:59:51 2016] Free swap = 0kB
[Thu Feb 11 16:59:51 2016] Total swap = 499708kB
[Thu Feb 11 16:59:51 2016] 524188 pages RAM
[Thu Feb 11 16:59:51 2016] 0 pages HighMem/MovableOnly
[Thu Feb 11 16:59:51 2016] 11683 pages reserved
[Thu Feb 11 16:59:51 2016] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu Feb 11 16:59:51 2016] [ 324] 0 324 4868 96 14 28 0 upstart-udev-br
[Thu Feb 11 16:59:51 2016] [ 329] 0 329 10808 183 24 97 -1000 systemd-udevd
[Thu Feb 11 16:59:51 2016] [ 434] 0 434 12173 184 29 106 0 lldpd
[Thu Feb 11 16:59:51 2016] [ 487] 104 487 11116 46 23 93 0 lldpd
[Thu Feb 11 16:59:51 2016] [ 516] 0 516 4547 72 12 534 0 dhclient
[Thu Feb 11 16:59:51 2016] [ 557] 0 557 6258 123 18 65 0 rpcbind
[Thu Feb 11 16:59:51 2016] [ 583] 107 583 6834 175 19 121 0 rpc.statd
[Thu Feb 11 16:59:51 2016] [ 602] 0 602 3847 38 12 71 0 upstart-socket-
[Thu Feb 11 16:59:51 2016] [ 703] 102 703 8744 159 23 46 0 dbus-daemon
[Thu Feb 11 16:59:51 2016] [ 763] 0 763 5342 22 15 53 0 rpc.idmapd
[Thu Feb 11 16:59:51 2016] [ 780] 0 780 9508 221 24 43 0 systemd-logind
[Thu Feb 11 16:59:51 2016] [ 801] 0 801 3818 44 13 38 0 upstart-file-br
[Thu Feb 11 16:59:51 2016] [ 890] 0 890 2050 169 8 31 0 getty
[Thu Feb 11 16:59:51 2016] [ 892] 0 892 2050 169 8 31 0 getty
[Thu Feb 11 16:59:51 2016] [ 897] 0 897 2050 169 8 30 0 getty
[Thu Feb 11 16:59:51 2016] [ 898] 0 898 2050 169 9 32 0 getty
[Thu Feb 11 16:59:51 2016] [ 900] 0 900 2050 169 8 28 0 getty
[Thu Feb 11 16:59:51 2016] [ 924] 0 924 4329 180 13 36 0 cron
[Thu Feb 11 16:59:51 2016] [ 936] 0 936 4797 89 15 27 0 irqbalance
[Thu Feb 11 16:59:51 2016] [ 975] 106 975 12520 121 27 102 0 exim4
[Thu Feb 11 16:59:51 2016] [ 1216] 0 1216 2280 162 10 28 0 getty
[Thu Feb 11 16:59:51 2016] [ 1338] 103 1338 112706 1313 51 157 0 nslcd
[Thu Feb 11 16:59:51 2016] [ 1354] 0 1354 321836 305 76 484 0 nscd
[Thu Feb 11 16:59:51 2016] [ 1485] 0 1485 115742 421 120 8328 0 salt-minion
[Thu Feb 11 16:59:51 2016] [ 3738] 109 3738 121915 1858 70 2329 0 diamond
[Thu Feb 11 16:59:51 2016] [ 5642] 0 5642 137153 1159 123 8507 0 salt-minion
[Thu Feb 11 16:59:51 2016] [ 6482] 0 6482 14774 190 32 133 -1000 sshd
[Thu Feb 11 16:59:51 2016] [ 6700] 110 6700 5771 47 13 115 0 nrpe
[Thu Feb 11 16:59:51 2016] [ 7659] 111 7659 6505 184 13 94 0 ntpd
[Thu Feb 11 16:59:51 2016] [21020] 996 21020 22661 1620 46 640 0 Xvfb
[Thu Feb 11 16:59:51 2016] [15683] 995 15683 18804 899 38 977 0 gmond
[Thu Feb 11 16:59:51 2016] [30591] 101 30591 63631 316 27 643 0 rsyslogd
[Thu Feb 11 16:59:51 2016] [16821] 113 16821 341029 20019 220 12142 0 mysqld
[Thu Feb 11 16:59:51 2016] [25537] 0 25537 30157 219 60 259 0 sshd
[Thu Feb 11 16:59:51 2016] [25554] 2947 25554 31110 81 57 1172 0 sshd
[Thu Feb 11 16:59:51 2016] [23894] 0 23894 30157 224 61 253 0 sshd
[Thu Feb 11 16:59:51 2016] [23932] 2947 23932 30429 188 56 414 0 sshd
[Thu Feb 11 16:59:51 2016] [23962] 2947 23962 2779 171 10 50 0 bash
[Thu Feb 11 16:59:51 2016] [23963] 2947 23963 527278 39558 204 2533 0 java
[Thu Feb 11 16:59:51 2016] [24705] 0 24705 30157 225 60 252 0 sshd
[Thu Feb 11 16:59:51 2016] [24736] 2947 24736 30639 241 57 507 0 sshd
[Thu Feb 11 16:59:51 2016] [24758] 2947 24758 2779 122 9 49 0 bash
[Thu Feb 11 16:59:51 2016] [24759] 2947 24759 527542 44070 258 25705 0 java
[Thu Feb 11 16:59:51 2016] [ 7464] 0 7464 30157 226 62 250 0 sshd
[Thu Feb 11 16:59:51 2016] [ 7481] 2947 7481 30453 224 56 389 0 sshd
[Thu Feb 11 16:59:51 2016] [ 7499] 2947 7499 2779 171 10 50 0 bash
[Thu Feb 11 16:59:51 2016] [ 7500] 2947 7500 526964 26280 189 6966 0 java
[Thu Feb 11 16:59:51 2016] [25036] 0 25036 4419 1352 15 0 0 atop
[Thu Feb 11 16:59:51 2016] [27736] 0 27736 113652 276 139 1799 0 apache2
[Thu Feb 11 16:59:51 2016] [27740] 33 27740 115567 2777 171 1365 0 apache2
[Thu Feb 11 16:59:51 2016] [ 7257] 33 7257 117056 4178 162 1470 0 apache2
[Thu Feb 11 16:59:51 2016] [ 3567] 33 3567 122073 4224 170 6467 0 apache2
[Thu Feb 11 16:59:51 2016] [ 4316] 33 4316 116126 1936 172 2776 0 apache2
[Thu Feb 11 16:59:51 2016] [ 5104] 33 5104 116810 3816 174 1568 0 apache2
[Thu Feb 11 16:59:51 2016] [ 5125] 33 5125 116317 2387 171 2494 0 apache2
[Thu Feb 11 16:59:51 2016] [ 4321] 33 4321 118293 5123 164 1765 0 apache2
[Thu Feb 11 16:59:51 2016] [22749] 33 22749 117719 367 146 5930 0 apache2
[Thu Feb 11 16:59:51 2016] [11352] 33 11352 113672 225 129 1786 0 apache2
[Thu Feb 11 16:59:51 2016] [11385] 33 11385 113672 215 129 1784 0 apache2
[Thu Feb 11 16:59:51 2016] [27666] 2947 27666 2782 184 10 0 0 bash
[Thu Feb 11 16:59:51 2016] [27668] 2947 27668 2787 223 11 0 0 mw-run-phpunit-
[Thu Feb 11 16:59:51 2016] [27671] 2947 27671 296022 191712 496 688 0 php5
[Thu Feb 11 16:59:51 2016] [13520] 2947 13520 2782 184 10 0 0 bash
[Thu Feb 11 16:59:51 2016] [13526] 2947 13526 2786 222 10 0 0 mw-run-phpunit-
[Thu Feb 11 16:59:51 2016] [13538] 2947 13538 326916 112055 473 159 0 hhvm
[Thu Feb 11 16:59:51 2016] [13808] 2947 13808 88712 5858 130 35 0 hhvm
[Thu Feb 11 16:59:51 2016] [13809] 2947 13809 88712 5858 130 35 0 hhvm
[Thu Feb 11 16:59:51 2016] [13816] 2947 13816 88712 5858 130 35 0 hhvm
[Thu Feb 11 16:59:51 2016] [13817] 2947 13817 88712 5858 130 35 0 hhvm
[Thu Feb 11 16:59:51 2016] [13818] 2947 13818 88712 5887 134 35 0 hhvm
[Thu Feb 11 16:59:51 2016] Out of memory: Kill process 27671 (php5) score 302 or sacrifice child
[Thu Feb 11 16:59:51 2016] Killed process 27671 (php5) total-vm:1184088kB, anon-rss:765928kB, file-rss:920kB
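The process table above already explains the kill: the rss column is counted in 4 kB pages, and "Free swap = 0kB" shows the ~488 MB of swap was exhausted. A rough tally of the largest residents (a sketch only; shared pages make this an over-estimate, and the exact grouping below is mine):

```python
page_kb = 4
rss_pages = {
    "php5 (PHPUnit)": 191712,
    "hhvm": 112055,
    "java (Jenkins agents)": 39558 + 44070 + 26280,
    "mysqld": 20019,
}
for name, pages in rss_pages.items():
    print(f"{name:22s} {pages * page_kb / 1024:7.0f} MB")
print(f"{'total':22s} {sum(rss_pages.values()) * page_kb / 1024:7.0f} MB")  # ~1694 MB, on a 2 GB instance
```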

I am thus depooling and deleting ALL the ci.medium instances we created yesterday. 2GB is not enough for MediaWiki-related tests.

hashar closed this task as "Resolved".Feb 11 2016, 5:32 PM
hashar claimed this task.

Fixed by deleting all ci.medium instances. There are still blockers, but they are not really blocking anymore since the instances are gone :-}

Ok. Will they be replaced with ones that have more memory?

greg added a comment.Feb 11 2016, 5:33 PM

yes, that's the point :)

@greg thanks for replying.

hashar changed the status of subtask T126594: Disable HHVM fcgi server on CI slaves from "Open" to "Stalled".Feb 11 2016, 9:17 PM

This is definitely solved. Here is the summary:

On Feb 10th we pooled ci.medium instances that only have 2GB of RAM, to accommodate the large shift of jobs from Precise to Trusty for php55 (see T126423).

We noticed a bunch of issues; eventually I depooled them at midnight and went to bed.

The day after, Feb 11th, during European business hours, I kept the slaves around to investigate/monitor/take traces/swear. Eventually, around 17:30 UTC, I depooled and deleted them all.

I then created a bunch of m1.large slaves (8GB RAM, 4 CPUs). I finished the provisioning after dinner; despite a minor issue, all six new m1.large slaves were operational by 21:30 UTC.

UbuntuTrusty label in Jenkins for the last 24 hours:

We went from 32 executors to 56 :-}

Change 269880 merged by Dzahn:
contint: lower tmpfs from 512MB to 256MB

https://gerrit.wikimedia.org/r/269880

hashar changed the status of subtask T126594: Disable HHVM fcgi server on CI slaves from "Stalled" to "Open".Apr 20 2016, 1:57 PM