Unable to SSH onto tools-login.wmflabs.org
Closed, Resolved, Public

Description

At the moment I am unable to SSH onto tools-login.wmflabs.org:

$ ssh ireas@login.tools.wmflabs.org
Permission denied (publickey,hostbased).

It seems that I am not the only user with this problem; see today's #wikimedia-labs logs at 9:27.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The Labs LDAP apparently has some kind of trouble.

[08:52:28] <icinga-wm> PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server

I can't authenticate on Jenkins (which uses LDAP for authentication).

Nodepool cannot access the OpenStack API either; requests yield error 500.

I have poked the internal operations list. Can't further babysit this task right now though :-(

Mentioned in SAL [2016-03-19T10:51:50Z] <hashar> Labs LDAP is probably down. T130446 Cant log to tools-login.wmflabs.org / Jenkins interface and Nodepool yields error 500 communicating with OpenStack API

Luke081515 triaged this task as Unbreak Now! priority. Mar 19 2016, 11:06 AM
Luke081515 awarded a token.
Luke081515 subscribed.

looks like slapd got oom-killed, I've restarted it on seaborgium

Mar 19 08:48:29 seaborgium puppet-agent[8502]: Caching catalog for seaborgium.wikimedia.org
Mar 19 08:48:30 seaborgium kernel: [3354892.550626] puppet invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Mar 19 08:48:30 seaborgium kernel: [3354892.550631] puppet cpuset=/ mems_allowed=0
Mar 19 08:48:30 seaborgium kernel: [3354892.550641] CPU: 2 PID: 8502 Comm: puppet Not tainted 3.19.0-2-amd64 #1 Debian 3.19.3-9
Mar 19 08:48:30 seaborgium kernel: [3354892.550643] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
Mar 19 08:48:30 seaborgium kernel: [3354892.550646]  0000000000000000 0000000000000000 ffffffff8154da6b 00000000000280da
Mar 19 08:48:30 seaborgium kernel: [3354892.550649]  ffffffff8154cd7a 0000000000000002 ffffffff815513be 00000000ffffffff
Mar 19 08:48:30 seaborgium kernel: [3354892.550650]  ffffffff8106d317 ffffffff818f62c0 ffffffff810c4d2c ffff8800bb779418
Mar 19 08:48:30 seaborgium kernel: [3354892.550653] Call Trace:
Mar 19 08:48:30 seaborgium kernel: [3354892.550680]  [<ffffffff8154da6b>] ? dump_stack+0x40/0x50
Mar 19 08:48:30 seaborgium kernel: [3354892.550694]  [<ffffffff8154cd7a>] ? dump_header+0x95/0x1fd
Mar 19 08:48:30 seaborgium kernel: [3354892.550700]  [<ffffffff815513be>] ? mutex_lock+0xe/0x30
Mar 19 08:48:30 seaborgium kernel: [3354892.550714]  [<ffffffff8106d317>] ? put_online_cpus+0x27/0xa0
Mar 19 08:48:30 seaborgium kernel: [3354892.550723]  [<ffffffff810c4d2c>] ? rcu_oom_notify+0xcc/0xe0
Mar 19 08:48:30 seaborgium kernel: [3354892.550735]  [<ffffffff81152f67>] ? oom_kill_process+0x247/0x390
Mar 19 08:48:30 seaborgium kernel: [3354892.550737]  [<ffffffff81152adf>] ? find_lock_task_mm+0x3f/0xa0
Mar 19 08:48:30 seaborgium kernel: [3354892.550739]  [<ffffffff81153492>] ? out_of_memory+0x232/0x510
Mar 19 08:48:30 seaborgium kernel: [3354892.550742]  [<ffffffff81159071>] ? __alloc_pages_nodemask+0xac1/0xba0
Mar 19 08:48:30 seaborgium kernel: [3354892.550749]  [<ffffffff8119e4f7>] ? alloc_pages_vma+0xa7/0x1c0
Mar 19 08:48:30 seaborgium kernel: [3354892.550751]  [<ffffffff8115d420>] ? __put_single_page+0x20/0x20
Mar 19 08:48:30 seaborgium kernel: [3354892.550756]  [<ffffffff8117ed19>] ? handle_mm_fault+0xdd9/0x1040
Mar 19 08:48:30 seaborgium kernel: [3354892.550761]  [<ffffffff8105ca6b>] ? __do_page_fault+0x1ab/0x550
Mar 19 08:48:30 seaborgium kernel: [3354892.550764]  [<ffffffff811867c8>] ? mprotect_fixup+0x138/0x210
Mar 19 08:48:30 seaborgium kernel: [3354892.550767]  [<ffffffff81555658>] ? async_page_fault+0x28/0x30
Mar 19 08:48:30 seaborgium kernel: [3354892.550768] Mem-Info:
Mar 19 08:48:30 seaborgium kernel: [3354892.550772] Node 0 DMA per-cpu:
Mar 19 08:48:30 seaborgium kernel: [3354892.550774] CPU    0: hi:    0, btch:   1 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550775] CPU    1: hi:    0, btch:   1 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550776] CPU    2: hi:    0, btch:   1 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550777] CPU    3: hi:    0, btch:   1 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550778] Node 0 DMA32 per-cpu:
Mar 19 08:48:30 seaborgium kernel: [3354892.550780] CPU    0: hi:  186, btch:  31 usd:  76
Mar 19 08:48:30 seaborgium kernel: [3354892.550781] CPU    1: hi:  186, btch:  31 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550781] CPU    2: hi:  186, btch:  31 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550782] CPU    3: hi:  186, btch:  31 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550783] Node 0 Normal per-cpu:
Mar 19 08:48:30 seaborgium kernel: [3354892.550784] CPU    0: hi:  186, btch:  31 usd:  51
Mar 19 08:48:30 seaborgium kernel: [3354892.550785] CPU    1: hi:  186, btch:  31 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550786] CPU    2: hi:  186, btch:  31 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550787] CPU    3: hi:  186, btch:  31 usd:   0
Mar 19 08:48:30 seaborgium kernel: [3354892.550791] active_anon:707128 inactive_anon:264465 isolated_anon:0
Mar 19 08:48:30 seaborgium kernel: [3354892.550791]  active_file:0 inactive_file:98 isolated_file:0
Mar 19 08:48:30 seaborgium kernel: [3354892.550791]  unevictable:1517 dirty:14 writeback:0 unstable:0
Mar 19 08:48:30 seaborgium kernel: [3354892.550791]  free:22396 slab_reclaimable:3738 slab_unreclaimable:5668
Mar 19 08:48:30 seaborgium kernel: [3354892.550791]  mapped:2267 shmem:19787 pagetables:3212 bounce:0
Mar 19 08:48:30 seaborgium kernel: [3354892.550791]  free_cma:0
Mar 19 08:48:30 seaborgium kernel: [3354892.550794] Node 0 DMA free:15872kB min:264kB low:328kB high:396kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 19 08:48:30 seaborgium kernel: [3354892.550798] lowmem_reserve[]: 0 2980 3939 3939
Mar 19 08:48:30 seaborgium kernel: [3354892.550800] Node 0 DMA32 free:56236kB min:50932kB low:63664kB high:76396kB active_anon:2361856kB inactive_anon:590952kB active_file:0kB inactive_file:96kB unevictable:3980kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3054388kB mlocked:3980kB dirty:144kB writeback:0kB mapped:3304kB shmem:52804kB slab_reclaimable:8380kB slab_unreclaimable:14952kB kernel_stack:1104kB pagetables:9616kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:572 all_unreclaimable? no
Mar 19 08:48:30 seaborgium kernel: [3354892.550804] lowmem_reserve[]: 0 0 958 958
Mar 19 08:48:30 seaborgium kernel: [3354892.550806] Node 0 Normal free:17352kB min:16380kB low:20472kB high:24568kB active_anon:466656kB inactive_anon:466908kB active_file:92kB inactive_file:396kB unevictable:2088kB isolated(anon):0kB isolated(file):0kB present:1048576kB managed:981752kB mlocked:2088kB dirty:0kB writeback:0kB mapped:5764kB shmem:26344kB slab_reclaimable:6568kB slab_unreclaimable:7720kB kernel_stack:896kB pagetables:3232kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:472 all_unreclaimable? no
Mar 19 08:48:30 seaborgium kernel: [3354892.550809] lowmem_reserve[]: 0 0 0 0
Mar 19 08:48:30 seaborgium kernel: [3354892.550811] Node 0 DMA: 2*4kB (UE) 1*8kB (E) 1*16kB (E) 3*32kB (UE) 4*64kB (UE) 1*128kB (E) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ER) 2*4096kB (M) = 15872kB
Mar 19 08:48:30 seaborgium kernel: [3354892.550820] Node 0 DMA32: 984*4kB (UEM) 725*8kB (UEM) 485*16kB (UEM) 267*32kB (UEM) 158*64kB (UEM) 69*128kB (UEM) 21*256kB (UEM) 4*512kB (EM) 0*1024kB 0*2048kB 1*4096kB (R) = 56504kB
Mar 19 08:48:30 seaborgium kernel: [3354892.550828] Node 0 Normal: 369*4kB (UEM) 262*8kB (UEM) 139*16kB (UEM) 56*32kB (UEM) 32*64kB (UEM) 11*128kB (UEM) 5*256kB (UM) 2*512kB (U) 0*1024kB 0*2048kB 1*4096kB (R) = 17444kB
Mar 19 08:48:30 seaborgium kernel: [3354892.550849] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 19 08:48:30 seaborgium kernel: [3354892.550850] 20969 total pagecache pages
Mar 19 08:48:30 seaborgium kernel: [3354892.550852] 128 pages in swap cache
Mar 19 08:48:30 seaborgium kernel: [3354892.550855] Swap cache stats: add 268953, delete 268825, find 271298/277626
Mar 19 08:48:30 seaborgium kernel: [3354892.550856] Free swap  = 0kB
Mar 19 08:48:30 seaborgium kernel: [3354892.550857] Total swap = 998396kB
Mar 19 08:48:30 seaborgium kernel: [3354892.550858] 1048445 pages RAM
Mar 19 08:48:30 seaborgium kernel: [3354892.550858] 0 pages HighMem/MovableOnly
Mar 19 08:48:30 seaborgium kernel: [3354892.550859] 35433 pages reserved
Mar 19 08:48:30 seaborgium kernel: [3354892.550860] 0 pages hwpoisoned
Mar 19 08:48:30 seaborgium kernel: [3354892.550861] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Mar 19 08:48:30 seaborgium kernel: [3354892.550864] [  171]     0   171     8240     1328      22       43             0 systemd-journal
Mar 19 08:48:30 seaborgium kernel: [3354892.550866] [  173]     0   173    10259        2      21      192         -1000 systemd-udevd
Mar 19 08:48:30 seaborgium kernel: [3354892.550868] [  478]     0   478     4753        6      14       38             0 atd
Mar 19 08:48:30 seaborgium kernel: [3354892.550870] [  479]     0   479     6873       38      19       32             0 cron
Mar 19 08:48:30 seaborgium kernel: [3354892.550871] [  485]     0   485     4962       13      14       54             0 systemd-logind
Mar 19 08:48:30 seaborgium kernel: [3354892.550873] [  557]     0   557     1062        3       8       35             0 acpid
Mar 19 08:48:30 seaborgium kernel: [3354892.550875] [  562]     0   562     3602        3      12       36             0 agetty
Mar 19 08:48:30 seaborgium kernel: [3354892.550876] [  563]     0   563     3557        3      12       37             0 agetty
Mar 19 08:48:30 seaborgium kernel: [3354892.550878] [  567]     0   567    16736       10      30      247             0 bacula-fd
Mar 19 08:48:30 seaborgium kernel: [3354892.550880] [14062]   106 14062    10531      114      24       69          -900 dbus-daemon
Mar 19 08:48:30 seaborgium kernel: [3354892.550882] [14069]   111 14069    80786     3830      61     2439             0 diamond
Mar 19 08:48:30 seaborgium kernel: [3354892.550883] [14085]     0 14085     9270      106      24       91             0 rpcbind
Mar 19 08:48:30 seaborgium kernel: [3354892.550885] [14094]     0 14094    13969      413      29      122             0 lldpd
Mar 19 08:48:30 seaborgium kernel: [3354892.550886] [14097]   108 14097    13969       27      26      122             0 lldpd
Mar 19 08:48:30 seaborgium kernel: [3354892.550888] [14109]   999 14109    17657      336      39      701             0 gmond
Mar 19 08:48:30 seaborgium kernel: [3354892.550890] [14129]   107 14129     9320      352      24      147             0 rpc.statd
Mar 19 08:48:30 seaborgium kernel: [3354892.550891] [14141]     0 14141     5839        0      16       53             0 rpc.idmapd
Mar 19 08:48:30 seaborgium kernel: [3354892.550893] [14189]   105 14189    13312      265      28      144             0 exim4
Mar 19 08:48:30 seaborgium kernel: [3354892.550894] [14214]   110 14214     8447      446      21      107             0 ntpd
Mar 19 08:48:30 seaborgium kernel: [3354892.550896] [14219]     0 14219    65721      297      34       74             0 rsyslogd
Mar 19 08:48:30 seaborgium kernel: [3354892.550898] [14229]     0 14229    13896      438      31      131         -1000 sshd
Mar 19 08:48:30 seaborgium kernel: [3354892.550899] [12859]   113 12859  1458446   892462    2369   244149             0 slapd
Mar 19 08:48:30 seaborgium kernel: [3354892.550901] [21388]   112 21388     5946      468      16        0             0 nrpe
Mar 19 08:48:30 seaborgium kernel: [3354892.550903] [ 3241]     0  3241   129824     9400     114        0             0 salt-minion
Mar 19 08:48:30 seaborgium kernel: [3354892.550904] [11524]     0 11524     4588     1519      14        0             0 atop
Mar 19 08:48:30 seaborgium kernel: [3354892.550906] [ 8447]     0  8447    10556       81      25       10             0 cron
Mar 19 08:48:30 seaborgium kernel: [3354892.550908] [ 8448]     0  8448     1084      176       7        0             0 sh
Mar 19 08:48:30 seaborgium kernel: [3354892.550909] [ 8449]     0  8449     3309      424      10        0             0 puppet-run
Mar 19 08:48:30 seaborgium kernel: [3354892.550911] [ 8501]     0  8501     2519       95      10        0             0 timeout
Mar 19 08:48:30 seaborgium kernel: [3354892.550912] [ 8502]     0  8502    90908    45882     159        0             0 puppet
Mar 19 08:48:30 seaborgium kernel: [3354892.550914] Out of memory: Kill process 12859 (slapd) score 902 or sacrifice child
Mar 19 08:48:30 seaborgium kernel: [3354892.732696] Killed process 12859 (slapd) total-vm:5833784kB, anon-rss:3569848kB, file-rss:0kB
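
As a cross-check of the log above: the process table reports rss in pages, and assuming 4 kB pages (typical for x86-64, but not stated in the log), slapd's 892462 pages come to the same 3569848 kB that the kill line reports as anon-rss. The following is only a minimal Python sketch for picking the largest-RSS row out of such a table; the regular expression and the name largest_rss are illustrative and not part of any existing tool.

```python
import re

# Matches one row of the OOM-killer process table, e.g.
# "... [12859]   113 12859  1458446   892462    2369   244149   0 slapd"
# The kernel timestamp "[3354892.550899]" contains a dot, so it never matches.
ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)"
    r"\s+(?P<rss>\d+)\s+(?P<nr_ptes>\d+)\s+(?P<swapents>\d+)"
    r"\s+(?P<adj>-?\d+)\s+(?P<name>\S+)\s*$"
)

PAGE_KB = 4  # assumption: 4 kB pages


def largest_rss(lines):
    """Return (name, pid, rss_kb) for the row with the highest rss, or None."""
    best = None
    for line in lines:
        m = ROW.search(line)
        if not m:
            continue  # header, Mem-Info and call-trace lines are skipped
        rss_kb = int(m.group("rss")) * PAGE_KB
        if best is None or rss_kb > best[2]:
            best = (m.group("name"), int(m.group("pid")), rss_kb)
    return best
```

Feeding it the table rows from the log would single out slapd, which is consistent with the process the kernel chose to kill.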

Judging from Ganglia, there's a memory leak that eventually exhausted the swap.

serpens is still running fine. All labs instances use both serpens and seaborgium in their LDAP client config. tools-login and Nodepool should also be converted to use a second failover LDAP server in their configurations.

Mentioned in SAL [2016-03-19T12:34:50Z] <godog> service supervisor stop, causing high traffic from ldap server T130446

Possibly related: nslcd on zulip-01 was causing ~3MB/s of outgoing traffic on serpens, and now on seaborgium after a service nslcd restart. This is likely due to fast-respawning zulip processes managed by supervisord. I've stopped supervisord for now, and the high traffic to seaborgium has stopped.

Working for me too, thanks! Do you want to leave this task open to investigate the cause of the problem, or should I close it?

fgiunchedi lowered the priority of this task from Unbreak Now! to Medium. Mar 19 2016, 1:00 PM

Thanks @Ireas, we can leave it open as there's still some followup to do!

Mentioned in SAL [2016-03-19T13:04:29Z] <hashar> Jenkins: added ldap-labs-codfw.wikimedia.org as a fallback LDAP server T130446

All back for me as well. Thanks @fgiunchedi and @MoritzMuehlenhoff

For the record, Jenkins relied solely on ldap-labs.eqiad.wikimedia.org (seaborgium). I have added the other server as a fallback: ldap-labs-codfw.wikimedia.org.
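
To illustrate the fallback idea, here is a minimal Python sketch of trying a list of LDAP servers in order and using the first one that answers. The two hostnames are the ones named in this task; port 389, the 3 second timeout and the name first_reachable are assumptions made for the example. This is not how Jenkins or the labs LDAP client config implements failover, and a successful TCP connect only shows that the port is open, not that an LDAP bind would succeed.

```python
import socket

# Hostnames are the two named in this task; port 389 is an assumption.
LDAP_SERVERS = [
    ("ldap-labs.eqiad.wikimedia.org", 389),
    ("ldap-labs-codfw.wikimedia.org", 389),
]


def first_reachable(servers, connect=socket.create_connection, timeout=3.0):
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in servers:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            continue  # server down or unreachable: fall through to the next one
        conn.close()
        return (host, port)
    return None
```

The connect parameter exists only so the function can be exercised without touching the network; by default it uses socket.create_connection.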

Nodepool managed to boot an instance at 2016-03-19 08:15:49 UTC, and failed at 09:14 UTC. That seems to indicate the OpenStack configuration lacks a fallback to the codfw LDAP server.

yuvipanda claimed this task.