
db1115 (tendril) paged twice in 24h due to OOM
Closed, ResolvedPublic

Description

db1115 paged twice:

25th Aug (times in UTC):

00:07 <+icinga-wm> PROBLEM - Check whether ferm is active by checking the default input chain on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
00:07 <+icinga-wm> PROBLEM - MariaDB disk space #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
00:07 <+icinga-wm> PROBLEM - dump of s7 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:07 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
00:07 <+icinga-wm> PROBLEM - DPKG on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
00:07 <+icinga-wm> PROBLEM - configured eth on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
00:07 <+icinga-wm> PROBLEM - Check systemd state on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
00:07 <+icinga-wm> PROBLEM - dhclient process on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
00:07 <+icinga-wm> PROBLEM - Check size of conntrack table on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
00:08 <+icinga-wm> PROBLEM - MD RAID on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
00:08 <+icinga-wm> PROBLEM - mysqld processes #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
00:08 <+icinga-wm> PROBLEM - Disk space on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1115&var-datasource=eqiad+prometheus/ops
00:08 <+icinga-wm> PROBLEM - dump of s8 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:09 <+icinga-wm> PROBLEM - snapshot of s2 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:09 <+icinga-wm> PROBLEM - snapshot of s8 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:10 <+icinga-wm> RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
00:10 <+icinga-wm> PROBLEM - dump of s7 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:11 <+icinga-wm> PROBLEM - puppet last run on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
00:11 <+icinga-wm> PROBLEM - dump of m3 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:13 <+icinga-wm> PROBLEM - dump of s4 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:13 <+icinga-wm> PROBLEM - dump of s4 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:13 <+icinga-wm> RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
00:13 <+icinga-wm> PROBLEM - dump of m1 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:13 <+icinga-wm> PROBLEM - dump of m1 in eqiad on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
00:14 <+icinga-wm> PROBLEM - dump of s5 in codfw on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/Backups
14:48 <+icinga-wm> PROBLEM - mysqld processes #page on db1115 is CRITICAL: connect to address 10.64.0.122 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sun, Aug 25, 3:01 PM

On the first page, MySQL itself never went down.
The second page was caused by the kernel OOM killer killing mysqld:

[12466353.895744] mysqld invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0-1, order=0, oom_score_adj=-600
[12466353.895747] mysqld cpuset=/ mems_allowed=0-1
[12466353.895754] CPU: 15 PID: 25940 Comm: mysqld Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[12466353.895756] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.5.4 08/17/2017
[12466353.895758]  0000000000000000 ffffffffaa534524 ffff9dd00e083c28 ffff89742c602000
[12466353.895762]  ffffffffaa407610 0000000000000000 00000000fffffda8 0000000c00000000
[12466353.895765]  ffff89857552f100 ffffffffaa38b577 0000004200000000 eab93e12bfd0287e
[12466353.895770] Call Trace:
[12466353.895782]  [<ffffffffaa534524>] ? dump_stack+0x5c/0x78
[12466353.895787]  [<ffffffffaa407610>] ? dump_header+0x78/0x1fd
[12466353.895793]  [<ffffffffaa38b577>] ? get_page_from_freelist+0x3f7/0xb20
[12466353.895796]  [<ffffffffaa3874ea>] ? oom_kill_process+0x21a/0x3e0
[12466353.895798]  [<ffffffffaa38716d>] ? oom_badness+0xed/0x170
[12466353.895801]  [<ffffffffaa387981>] ? out_of_memory+0x111/0x470
[12466353.895804]  [<ffffffffaa38cade>] ? __alloc_pages_slowpath+0xa9e/0xbb0
[12466353.895807]  [<ffffffffaa38cdf1>] ? __alloc_pages_nodemask+0x201/0x260
[12466353.895811]  [<ffffffffaa3de751>] ? alloc_pages_current+0x91/0x140
[12466353.895813]  [<ffffffffaa385966>] ? filemap_fault+0x326/0x5d0
[12466353.895838]  [<ffffffffc03fea41>] ? ext4_filemap_fault+0x31/0x50 [ext4]
[12466353.895841]  [<ffffffffaa3b6557>] ? __do_fault+0x87/0x170
[12466353.895843]  [<ffffffffaa3bafa7>] ? handle_mm_fault+0xeb7/0x1310
[12466353.895848]  [<ffffffffaa2622f5>] ? __do_page_fault+0x255/0x4f0
[12466353.895852]  [<ffffffffaa81a618>] ? page_fault+0x28/0x30
[12466353.895853] Mem-Info:
[12466353.895862] active_anon:31339079 inactive_anon:1349263 isolated_anon:0
                   active_file:0 inactive_file:0 isolated_file:0
                   unevictable:0 dirty:1 writeback:2 unstable:0
                   slab_reclaimable:12121 slab_unreclaimable:19740
                   mapped:29245 shmem:313586 pagetables:86916 bounce:0
                   free:88901 free_pcp:2 free_cma:0
[12466353.895868] Node 0 active_anon:62385736kB inactive_anon:2770544kB active_file:8kB inactive_file:60kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:84168kB dirty:4kB writeback:4kB shmem:902652kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:21359 all_unreclaimable? no
[12466353.895873] Node 1 active_anon:62970580kB inactive_anon:2626508kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:32812kB dirty:0kB writeback:4kB shmem:351692kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 4096kB writeback_tmp:0kB unstable:0kB pages_scanned:6963 all_unreclaimable? yes
[12466353.895875] Node 0 DMA free:15900kB min:8kB low:20kB high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[12466353.895881] lowmem_reserve[]: 0 1855 64301 64301 64301
[12466353.895886] Node 0 DMA32 free:251080kB min:1296kB low:3192kB high:5088kB active_anon:1653704kB inactive_anon:1316kB active_file:0kB inactive_file:8kB unevictable:0kB writepending:0kB present:1985264kB managed:1919696kB mlocked:0kB slab_reclaimable:448kB slab_unreclaimable:244kB kernel_stack:36kB pagetables:3004kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[12466353.895893] lowmem_reserve[]: 0 0 62446 62446 62446
[12466353.895897] Node 0 Normal free:43600kB min:43680kB low:107624kB high:171568kB active_anon:60732032kB inactive_anon:2769228kB active_file:0kB inactive_file:52kB unevictable:0kB writepending:8kB present:65011712kB managed:63949792kB mlocked:0kB slab_reclaimable:30204kB slab_unreclaimable:44372kB kernel_stack:6856kB pagetables:148592kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[12466353.895904] lowmem_reserve[]: 0 0 0 0 0
[12466353.895908] Node 1 Normal free:45024kB min:45120kB low:111176kB high:177232kB active_anon:62970580kB inactive_anon:2626508kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:4kB present:67108864kB managed:66057308kB mlocked:0kB slab_reclaimable:17832kB slab_unreclaimable:34344kB kernel_stack:6660kB pagetables:196068kB bounce:0kB free_pcp:8kB local_pcp:0kB free_cma:0kB
[12466353.895913] lowmem_reserve[]: 0 0 0 0 0
[12466353.895917] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
[12466353.895934] Node 0 DMA32: 1097*4kB (UME) 916*8kB (UME) 917*16kB (UME) 508*32kB (UME) 283*64kB (UME) 177*128kB (UME) 99*256kB (UME) 68*512kB (UME) 39*1024kB (ME) 31*2048kB (ME) 1*4096kB (M) = 251092kB
[12466353.895954] Node 0 Normal: 11608*4kB (UME) 18*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46576kB
[12466353.895966] Node 1 Normal: 11829*4kB (UME) 8*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 47380kB
[12466353.895981] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[12466353.895982] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[12466353.895984] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[12466353.895985] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[12466353.895986] 348130 total pagecache pages
[12466353.895988] 33411 pages in swap cache
[12466353.895990] Swap cache stats: add 1353189, delete 1319778, find 357935/519479
[12466353.895991] Free swap  = 0kB
[12466353.895992] Total swap = 976316kB
[12466353.895993] 33530456 pages RAM
[12466353.895994] 0 pages HighMem/MovableOnly
[12466353.895995] 544782 pages reserved
[12466353.895996] 0 pages hwpoisoned
[12466353.895997] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[12466353.896008] [  659]     0   659     1784       78      10       3        0             0 mdadm
[12466353.896011] [  792]   112   792   879764     6387     157       8      322             0 prometheus-node
[12466353.896014] [  794]   112   794   327993     5375      77       7      124             0 prometheus-mysq
[12466353.896017] [  795]     0   795     6288      234      17       3        0             0 smartd
[12466353.896020] [  818]     0   818     7416       72      19       3        0             0 cron
[12466353.896023] [  819]     0   819    11624      145      27       3        0             0 systemd-logind
[12466353.896026] [  823]   107   823    11303      170      26       3        1          -900 dbus-daemon
[12466353.896029] [  867]     0   867     3183        2      11       3       36             0 mcelog
[12466353.896032] [  876]     0   876     3575        1      12       3       32             0 agetty
[12466353.896034] [  892]     0   892     3631        0      12       3       36             0 agetty
[12466353.896038] [18595]     0 18595    11304       40      23       3       66         -1000 systemd-udevd
[12466353.896041] [18909]   100 18909    31821      121      32       3       11             0 systemd-timesyn
[12466353.896044] [29060]     0 29060    85217    50970     170       4       72             0 systemd-journal
[12466353.896047] [20317]   113 20317    14597       40      35       3       61             0 ulogd
[12466353.896049] [25675]   499 25675 46512601 32280077   85644     180   224652          -600 mysqld
[12466353.896052] [11792]     0 11792    14586      169      33       3        0             0 lldpd
[12466353.896055] [11892]   108 11892    24503    10083      49       3        0             0 lldpd
[12466353.896059] [ 2515]     0  2515    17488      192      37       4        0         -1000 sshd
[12466353.896062] [10903]     0 10903   305603     1208      99       5        0             0 rsyslogd
[12466353.896065] [ 3586]   106  3586    14039      165      31       3        0             0 exim4
[12466353.896068] [26603]     0 26603   599131     2331     113       7        0             0 prometheus-rsys
[12466353.896074] [29025]     0 29025    12206      104      28       3        0             0 cron
[12466353.896076] [29125]     0 29125     1069       17       8       3        0             0 sh
[12466353.896079] [29127]     0 29127     2793       58       9       3        0             0 puppet-run
[12466353.896082] [32222]     0 32222     2522       22      10       3        0             0 timeout
[12466353.896085] [32223]     0 32223   119805    30509     192       4        0             0 puppet
[12466353.896087] Out of memory: Kill process 25675 (mysqld) score 380 or sacrifice child
[12466353.905481] Killed process 25675 (mysqld) total-vm:186050404kB, anon-rss:129121076kB, file-rss:0kB, shmem-rss:0kB
[12466370.108134] oom_reaper: reaped process 25675 (mysqld), now anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
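For scale, the kernel's process table above counts memory in 4 KiB pages; converting the mysqld row (pid 25675) shows how close it was to the host's total RAM (a quick sanity check using only numbers from the log above):

```python
PAGE_KIB = 4  # x86-64 base page size in KiB

# Values taken from the OOM report above
mysqld_rss_pages = 32_280_077  # "rss" column for pid 25675 (mysqld)
total_ram_pages = 33_530_456   # "pages RAM" line

rss_gib = mysqld_rss_pages * PAGE_KIB / 1024 / 1024
ram_gib = total_ram_pages * PAGE_KIB / 1024 / 1024
print(f"mysqld RSS ~= {rss_gib:.1f} GiB of ~= {ram_gib:.1f} GiB total RAM")
# prints: mysqld RSS ~= 123.1 GiB of ~= 127.9 GiB total RAM
```

That matches the kill message above (anon-rss:129121076kB ≈ 123 GiB): mysqld alone was using nearly all physical memory, with swap already exhausted (Free swap = 0kB).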

HW logs are clean, so this looks like the "usual" memory leak from tendril.
We might need to decrease the buffer pool size even further, or start disabling unused features in tendril (graphs and the like).
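As a sketch of the first option (the 40G figure is hypothetical, not from this ticket; the real value would have to be tuned against db1115's 128 GB of RAM and the leak's growth rate), the InnoDB buffer pool can be lowered in my.cnf:

```ini
# Hypothetical my.cnf fragment -- actual value must be tuned for db1115
[mysqld]
innodb_buffer_pool_size = 40G

# On MariaDB >= 10.2 the pool can also be resized online, without a restart:
#   SET GLOBAL innodb_buffer_pool_size = 40 * 1024 * 1024 * 1024;
```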

The first set of pages was due to the NRPE daemon being killed by the OOM killer, which made all of its checks fail:

Aug 25 00:03:51 db1115 systemd[1]: nagios-nrpe-server.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 25 00:03:51 db1115 systemd[1]: nagios-nrpe-server.service: Failed to fork: Cannot allocate memory
Aug 25 00:03:51 db1115 systemd[1]: nagios-nrpe-server.service: Failed to run 'stop-post' task: Cannot allocate memory
Aug 25 00:03:51 db1115 systemd[1]: nagios-nrpe-server.service: Unit entered failed state.
Aug 25 00:03:51 db1115 systemd[1]: nagios-nrpe-server.service: Failed with result 'exit-code'.
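A possible hardening (an assumption on my part, not something done in this ticket) would be a systemd drop-in so NRPE restarts automatically once memory frees up, instead of staying in a failed state until manual intervention:

```ini
# Hypothetical drop-in: /etc/systemd/system/nagios-nrpe-server.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=30
```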
Marostegui renamed this task from "db1115 (tendril) crashed twice in 24h" to "db1115 (tendril) paged twice in 24h due to OOM". · Sun, Aug 25, 3:52 PM
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui triaged this task as Normal priority. · Mon, Aug 26, 7:02 AM

This looks like the recurrent memory issues tendril has (T196726); however, this time there are no HW issues.

Marostegui closed this task as Resolved. · Wed, Aug 28, 5:04 AM
Marostegui claimed this task.

I am going to close this, as there is not much to do on this ticket itself, but it should be followed up on T231185: Disable/remove unused features on Tendril