Investigate db1082 crash
Closed, Resolved · Public

Description

db1082 crashed and deserves some investigation
it was depooled
mysql remains stopped for now

Details

Related Gerrit Patches:
operations/mediawiki-config : master : db-eqiad.php: Restore db1082 original weight
operations/mediawiki-config : master : db-eqiad.php: Increase weight db1082
operations/mediawiki-config : master : db-eqiad: Repool db1082
operations/mediawiki-config : master : db-eqiad.php: Depool db1082
operations/mediawiki-config : master : db-eqiad.php: Restore db1082 original weight
operations/mediawiki-config : master : db-eqiad.php: Repool db1082
operations/mediawiki-config : master : db-eqiad.php: Depool db1082
operations/mediawiki-config : master : db-eqiad.php: Restoring normal weight

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 13 2016, 4:14 PM
Marostegui added a comment. · Edited · Sep 13 2016, 4:29 PM

@jcrespo saw a kernel panic when he logged in via the console.

At a quick glance, the server has also been showing kernel errors lately (for the last few days):

Sep 13 11:06:17 db1082 kernel: [8880311.550721] BUG: Bad page state in process kworker/14:1  pfn:6144b98
Sep 13 11:06:17 db1082 kernel: [8880311.581783] page:ffffea018512e600 count:0 mapcount:-127 mapping:          (null) index:0x0
Sep 13 11:06:17 db1082 kernel: [8880311.621962] flags: 0x5ffff8000000000()
Sep 13 11:06:17 db1082 kernel: [8880311.640724] page dumped because: nonzero mapcount
Sep 13 11:06:17 db1082 kernel: [8880311.663822] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype xt_CT ip6table_raw ip6table_filter ip6_tables iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul jitterentropy_rng iTCO_wdt iTCO_vendor_support evdev sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr mgag200 ttm lpc_ich mfd_core drm_kms_helper sb_edac edac_core drm hpwdt i2c_i801 i2c_algo_bit hpilo ioatdma dca tpm_tis tpm 8250_fintek ipmi_si ipmi_msghandler pcc_cpufreq shpchp acpi_cpufreq processor wmi acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod sg crc32c_intel psmouse tg3 xhci_pci ptp ehci_pci hpsa uhci_hcd pps_core xhci_hcd ehci_hcd scsi_transport_sas libphy usbcore usb_common scsi_mod fjes
Sep 13 11:06:17 db1082 kernel: [8880311.663882] CPU: 14 PID: 1895 Comm: kworker/14:1 Tainted: G    B           4.4.0-1-amd64 #1 Debian 4.4.2-3+wmf2
Sep 13 11:06:17 db1082 kernel: [8880311.663884] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Sep 13 11:06:17 db1082 kernel: [8880311.663889] Workqueue: events sg_remove_sfp_usercontext [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.663891]  0000000000000286 00000000ecbe1cfe ffffffff812ee705 ffffea018512e600
Sep 13 11:06:17 db1082 kernel: [8880311.663892]  ffffffff8181d942 ffffffff8116dd78 ffffea018512e600 0000000000000000
Sep 13 11:06:17 db1082 kernel: [8880311.663894]  ffffea018512e800 ffffffff8116e6c9 ff00007fff400000 ff00007fffffffff
Sep 13 11:06:17 db1082 kernel: [8880311.663897] Call Trace:
Sep 13 11:06:17 db1082 kernel: [8880311.663905]  [<ffffffff812ee705>] ? dump_stack+0x5c/0x77
Sep 13 11:06:17 db1082 kernel: [8880311.663911]  [<ffffffff8116dd78>] ? bad_page.part.68+0xa8/0x100
Sep 13 11:06:17 db1082 kernel: [8880311.663912]  [<ffffffff8116e6c9>] ? free_pages_prepare+0x1f9/0x2f0
Sep 13 11:06:17 db1082 kernel: [8880311.663914]  [<ffffffff81170895>] ? __free_pages_ok+0x15/0xb0
Sep 13 11:06:17 db1082 kernel: [8880311.663916]  [<ffffffffa004c803>] ? sg_remove_scat.isra.13+0x73/0x130 [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.663917]  [<ffffffffa004dbb5>] ? sg_remove_sfp_usercontext+0x65/0x120 [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.663922]  [<ffffffff8109101d>] ? process_one_work+0x14d/0x410
Sep 13 11:06:17 db1082 kernel: [8880311.663924]  [<ffffffff81091a95>] ? worker_thread+0x65/0x460
Sep 13 11:06:17 db1082 kernel: [8880311.663928]  [<ffffffff81091a30>] ? rescuer_thread+0x310/0x310
Sep 13 11:06:17 db1082 kernel: [8880311.663930]  [<ffffffff81096cff>] ? kthread+0xdf/0x100
Sep 13 11:06:17 db1082 kernel: [8880311.663932]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880311.663938]  [<ffffffff8159435f>] ? ret_from_fork+0x3f/0x70
Sep 13 11:06:17 db1082 kernel: [8880311.663940]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880311.663941] BUG: Bad page state in process kworker/14:1  pfn:6144b99
Sep 13 11:06:17 db1082 kernel: [8880311.694279] page:ffffea018512e640 count:1 mapcount:1 mapping:ffff887f65a790c9 index:0x7f2a59ee0
Sep 13 11:06:17 db1082 kernel: [8880311.736805] flags: 0x5ffff800004007c(referenced|uptodate|dirty|lru|active|swapbacked)
Sep 13 11:06:17 db1082 kernel: [8880311.775141] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Sep 13 11:06:17 db1082 kernel: [8880311.807141] bad because of flags:
Sep 13 11:06:17 db1082 kernel: [8880311.823747] flags: 0x60(lru|active)
Sep 13 11:06:17 db1082 kernel: [8880311.840942] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype xt_CT ip6table_raw ip6table_filter ip6_tables iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul jitterentropy_rng iTCO_wdt iTCO_vendor_support evdev sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr mgag200 ttm lpc_ich mfd_core drm_kms_helper sb_edac edac_core drm hpwdt i2c_i801 i2c_algo_bit hpilo ioatdma dca tpm_tis tpm 8250_fintek ipmi_si ipmi_msghandler pcc_cpufreq shpchp acpi_cpufreq processor wmi acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod sg crc32c_intel psmouse tg3 xhci_pci ptp ehci_pci hpsa uhci_hcd pps_core xhci_hcd ehci_hcd scsi_transport_sas libphy usbcore usb_common scsi_mod fjes
Sep 13 11:06:17 db1082 kernel: [8880311.842024] CPU: 14 PID: 1895 Comm: kworker/14:1 Tainted: G    B           4.4.0-1-amd64 #1 Debian 4.4.2-3+wmf2
Sep 13 11:06:17 db1082 kernel: [8880311.842153] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Sep 13 11:06:17 db1082 kernel: [8880311.842161] Workqueue: events sg_remove_sfp_usercontext [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.842166]  0000000000000286 00000000ecbe1cfe ffffffff812ee705 ffffea018512e640
Sep 13 11:06:17 db1082 kernel: [8880311.842170]  ffffffff818194d0 ffffffff8116dd78 ffffea018512e640 0000000000000001
Sep 13 11:06:17 db1082 kernel: [8880311.842173]  ffffea018512e800 ffffffff8116e6c9 ff00007fff400000 ff00007fffffffff
Sep 13 11:06:17 db1082 kernel: [8880311.842177] Call Trace:
Sep 13 11:06:17 db1082 kernel: [8880311.842334]  [<ffffffff812ee705>] ? dump_stack+0x5c/0x77
Sep 13 11:06:17 db1082 kernel: [8880311.842467]  [<ffffffff8116dd78>] ? bad_page.part.68+0xa8/0x100
Sep 13 11:06:17 db1082 kernel: [8880311.842469]  [<ffffffff8116e6c9>] ? free_pages_prepare+0x1f9/0x2f0
Sep 13 11:06:17 db1082 kernel: [8880311.842470]  [<ffffffff81170895>] ? __free_pages_ok+0x15/0xb0
Sep 13 11:06:17 db1082 kernel: [8880311.842474]  [<ffffffffa004c803>] ? sg_remove_scat.isra.13+0x73/0x130 [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.842476]  [<ffffffffa004dbb5>] ? sg_remove_sfp_usercontext+0x65/0x120 [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.842481]  [<ffffffff8109101d>] ? process_one_work+0x14d/0x410
Sep 13 11:06:17 db1082 kernel: [8880311.842482]  [<ffffffff81091a95>] ? worker_thread+0x65/0x460
Sep 13 11:06:17 db1082 kernel: [8880311.842483]  [<ffffffff81091a30>] ? rescuer_thread+0x310/0x310
Sep 13 11:06:17 db1082 kernel: [8880311.842486]  [<ffffffff81096cff>] ? kthread+0xdf/0x100
Sep 13 11:06:17 db1082 kernel: [8880311.842487]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880311.842494]  [<ffffffff8159435f>] ? ret_from_fork+0x3f/0x70
Sep 13 11:06:17 db1082 kernel: [8880311.842495]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880311.842497] BUG: Bad page state in process kworker/14:1  pfn:6144b9a
Sep 13 11:06:17 db1082 kernel: [8880311.873329] page:ffffea018512e680 count:0 mapcount:-127 mapping:          (null) index:0xffff887f64d38fc0
Sep 13 11:06:17 db1082 kernel: [8880311.919947] flags: 0x5ffff8000000000()
Sep 13 11:06:17 db1082 kernel: [8880311.938859] page dumped because: nonzero mapcount
Sep 13 11:06:17 db1082 kernel: [8880311.962226] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype xt_CT ip6table_raw ip6table_filter ip6_tables iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul jitterentropy_rng iTCO_wdt iTCO_vendor_support evdev sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr mgag200 ttm lpc_ich mfd_core drm_kms_helper sb_edac edac_core drm hpwdt i2c_i801 i2c_algo_bit hpilo ioatdma dca tpm_tis tpm 8250_fintek ipmi_si ipmi_msghandler pcc_cpufreq shpchp acpi_cpufreq processor wmi acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod sg crc32c_intel psmouse tg3 xhci_pci ptp ehci_pci hpsa uhci_hcd pps_core xhci_hcd ehci_hcd scsi_transport_sas libphy usbcore usb_common scsi_mod fjes
Sep 13 11:06:17 db1082 kernel: [8880311.962405] CPU: 14 PID: 1895 Comm: kworker/14:1 Tainted: G    B           4.4.0-1-amd64 #1 Debian 4.4.2-3+wmf2
Sep 13 11:06:17 db1082 kernel: [8880311.962406] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Sep 13 11:06:17 db1082 kernel: [8880311.962409] Workqueue: events sg_remove_sfp_usercontext [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.962410]  0000000000000286 00000000ecbe1cfe ffffffff812ee705 ffffea018512e680
Sep 13 11:06:17 db1082 kernel: [8880311.962411]  ffffffff8181d942 ffffffff8116dd78 ffffea018512e680 0000000000000002
Sep 13 11:06:17 db1082 kernel: [8880311.962413]  ffffea018512e800 ffffffff8116e6c9 ff00007fff400000 ff00007fffffffff
Sep 13 11:06:17 db1082 kernel: [8880311.962414] Call Trace:
Sep 13 11:06:17 db1082 kernel: [8880311.962426]  [<ffffffff812ee705>] ? dump_stack+0x5c/0x77
Sep 13 11:06:17 db1082 kernel: [8880311.962429]  [<ffffffff8116dd78>] ? bad_page.part.68+0xa8/0x100
Sep 13 11:06:17 db1082 kernel: [8880311.962430]  [<ffffffff8116e6c9>] ? free_pages_prepare+0x1f9/0x2f0
Sep 13 11:06:17 db1082 kernel: [8880311.962432]  [<ffffffff81170895>] ? __free_pages_ok+0x15/0xb0
Sep 13 11:06:17 db1082 kernel: [8880311.962433]  [<ffffffffa004c803>] ? sg_remove_scat.isra.13+0x73/0x130 [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.962435]  [<ffffffffa004dbb5>] ? sg_remove_sfp_usercontext+0x65/0x120 [sg]
Sep 13 11:06:17 db1082 kernel: [8880311.962437]  [<ffffffff8109101d>] ? process_one_work+0x14d/0x410
Sep 13 11:06:17 db1082 kernel: [8880311.962438]  [<ffffffff81091a95>] ? worker_thread+0x65/0x460
Sep 13 11:06:17 db1082 kernel: [8880311.962439]  [<ffffffff81091a30>] ? rescuer_thread+0x310/0x310
Sep 13 11:06:17 db1082 kernel: [8880311.962441]  [<ffffffff81096cff>] ? kthread+0xdf/0x100
Sep 13 11:06:17 db1082 kernel: [8880311.962580]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880311.962584]  [<ffffffff8159435f>] ? ret_from_fork+0x3f/0x70
Sep 13 11:06:17 db1082 kernel: [8880311.962586]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880311.962587] BUG: Bad page state in process kworker/14:1  pfn:6144b9c
Sep 13 11:06:17 db1082 kernel: [8880311.993326] page:ffffea018512e700 count:0 mapcount:-127 mapping:          (null) index:0x0
Sep 13 11:06:17 db1082 kernel: [8880312.033806] flags: 0x5ffff8000000000()
Sep 13 11:06:17 db1082 kernel: [8880312.052544] page dumped because: nonzero mapcount
Sep 13 11:06:17 db1082 kernel: [8880312.075999] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype xt_CT ip6table_raw ip6table_filter ip6_tables iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul jitterentropy_rng iTCO_wdt iTCO_vendor_support evdev sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr mgag200 ttm lpc_ich mfd_core drm_kms_helper sb_edac edac_core drm hpwdt i2c_i801 i2c_algo_bit hpilo ioatdma dca tpm_tis tpm 8250_fintek ipmi_si ipmi_msghandler pcc_cpufreq shpchp acpi_cpufreq processor wmi acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod sg crc32c_intel psmouse tg3 xhci_pci ptp ehci_pci hpsa uhci_hcd pps_core xhci_hcd ehci_hcd scsi_transport_sas libphy usbcore usb_common scsi_mod fjes
Sep 13 11:06:17 db1082 kernel: [8880312.076057] CPU: 14 PID: 1895 Comm: kworker/14:1 Tainted: G    B           4.4.0-1-amd64 #1 Debian 4.4.2-3+wmf2
Sep 13 11:06:17 db1082 kernel: [8880312.076059] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Sep 13 11:06:17 db1082 kernel: [8880312.076063] Workqueue: events sg_remove_sfp_usercontext [sg]
Sep 13 11:06:17 db1082 kernel: [8880312.076064]  0000000000000286 00000000ecbe1cfe ffffffff812ee705 ffffea018512e700
Sep 13 11:06:17 db1082 kernel: [8880312.076066]  ffffffff8181d942 ffffffff8116dd78 ffffea018512e700 0000000000000003
Sep 13 11:06:17 db1082 kernel: [8880312.076068]  ffffea018512e800 ffffffff8116e6c9 ff00007fff400000 ff00007fffffffff
Sep 13 11:06:17 db1082 kernel: [8880312.076074] Call Trace:
Sep 13 11:06:17 db1082 kernel: [8880312.076080]  [<ffffffff812ee705>] ? dump_stack+0x5c/0x77
Sep 13 11:06:17 db1082 kernel: [8880312.076084]  [<ffffffff8116dd78>] ? bad_page.part.68+0xa8/0x100
Sep 13 11:06:17 db1082 kernel: [8880312.076086]  [<ffffffff8116e6c9>] ? free_pages_prepare+0x1f9/0x2f0
Sep 13 11:06:17 db1082 kernel: [8880312.076088]  [<ffffffff81170895>] ? __free_pages_ok+0x15/0xb0
Sep 13 11:06:17 db1082 kernel: [8880312.076090]  [<ffffffffa004c803>] ? sg_remove_scat.isra.13+0x73/0x130 [sg]
Sep 13 11:06:17 db1082 kernel: [8880312.076091]  [<ffffffffa004dbb5>] ? sg_remove_sfp_usercontext+0x65/0x120 [sg]
Sep 13 11:06:17 db1082 kernel: [8880312.076095]  [<ffffffff8109101d>] ? process_one_work+0x14d/0x410
Sep 13 11:06:17 db1082 kernel: [8880312.076097]  [<ffffffff81091a95>] ? worker_thread+0x65/0x460
Sep 13 11:06:17 db1082 kernel: [8880312.076099]  [<ffffffff81091a30>] ? rescuer_thread+0x310/0x310
Sep 13 11:06:17 db1082 kernel: [8880312.076101]  [<ffffffff81096cff>] ? kthread+0xdf/0x100
Sep 13 11:06:17 db1082 kernel: [8880312.076103]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880312.076108]  [<ffffffff8159435f>] ? ret_from_fork+0x3f/0x70
Sep 13 11:06:17 db1082 kernel: [8880312.076110]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880312.076112] BUG: Bad page state in process kworker/14:1  pfn:6144b9d
Sep 13 11:06:17 db1082 kernel: [8880312.106925] page:ffffea018512e740 count:1 mapcount:14 mapping:ffff886144b9d028 index:0xffff886144b9d000
Sep 13 11:06:17 db1082 kernel: [8880312.150905] flags: 0x5ffff8000000080(slab)
Sep 13 11:06:17 db1082 kernel: [8880312.170669] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Sep 13 11:06:17 db1082 kernel: [8880312.201346] bad because of flags:
Sep 13 11:06:17 db1082 kernel: [8880312.217515] flags: 0x80(slab)
Sep 13 11:06:17 db1082 kernel: [8880312.232030] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype xt_CT ip6table_raw ip6table_filter ip6_tables iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul jitterentropy_rng iTCO_wdt iTCO_vendor_support evdev sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr mgag200 ttm lpc_ich mfd_core drm_kms_helper sb_edac edac_core drm hpwdt i2c_i801 i2c_algo_bit hpilo ioatdma dca tpm_tis tpm 8250_fintek ipmi_si ipmi_msghandler pcc_cpufreq shpchp acpi_cpufreq processor wmi acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod sg crc32c_intel psmouse tg3 xhci_pci ptp ehci_pci hpsa uhci_hcd pps_core xhci_hcd ehci_hcd scsi_transport_sas libphy usbcore usb_common scsi_mod fjes
Sep 13 11:06:17 db1082 kernel: [8880312.232078] CPU: 14 PID: 1895 Comm: kworker/14:1 Tainted: G    B           4.4.0-1-amd64 #1 Debian 4.4.2-3+wmf2
Sep 13 11:06:17 db1082 kernel: [8880312.232079] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
Sep 13 11:06:17 db1082 kernel: [8880312.232083] Workqueue: events sg_remove_sfp_usercontext [sg]
Sep 13 11:06:17 db1082 kernel: [8880312.232085]  0000000000000286 00000000ecbe1cfe ffffffff812ee705 ffffea018512e740
Sep 13 11:06:17 db1082 kernel: [8880312.232086]  ffffffff818194d0 ffffffff8116dd78 ffffea018512e740 0000000000000004
Sep 13 11:06:17 db1082 kernel: [8880312.232087]  ffffea018512e800 ffffffff8116e6c9 ff00007fff400000 ff00007fffffffff
Sep 13 11:06:17 db1082 kernel: [8880312.232089] Call Trace:
Sep 13 11:06:17 db1082 kernel: [8880312.232096]  [<ffffffff812ee705>] ? dump_stack+0x5c/0x77
Sep 13 11:06:17 db1082 kernel: [8880312.232099]  [<ffffffff8116dd78>] ? bad_page.part.68+0xa8/0x100
Sep 13 11:06:17 db1082 kernel: [8880312.232101]  [<ffffffff8116e6c9>] ? free_pages_prepare+0x1f9/0x2f0
Sep 13 11:06:17 db1082 kernel: [8880312.232102]  [<ffffffff81170895>] ? __free_pages_ok+0x15/0xb0
Sep 13 11:06:17 db1082 kernel: [8880312.232103]  [<ffffffffa004c803>] ? sg_remove_scat.isra.13+0x73/0x130 [sg]
Sep 13 11:06:17 db1082 kernel: [8880312.232105]  [<ffffffffa004dbb5>] ? sg_remove_sfp_usercontext+0x65/0x120 [sg]
Sep 13 11:06:17 db1082 kernel: [8880312.232108]  [<ffffffff8109101d>] ? process_one_work+0x14d/0x410
Sep 13 11:06:17 db1082 kernel: [8880312.232110]  [<ffffffff81091a95>] ? worker_thread+0x65/0x460
Sep 13 11:06:17 db1082 kernel: [8880312.232111]  [<ffffffff81091a30>] ? rescuer_thread+0x310/0x310
Sep 13 11:06:17 db1082 kernel: [8880312.232113]  [<ffffffff81096cff>] ? kthread+0xdf/0x100
Sep 13 11:06:17 db1082 kernel: [8880312.232114]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880312.232120]  [<ffffffff8159435f>] ? ret_from_fork+0x3f/0x70
Sep 13 11:06:17 db1082 kernel: [8880312.232122]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
Sep 13 11:06:17 db1082 kernel: [8880312.232124] BUG: Bad page state in process kworker/14:1  pfn:6144b9e
Sep 13 11:06:17 db1082 kernel: [8880312.262270] page:ffffea018512e780 count:0 mapcount:-127 mapping:          (null) index:0xffff887f6556c500
Sep 13 11:06:18 db1082 kernel: [8880312.307073] flags: 0x5ffff8000000000()
Sep 13 11:06:18 db1082 kernel: [8880312.325107] page dumped because: nonzero mapcount
Sep 13 11:06:18 db1082 kernel: [8880312.347507] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 xt_pkttype xt_CT ip6table_raw ip6table_filter ip6_tables iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables binfmt_misc 8021q garp mrp stp llc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc xfs libcrc32c crc32c_generic intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul jitterentropy_rng iTCO_wdt iTCO_vendor_support evdev sha256_ssse3 sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw pcspkr mgag200 ttm lpc_ich mfd_core drm_kms_helper sb_edac edac_core drm hpwdt i2c_i801 i2c_algo_bit hpilo ioatdma dca tpm_tis tpm 8250_fintek ipmi_si ipmi_msghandler pcc_cpufreq shpchp acpi_cpufreq processor wmi acpi_power_meter button autofs4 ext4 crc16 mbcache jbd2 dm_mod sd_mod s
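
The log above repeats the same stack trace for each bad page, which makes it hard to see how many pages were affected and why. A minimal sketch, assuming the syslog lines are available as text, that pulls out only the process, page frame number and dump reason from each "BUG: Bad page state" report. The function names are hypothetical and not part of any existing tooling.

```python
import re
from collections import Counter

# Header line of each report; captures the process name and the page frame number.
BUG_RE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")
# Reason the kernel gives for dumping the page.
REASON_RE = re.compile(r"page dumped because: (.+)$")


def summarize_bad_pages(lines):
    """Return one (process, pfn, reason) tuple per 'Bad page state' report."""
    reports = []
    current = None
    for line in lines:
        bug = BUG_RE.search(line)
        if bug:
            # Start a new report; the reason arrives on a later line.
            current = [bug.group(1), bug.group(2), None]
            reports.append(current)
            continue
        reason = REASON_RE.search(line)
        if reason and current is not None and current[2] is None:
            current[2] = reason.group(1).strip()
    return [tuple(r) for r in reports]


def count_reasons(reports):
    """Count how often each dump reason appears."""
    return Counter(reason for _, _, reason in reports)
```

Run against the excerpt above, this would list consecutive pfn values starting at 6144b98, all freed from the same sg workqueue path.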

RAID looks good atm:

root@db1082:~# hpssacli controller all show config

Smart Array P840 in Slot 1                (sn: PDNNF0ARH1910I)


   Internal Drive Cage at Port 1I, Box 1, OK

   Internal Drive Cage at Port 1I, Box 1, OK

   Internal Drive Cage at Port 2I, Box 2, OK
   array A (Solid State SATA, Unused Space: 0  MB)


      logicaldrive 1 (3.6 TB, RAID 1+0, OK)

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:3 (port 1I:box 1:bay 3, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:4 (port 1I:box 1:bay 4, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:5 (port 1I:box 1:bay 5, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:6 (port 1I:box 1:bay 6, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:7 (port 1I:box 1:bay 7, Solid State SATA, 800 GB, OK)
      physicaldrive 1I:1:8 (port 1I:box 1:bay 8, Solid State SATA, 800 GB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, Solid State SATA, 800 GB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, Solid State SATA, 800 GB, OK)
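
The "RAID looks good" check above is done by eye. A minimal sketch of the same check in code, assuming the text of `hpssacli controller all show config` has been captured and that each `logicaldrive` / `physicaldrive` line ends with its status as the last comma-separated field inside the parentheses, as in the output above. The function name is hypothetical.

```python
import re

# Matches lines such as
#   "physicaldrive 1I:1:1 (port 1I:box 1:bay 1, Solid State SATA, 800 GB, OK)"
#   "logicaldrive 1 (3.6 TB, RAID 1+0, OK)"
DRIVE_RE = re.compile(r"^\s*(physicaldrive|logicaldrive)\s+(\S+)\s+\((.*)\)\s*$")


def drives_not_ok(config_text):
    """Return (kind, identifier, status) for every drive whose status is not 'OK'."""
    bad = []
    for line in config_text.splitlines():
        match = DRIVE_RE.match(line)
        if not match:
            continue
        kind, ident, fields = match.groups()
        # The status is assumed to be the last comma-separated field.
        status = fields.split(",")[-1].strip()
        if status != "OK":
            bad.append((kind, ident, status))
    return bad
```

For the output pasted above this returns an empty list, which matches the manual reading.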

We should check the controller's log to see whether it recorded anything.

Server was depooled here: https://gerrit.wikimedia.org/r/#/c/310335/

jcrespo added a subscriber: MoritzMuehlenhoff. · Edited · Sep 13 2016, 5:14 PM

I am adding Moritz, not expecting him to do anything here, but just as a heads-up in case he is aware of any recent kernel issue and we are behind on updates for this server.

Let's do proper debugging tomorrow. Normally we would just start it, but for now I will impose the restriction of not restarting any MySQL (or starting crashed servers) without being extra careful.

The stack trace sounds like a broken disk controller (or possibly broken RAM). I'd say let Chris do a hardware check.

Thank you @MoritzMuehlenhoff for your incredibly quick evaluation, I didn't even check the full stacktrace, you were really helpful. I will unsubscribe you so you do not suffer spam from the rest of the tasks.

@Marostegui While we wait, probably we can check the lifecycle hardware logs.

Unfortunately, the iLO isn't showing anything relevant hardware-wise between the crash and when we power cycled the server.

This is the first record from yesterday, which is basically when we connected to it:

/map1/log1/record68
  Targets
  Properties
    number=68
    severity=Informational
    date=09/13/2016
    time=15:50
    description=SSH login: root - xx.154.149(DNS name not found).
  Verbs
    cd version exit show

And then nothing relevant until the restart

/map1/log1/record73
  Targets
  Properties
    number=73
    severity=Informational
    date=09/13/2016
    time=16:01
    description=Virtual Serial Port stopped by: root - xx.154.149(DNS name not found).
  Verbs
    cd version exit show


/map1/log1/record74
  Targets
  Properties
    number=74
    severity=Caution
    date=09/13/2016
    time=16:03
    description=Host server reset by: root.
  Verbs
    cd version exit show

/map1/log1/record75
  Targets
  Properties
    number=75
    severity=Caution
    date=09/13/2016
    time=16:03
    description=Server reset.
  Verbs
    cd version exit show

/map1/log1/record76
  Targets
  Properties
    number=76
    severity=Informational
    date=09/13/2016
    time=16:03
    description=Server power restored.
  Verbs
    cd version exit show

/map1/log1/record77
  Targets
  Properties
    number=77
    severity=Informational
    date=09/13/2016
    time=16:03
    description=Embedded Flash/SD-CARD: Restarted.
  Verbs
    cd version exit show
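
A minimal sketch for filtering log records like the ones above, assuming they are available as text in the same `/map1/log1/recordNN` layout with `key=value` property lines. It keeps only records whose severity is not "Informational", which in the excerpt above are the two reset entries. The function names are hypothetical.

```python
def parse_ilo_records(text):
    """Parse '/map1/log1/recordNN' blocks into a list of property dicts."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/map1/log1/record"):
            # A new record starts at each path line.
            current = {"path": line}
            records.append(current)
        elif current is not None and "=" in line:
            # Split on the first '=' only; descriptions may contain more.
            key, _, value = line.partition("=")
            current[key] = value
    return records


def non_informational(records):
    """Keep only records with a severity other than 'Informational'."""
    return [r for r in records if r.get("severity") != "Informational"]
```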
Marostegui moved this task from Triage to In progress on the DBA board.
jcrespo triaged this task as High priority. · Sep 14 2016, 10:56 AM

I do not see anything with the server that we could pinpoint to a h/w issue.

After the memtest (no errors found), the server is back and catching up with the master.
Once it has caught up, we will pool it back and slowly give it some weight in the LB.

@jcrespo mentioned he didn't trust the server much, so I have been running different stress tests (CPU sys and user, memory, iowait, etc.) for the whole day to create some overload situations while MySQL was running and see what would happen...

The server has behaved fine and I didn't see any significant issue.
Replication lagged a bit (expected) but other than that it was fine.

I would suggest we get it in the LB on Monday with less weight and see how it behaves with production traffic.
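
To make "less weight" concrete: a minimal sketch, assuming the load balancer sends reads to replicas roughly in proportion to their configured weights (a simplification that ignores lag and connection limits). The host names other than db1082 and all the weight values are hypothetical, not taken from db-eqiad.php.

```python
def traffic_share(weights, host):
    """Expected fraction of reads sent to `host` under proportional weighting."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one host must have a positive weight")
    return weights[host] / total


# Hypothetical section with two healthy replicas and db1082 at reduced weight.
reduced = {"db1082": 100, "replica-a": 300, "replica-b": 400}
# The same section once db1082 is back at a hypothetical normal weight.
normal = {"db1082": 300, "replica-a": 300, "replica-b": 400}

reduced_share = traffic_share(reduced, "db1082")  # 100 / 800
normal_share = traffic_share(normal, "db1082")    # 300 / 1000
```

Under these made-up numbers the server would take 12.5% of reads at reduced weight and 30% at normal weight, so a problem under production traffic would affect fewer queries during the trial period.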

Check size of conntrack table
Notifications for this service have been disabled
WARNING 2016-09-17 03:19:29 1d 13h 44m 45s 3/3 WARNING: could not read sysctl settings

NTP
Notifications for this service have been disabled
CRITICAL 2016-09-17 03:21:24 2d 10h 45m 32s 20/20 NTP CRITICAL: Offset unknown

The sysctl settings error seems to be gone now, and I can actually read them:

root@db1082:/proc/sys/net# sysctl -a | wc -l

1702

The offset error looks weird:

root@db1082:/proc/sys/net# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 chromium.wikime .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 hydrogen.wikime .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 acamar.wikimedi .INIT.          16 u    - 1024    0    0.000    0.000   0.000
 achernar.wikime .INIT.          16 u    - 1024    0    0.000    0.000   0.000
root@db1082:/proc/sys/net# hwclock ; date
Mon 19 Sep 2016 06:37:20 AM UTC  -0.500580 seconds
Mon Sep 19 06:37:20 UTC 2016
root@db1082:/proc/sys/net# sudo /etc/init.d/ntp stop
[ ok ] Stopping ntp (via systemctl): ntp.service.
root@db1082:/proc/sys/net# ntpd -gq
ntpd: time slew +0.086190s
root@db1082:/proc/sys/net# sudo /etc/init.d/ntp start
[ ok ] Starting ntp (via systemctl): ntp.service.

This has not cleared the NTP error in Icinga though. Should we reboot the box and start from scratch to see in which state it comes back?
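
The `ntpq -p` output above shows every peer with refid `.INIT.` and reach 0, meaning no peer had answered a poll yet. A minimal sketch that detects that state from captured output, assuming the standard peers layout (a tally character in the first column, then ten whitespace-separated fields) and using the fact that `reach` is printed in octal. The function names are hypothetical.

```python
def parse_ntpq_peers(output):
    """Parse the peer rows of `ntpq -p` output into a list of dicts."""
    peers = []
    seen_separator = False
    for line in output.splitlines():
        if line.startswith("===="):
            seen_separator = True
            continue
        if not seen_separator or not line.strip():
            continue
        # Column 0 is the tally code (' ', '*', '+', '-', ...); this assumes
        # the output was not stripped of its leading column when captured.
        tally, fields = line[:1], line[1:].split()
        if len(fields) != 10:
            continue
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "refid": fields[1],
            "stratum": int(fields[2]),
            # reach is an octal bitmask of the last eight poll results.
            "reach": int(fields[6], 8),
            "offset_ms": float(fields[8]),
        })
    return peers


def all_peers_unreachable(peers):
    """True when there is at least one peer and none has ever been reached."""
    return bool(peers) and all(p["reach"] == 0 for p in peers)
```

For the output above, `all_peers_unreachable` would return True, which is consistent with Icinga reporting "Offset unknown" rather than a large offset.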

NTP is now cleared - might be worth a reboot and let's see if it comes back up all fine

In fact, I would reboot it several times to see if it happens again and try to understand why. The reason I got alarmed is that the last time we had one of those issues, it was an indication of board issues. It may be just a software issue, but that is why I wanted to be cautious.

That makes it clear, thanks for giving me context on past issues.
I will do that a few times, and by the end of the day I will give it one final reboot and leave it like that for a few days, just replicating.

Thanks again

Just one thing: rebooting would be a great way to test https://gerrit.wikimedia.org/r/#/c/310564/. In fact, I am going to test it on db1061 now, too.

Sounds good - I have rebooted it twice already and expect to do a few more before the end of the day.

Every time the server gets restarted, NTP alerts until I run the NTP sync manually.
I have rebooted again to see how it comes back and what happens if I do not touch it.

How long of a time frame are we talking about here? All servers have "Unknown offset" alerts for about 10-20 minutes after a reboot or a restart of the NTP service.

@Marostegui NTP being off for some minutes is "normal" (a known limitation with low priority). What was strange, and an issue, was it being off for hours or days.

@MoritzMuehlenhoff see above:

Check size of conntrack table
Notifications for this service have been disabled
WARNING 2016-09-17 03:19:29 1d 13h 44m 45s 3/3 WARNING: could not read sysctl settings
NTP
Notifications for this service have been disabled
CRITICAL 2016-09-17 03:21:24 2d 10h 45m 32s 20/20 NTP CRITICAL: Offset unknown

The offset has been there for the whole weekend until I sync'ed it manually a couple of hours ago.

I have done a final reboot and I am going to leave it untouched for a few hours to see how it copes.
MySQL is started and replicating normally.

Thanks for the input though!

This server has been running for 24h with no issues reported so far. I would like to pool it in tomorrow with some weight (not much) to see how it copes with production traffic.

Any thoughts on that?

Yeah, let's repool and check whether it happens again.

I noticed T141756, which could be related (since db1082 also has that hardware and the oops looked I/O controller-related).

Interesting...feel free to upgrade that firmware if you want. The box isn't pooled yet.

I will coordinate with @Cmjohnson to get this upgraded before we repool it.

@Cmjohnson let me know if you want to proceed with this upgrade sometime this week.
This server needs to be depooled first.

Change 314515 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restoring normal weight

https://gerrit.wikimedia.org/r/314515

Change 314515 merged by jenkins-bot:
db-eqiad.php: Restoring normal weight

https://gerrit.wikimedia.org/r/314515

For now I have restored its original weight until we agree on when we can upgrade it.
So far it has been behaving fine since it crashed around a month ago.

Change 314650 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1082

https://gerrit.wikimedia.org/r/314650

Change 314650 merged by jenkins-bot:
db-eqiad.php: Depool db1082

https://gerrit.wikimedia.org/r/314650

Mentioned in SAL (#wikimedia-operations) [2016-10-07T06:31:15Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1082 to get its raid controller firmware upgraded - T145533 (duration: 00m 49s)

Change 314673 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool db1082

https://gerrit.wikimedia.org/r/314673

Change 314673 merged by jenkins-bot:
db-eqiad.php: Repool db1082

https://gerrit.wikimedia.org/r/314673

Mentioned in SAL (#wikimedia-operations) [2016-10-07T11:17:05Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1082 with a bit less weight than usual to start with - T145533 (duration: 00m 55s)

Change 314680 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restore db1082 original weight

https://gerrit.wikimedia.org/r/314680

Change 314680 merged by jenkins-bot:
db-eqiad.php: Restore db1082 original weight

https://gerrit.wikimedia.org/r/314680

Mentioned in SAL (#wikimedia-operations) [2016-10-07T12:09:11Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1082 with its original weight - T145533 (duration: 00m 52s)

Change 315045 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1082

https://gerrit.wikimedia.org/r/315045

Change 315045 merged by jenkins-bot:
db-eqiad.php: Depool db1082

https://gerrit.wikimedia.org/r/315045

Mentioned in SAL (#wikimedia-operations) [2016-10-10T06:13:56Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1082 to upgrade its RAID controller firmware - T145533 (duration: 00m 50s)

Upgraded:

root@db1082:~# hpssacli controller slot=1 show | grep -i firmware
   Firmware Version: 4.02

I will slowly get this server back to the pool but I think this ticket can be closed.

Marostegui closed this task as Resolved. · Oct 10 2016, 7:42 AM

Change 315065 had a related patch set uploaded (by Marostegui):
db-eqiad: Repool db1082

https://gerrit.wikimedia.org/r/315065

Change 315065 merged by jenkins-bot:
db-eqiad: Repool db1082

https://gerrit.wikimedia.org/r/315065

Mentioned in SAL (#wikimedia-operations) [2016-10-10T10:42:35Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1082 with some small weight after its RAID controller firmware - T145533 (duration: 00m 50s)

Change 315071 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Increase weight db1082

https://gerrit.wikimedia.org/r/315071

Change 315071 merged by jenkins-bot:
db-eqiad.php: Increase weight db1082

https://gerrit.wikimedia.org/r/315071

Mentioned in SAL (#wikimedia-operations) [2016-10-10T11:33:16Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase weight for db1082 after its RAID controller firmware - T145533 (duration: 00m 49s)

Change 315080 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Restore db1082 original weight

https://gerrit.wikimedia.org/r/315080

Change 315080 merged by jenkins-bot:
db-eqiad.php: Restore db1082 original weight

https://gerrit.wikimedia.org/r/315080

Mentioned in SAL (#wikimedia-operations) [2016-10-10T12:44:47Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Restore original weight for db1082 after its RAID controller firmware - T145533 (duration: 00m 55s)

I have added the subtask for the latest crash of this server, so we can keep track of it, as it has crashed twice already.