
Kernel errors on ganeti1005-ganeti1008 under high I/O
Closed, Resolved · Public

Description

At 8:03 Icinga flagged errors for the instances running on ganeti1005 (and also for ganeti1005 itself). When the alert came in the host was in fact inaccessible, but while I was logging in via mgmt it recovered.

dmesg is full of errors like

[13206136.352034]  ehci_hcd lrw gf128mul tg3 ablk_helper ptp cryptd libata megaraid_sas pps_core usbcore libphy usb_common scsi_mod
[13206136.352043] CPU: 23 PID: 4183 Comm: gnt-node Tainted: G    B   W       4.9.0-0.bpo.3-amd64 #1 Debian 4.9.25-1~bpo8+3
[13206136.352044] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
[13206136.352045]  0000000000000000 ffffffffa6329be5 fffff0bf23dbc240 ffffffffa6a00df0
[13206136.352048]  ffffffffa618313c 0000000000000010 0000000000000009 fffff0bf23dbade0
[13206136.352050]  ffffffffa6186af3 ffff904500000001 ffff9045ae544c30 ffff904500000001
[13206136.352052] Call Trace:
[13206136.352060]  [<ffffffffa6329be5>] ? dump_stack+0x5c/0x77
[13206136.352062]  [<ffffffffa618313c>] ? bad_page+0xbc/0x120
[13206136.352065]  [<ffffffffa6186af3>] ? get_page_from_freelist+0x993/0xad0
[13206136.352067]  [<ffffffffa6187c27>] ? __alloc_pages_nodemask+0xf7/0x270
[13206136.352070]  [<ffffffffa61da7d0>] ? alloc_pages_vma+0xb0/0x240
[13206136.352074]  [<ffffffffa61f93ac>] ? mem_cgroup_commit_charge+0x7c/0xf0
[13206136.352076]  [<ffffffffa61b6d31>] ? handle_mm_fault+0x1441/0x1700
[13206136.352080]  [<ffffffffa605fe53>] ? __do_page_fault+0x253/0x510
[13206136.352085]  [<ffffffffa66069d8>] ? page_fault+0x28/0x30
[13206136.352086] BUG: Bad page state in process gnt-node  pfn:8f6f0a

Since this appeared out of the blue, my guess is a memory error. I'd say we take the host down per https://wikitech.wikimedia.org/wiki/Ganeti#Shutdown_a_node_for_a_prolonged_period_of_time and run a memory test.
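
For reference, memtester works by locking a region of RAM and repeatedly writing and verifying bit patterns. A toy Python sketch of the idea (not a substitute for the real tool, which tests physical pages via mlock and uses many more patterns):

```python
# Toy illustration of the pattern-write/read-verify approach used by
# memory testers such as memtester. A real tester mlock()s physical
# RAM and cycles through many patterns; this only checks a heap buffer.
def pattern_test(buf_size=1 << 16, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    buf = bytearray(buf_size)
    errors = 0
    for p in patterns:
        for i in range(len(buf)):   # write the pattern
            buf[i] = p
        for i in range(len(buf)):   # read it back and verify
            if buf[i] != p:
                errors += 1
    return errors

if __name__ == "__main__":
    # On healthy memory this reports 0 errors.
    print(pattern_test())
```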

Event Timeline


memtester has not triggered anything since Dec 3. It's already on loop 7, and given that we need the capacity to safely empty ganeti1006, I had to stop it to migrate VMs back to ganeti1008.

Mentioned in SAL (#wikimedia-operations) [2017-12-03T12:17:25Z] <akosiaris> empty ganeti1006, it had issues this morning per T181121

No, I think the Launchpad bug is unrelated: it seems an older kernel from the 3.2 days had a bug with the same fallout, but our current kernels are five years more recent. In our case I still think we're seeing a hardware issue; ganeti1001-ganeti1004 run the same software config with comparable load characteristics, and none of those are affected.

Pasting various findings from ganeti1006 to try to get a clearer picture:

Dec  3 08:01:32 ganeti1006 kernel: [318591.369562] qemu-system-x86: page allocation stalls for 13572ms, order:0, mode:0x34200c0(GFP_USER|__GFP_WRITE)
Dec  3 08:01:32 ganeti1006 kernel: [318591.369572] CPU: 25 PID: 22496 Comm: qemu-system-x86 Not tainted 4.9.0-0.bpo.4-amd64 #1 Debian 4.9.51-1~bpo8+1
Dec  3 08:01:32 ganeti1006 kernel: [318591.369573] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Dec  3 08:01:32 ganeti1006 kernel: [318591.369575]  0000000000000000 ffffffffa892b2e5 ffffffffa90027b8 ffffafc58682ba78
Dec  3 08:01:32 ganeti1006 kernel: [318591.369578]  ffffffffa8787c58 034200c000000001 ffffffffa90027b8 ffffafc58682ba18
Dec  3 08:01:32 ganeti1006 kernel: [318591.369581]  0000000000000010 ffffafc58682ba88 ffffafc58682ba38 601dab9254d44b36
Dec  3 08:01:32 ganeti1006 kernel: [318591.369583] Call Trace:
Dec  3 08:01:32 ganeti1006 kernel: [318591.369591]  [<ffffffffa892b2e5>] ? dump_stack+0x5c/0x77
Dec  3 08:01:32 ganeti1006 kernel: [318591.369595]  [<ffffffffa8787c58>] ? warn_alloc+0x148/0x170
Dec  3 08:01:32 ganeti1006 kernel: [318591.369598]  [<ffffffffa8788604>] ? __alloc_pages_slowpath+0x904/0xb60
Dec  3 08:01:32 ganeti1006 kernel: [318591.369600]  [<ffffffffa8788a68>] ? __alloc_pages_nodemask+0x208/0x270
Dec  3 08:01:32 ganeti1006 kernel: [318591.369603]  [<ffffffffa87d916a>] ? alloc_pages_current+0x8a/0x110
Dec  3 08:01:32 ganeti1006 kernel: [318591.369607]  [<ffffffffa877f3e6>] ? pagecache_get_page+0x126/0x2b0
Dec  3 08:01:32 ganeti1006 kernel: [318591.369610]  [<ffffffffa883f090>] ? I_BDEV+0x10/0x10
Dec  3 08:01:32 ganeti1006 kernel: [318591.369613]  [<ffffffffa877f58c>] ? grab_cache_page_write_begin+0x1c/0x40
Dec  3 08:01:32 ganeti1006 kernel: [318591.369615]  [<ffffffffa883d9e8>] ? block_write_begin+0x28/0xc0
Dec  3 08:01:32 ganeti1006 kernel: [318591.369617]  [<ffffffffa877ddfe>] ? generic_perform_write+0xce/0x1d0
Dec  3 08:01:32 ganeti1006 kernel: [318591.369622]  [<ffffffffa86f901e>] ? futex_wait+0x1de/0x260
Dec  3 08:01:32 ganeti1006 kernel: [318591.369624]  [<ffffffffa8780706>] ? __generic_file_write_iter+0x196/0x1e0
Dec  3 08:01:32 ganeti1006 kernel: [318591.369626]  [<ffffffffa883fec9>] ? blkdev_write_iter+0x89/0x100
Dec  3 08:01:32 ganeti1006 kernel: [318591.369630]  [<ffffffffa8805aa7>] ? rw_copy_check_uvector+0x67/0x130
Dec  3 08:01:32 ganeti1006 kernel: [318591.369632]  [<ffffffffa8803aa0>] ? do_iter_readv_writev+0xb0/0x130
Dec  3 08:01:32 ganeti1006 kernel: [318591.369634]  [<ffffffffa88046fc>] ? do_readv_writev+0x1ac/0x240
Dec  3 08:01:32 ganeti1006 kernel: [318591.369637]  [<ffffffffa8804b3c>] ? do_pwritev+0x8c/0xc0
Dec  3 08:01:32 ganeti1006 kernel: [318591.369640]  [<ffffffffa8c0893b>] ? system_call_fast_compare_end+0xc/0x9b

So 13 seconds for a page allocation. That's not good.

And it only gets worse

Dec  3 08:02:57 ganeti1006 kernel: [318661.544761] qemu-system-x86: page allocation stalls for 70064ms, order:0, mode:0x24000c0(GFP_KERNEL)

So 70 seconds now.

Followed by the OOM killer showing up:

Dec  3 08:02:57 ganeti1006 kernel: [318673.617970] Out of memory: Kill process 24714 (qemu-system-x86) score 85 or sacrifice child
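
The "score 85" above is the kernel's OOM badness score: roughly the process's memory footprint as a fraction of total memory, scaled to 0-1000 and biased by oom_score_adj. A simplified sketch of the heuristic (the real code lives in mm/oom_kill.c and differs in details; the figures below are illustrative, not measured on these hosts):

```python
def oom_badness(rss_pages, swap_pages, pagetable_pages,
                total_pages, oom_score_adj=0):
    """Simplified version of the kernel's oom_badness() heuristic:
    memory footprint scaled to 0..1000, biased by oom_score_adj."""
    points = rss_pages + swap_pages + pagetable_pages
    score = points * 1000 // total_pages + oom_score_adj
    return max(score, 0)

# Illustrative: a qemu guest with ~5.6GB resident on a 64GB host
# (4KB pages) lands around a score of 85, like the log line above.
print(oom_badness(rss_pages=1_400_000, swap_pages=0,
                  pagetable_pages=30_000, total_pages=16_777_216))
```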

Then DRBD wakes up (and this accounts for all of the increased network/disk I/O usage we've seen):

Dec  3 08:02:57 ganeti1006 kernel: [318676.171675] drbd resource12: meta connection shut down by peer.
Dec  3 08:02:57 ganeti1006 kernel: [318676.178433] drbd resource12: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
Dec  3 08:02:57 ganeti1006 kernel: [318676.178531] drbd resource12: ack_receiver terminated
Dec  3 08:02:57 ganeti1006 kernel: [318676.178533] drbd resource12: Terminating drbd_a_resource
Dec  3 08:02:57 ganeti1006 kernel: [318676.255451] drbd resource12: Connection closed
Dec  3 08:02:57 ganeti1006 kernel: [318676.255493] drbd resource12: conn( NetworkFailure -> Unconnected ) 
Dec  3 08:02:57 ganeti1006 kernel: [318676.255499] drbd resource12: receiver terminated
Dec  3 08:02:57 ganeti1006 kernel: [318676.255500] drbd resource12: Restarting receiver thread
Dec  3 08:02:57 ganeti1006 kernel: [318676.255501] drbd resource12: receiver (re)started
Dec  3 08:02:57 ganeti1006 kernel: [318676.255531] drbd resource12: conn( Unconnected -> WFConnection ) 
Dec  3 08:02:58 ganeti1006 kernel: [318676.787489] drbd resource12: Handshake successful: Agreed network protocol version 101
Dec  3 08:02:58 ganeti1006 kernel: [318676.787491] drbd resource12: Feature flags enabled on protocol level: 0x7 TRIM THIN_RESYNC WRITE_SAME.
Dec  3 08:02:58 ganeti1006 kernel: [318676.787905] drbd resource12: Peer authenticated using 16 bytes HMAC
Dec  3 08:02:58 ganeti1006 kernel: [318676.787968] drbd resource12: conn( WFConnection -> WFReportParams ) 
Dec  3 08:02:58 ganeti1006 kernel: [318676.787975] drbd resource12: Starting ack_recv thread (from drbd_r_resource [15709])

And then we see 100-second page allocation stalls a bit later:

Dec  3 08:02:59 ganeti1006 kernel: [318678.444596] qemu-system-x86: page allocation stalls for 100572ms, order:0, mode:0x24000c0(GFP_KERNEL)

Judging by the log lines, the stall duration is consistently increasing, although not monotonically (that is, the stall after this one was 90 seconds, the next 89, and then 101).
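
The stall durations can be pulled out of kern.log with a small script so the trend can be eyeballed or plotted; a sketch (the regex matches the "page allocation stalls for Nms" format shown above):

```python
import re

# Extract "page allocation stalls for <N>ms" durations (in seconds)
# from kern.log-style lines, in the order they appear.
STALL_RE = re.compile(r"page allocation stalls for (\d+)ms")

def stall_durations(lines):
    return [int(m.group(1)) / 1000.0
            for line in lines
            if (m := STALL_RE.search(line))]

sample = [
    "Dec  3 08:01:32 ganeti1006 kernel: [318591.369562] qemu-system-x86: "
    "page allocation stalls for 13572ms, order:0, mode:0x34200c0(GFP_USER)",
    "Dec  3 08:02:57 ganeti1006 kernel: [318661.544761] qemu-system-x86: "
    "page allocation stalls for 70064ms, order:0, mode:0x24000c0(GFP_KERNEL)",
]
print(stall_durations(sample))  # durations in seconds: [13.572, 70.064]
```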

Then we get

Dec  3 08:03:02 ganeti1006 kernel: [318680.732257] ------------[ cut here ]------------
Dec  3 08:03:02 ganeti1006 kernel: [318680.732270] WARNING: CPU: 6 PID: 25517 at /build/linux-HB2ADw/linux-4.9.51/arch/x86/kvm/mmu.c:614 mmu_spte_clear_track_bits+0x91/0xf0 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732271] Modules linked in: binfmt_misc nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_tcpudp xt_pkttype nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables x_tables 8021q garp mrp bridge stp llc intel_rapl sb_edac edac_core x86_pkg_temp_thermal mgag200 intel_powerclamp ttm coretemp drm_kms_helper kvm_intel drm kvm i2c_algo_bit iTCO_wdt iTCO_vendor_support mxm_wmi irqbypass crct10dif_pclmul lpc_ich crc32_pclmul mei_me evdev dcdbas ghash_clmulni_intel pcspkr shpchp mei mfd_core ipmi_si wmi button vhost_net tun vhost macvtap macvlan ipmi_devintf ipmi_msghandler drbd lru_cache libcrc32c crc32c_generic nf_conntrack autofs4 ext4 crc16 jbd2 fscrypto mbcache dm_mod sg sd_mod crc32c_intel ahci libahci ehci_pci aesni_intel aes_x86_64 glue_helper lrw tg3 gf128mul ehci_hcd
Dec  3 08:03:02 ganeti1006 kernel: [318680.732307]  ablk_helper ptp cryptd libata megaraid_sas pps_core usbcore libphy usb_common scsi_mod
Dec  3 08:03:02 ganeti1006 kernel: [318680.732313] CPU: 6 PID: 25517 Comm: qemu-system-x86 Tainted: G        W       4.9.0-0.bpo.4-amd64 #1 Debian 4.9.51-1~bpo8+1
Dec  3 08:03:02 ganeti1006 kernel: [318680.732314] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Dec  3 08:03:02 ganeti1006 kernel: [318680.732315]  0000000000000000 ffffffffa892b2e5 0000000000000000 0000000000000000
Dec  3 08:03:02 ganeti1006 kernel: [318680.732317]  ffffffffa8677944 00000002f3800ff7 00000000002f3800 ffff898900eac000
Dec  3 08:03:02 ganeti1006 kernel: [318680.732320]  ffff898900eaebf8 0000000000000000 0000000000000005 ffffffffc0768e11
Dec  3 08:03:02 ganeti1006 kernel: [318680.732323] Call Trace:
Dec  3 08:03:02 ganeti1006 kernel: [318680.732325]  [<ffffffffa892b2e5>] ? dump_stack+0x5c/0x77
Dec  3 08:03:02 ganeti1006 kernel: [318680.732327]  [<ffffffffa8677944>] ? __warn+0xc4/0xe0
Dec  3 08:03:02 ganeti1006 kernel: [318680.732341]  [<ffffffffc0768e11>] ? mmu_spte_clear_track_bits+0x91/0xf0 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732354]  [<ffffffffc076c507>] ? drop_spte+0x17/0x90 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732368]  [<ffffffffc076c913>] ? mmu_page_zap_pte+0x43/0xc0 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732381]  [<ffffffffc076d773>] ? kvm_mmu_prepare_zap_page+0x63/0x2b0 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732395]  [<ffffffffc0773b4f>] ? kvm_mmu_invalidate_zap_all_pages+0xcf/0x120 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732408]  [<ffffffffc07492f7>] ? kvm_mmu_notifier_release+0x27/0x40 [kvm]
Dec  3 08:03:02 ganeti1006 kernel: [318680.732410]  [<ffffffffa87dd2be>] ? __mmu_notifier_release+0x3e/0xc0
Dec  3 08:03:02 ganeti1006 kernel: [318680.732412]  [<ffffffffa87bd116>] ? exit_mmap+0x146/0x150
Dec  3 08:03:02 ganeti1006 kernel: [318680.732414]  [<ffffffffa86749bb>] ? mmput+0x5b/0x100
Dec  3 08:03:02 ganeti1006 kernel: [318680.732416]  [<ffffffffa867bf04>] ? do_exit+0x334/0xb50
Dec  3 08:03:02 ganeti1006 kernel: [318680.732419]  [<ffffffffa877ddbe>] ? generic_perform_write+0x8e/0x1d0
Dec  3 08:03:02 ganeti1006 kernel: [318680.732421]  [<ffffffffa867c799>] ? do_group_exit+0x39/0xb0
Dec  3 08:03:02 ganeti1006 kernel: [318680.732422]  [<ffffffffa86878c3>] ? get_signal+0x2c3/0x6c0
Dec  3 08:03:02 ganeti1006 kernel: [318680.732424]  [<ffffffffa8625656>] ? do_signal+0x36/0x700
Dec  3 08:03:02 ganeti1006 kernel: [318680.732427]  [<ffffffffa880374e>] ? new_sync_write+0xde/0x130
Dec  3 08:03:02 ganeti1006 kernel: [318680.732429]  [<ffffffffa8603682>] ? exit_to_usermode_loop+0x82/0xc0
Dec  3 08:03:02 ganeti1006 kernel: [318680.732431]  [<ffffffffa8603a87>] ? syscall_return_slowpath+0x97/0x110
Dec  3 08:03:02 ganeti1006 kernel: [318680.732432]  [<ffffffffa8c089c8>] ? system_call_fast_compare_end+0x99/0x9b
Dec  3 08:03:02 ganeti1006 kernel: [318680.732433] ---[ end trace ac2b24b1d4702250 ]---
Dec  3 08:03:02 ganeti1006 kernel: [318680.732434] ------------[ cut here ]------------

This pattern continues for about 4 seconds with thousands of log lines, and then we finally get the log we have already seen before:

Dec  3 08:04:04 ganeti1006 kernel: [318743.346616] BUG: Bad page state: 2944 messages suppressed
Dec  3 08:04:04 ganeti1006 kernel: [318743.352741] BUG: Bad page state in process qemu-system-x86  pfn:33a400
Dec  3 08:04:04 ganeti1006 kernel: [318743.360125] page:fffff47e4ce90000 count:0 mapcount:0 mapping:          (null) index:0x0
Dec  3 08:04:04 ganeti1006 kernel: [318743.369165] flags: 0x17fffc000000014(referenced|dirty)
Dec  3 08:04:04 ganeti1006 kernel: [318743.374994] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
Dec  3 08:04:04 ganeti1006 kernel: [318743.381987] bad because of flags: 0x14(referenced|dirty)

For the next 38 minutes the logs alternate between the above findings, with the OOM killer showing up a few times, and then everything stops.

memtester had repeatedly failed to uncover anything. After some googling I see that some people have had similar issues and used the Linux Test Project (https://github.com/linux-test-project/ltp). Seeing no end to this tunnel, I took the bait and compiled it on ganeti1006 (I guess I might have to reimage the poor box) and it is now running.

Mentioned in SAL (#wikimedia-operations) [2017-12-07T09:46:51Z] <akosiaris> silence ganeti1006 on icinga T181121

ganeti1008 went down a few minutes ago; a powercycle fixed it. The console was showing a ton of printks, but I didn't find anything useful.

A snippet:

** 914 printk messages dropped ** [859498.972882]  lrw** 343 printk messages dropped ** [859498.977226]  macvlan** 284 printk messages dropped ** [859498.978345] 89 ** 314 printk messages dropped ** [859498.981929]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859498.990748]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859498.999535]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.008325]  RSP <ffffb05a87317e78>
** 575 printk messages dropped ** [859499.017204] c6 ** 359 printk messages dropped ** [859499.021631]  RSP <ffffb05a87317e78>
** 533 printk messages dropped ** [859499.030217] R13: ffffb05a87317d10 R14: ffffb05a87317d30 R15: ffffb05a87317e78
** 933 printk messages dropped ** [859499.043123] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 935 printk messages dropped ** [859499.056036] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 935 printk messages dropped ** [859499.068986] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 623 printk messages dropped ** [859499.077591] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 935 printk messages dropped ** [859499.090525] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 935 printk messages dropped ** [859499.103354] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 935 printk messages dropped ** [859499.116163] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 625 printk messages dropped ** [859499.124707] R13: ffffb05a87317d10 R14: ffffb05a87317d30 R15: ffffb05a87317e78
** 933 printk messages dropped ** [859499.137522] RBP: ffffb05a87317e88 R08: 0000000000000000 R09: 000000000000002c
** 922 printk messages dropped ** [859499.146106]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.154883]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.163614]  RSP <ffffb05a87317e78>
** 311 printk messages dropped ** [859499.168010]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.176778]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.185603]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.194161]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.211319]  RSP <ffffb05a87317e78>
** 311 printk messages dropped ** [859499.215590]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.224158]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.232727]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.241307]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.249881]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.258431]  RSP <ffffb05a87317e78>
** 311 printk messages dropped ** [859499.262712]  RSP <ffffb05a87317e78>
** 623 printk messages dropped ** [859499.271266]  RSP <ffffb05a87317e78>
** 856 printk messages dropped ** [859499.284048]  c63740b19a4eef83
** 623 printk messages dropped ** [859499.292589]  c63740b19a4eef83
** 311 printk messages dropped ** [859499.296850]  c63740b19a4eef83
** 623 printk messages dropped ** [859499.305424]  c63740b19a4eef83
** 623 printk messages dropped ** [859499.313983]  c63740b19a4eef83
** 311 printk messages dropped ** [859499.318236]  c63740b19a4eef83
** 623 printk messages dropped ** [859499.326796]  c63740b19a4eef83
** 622 printk messages dropped ** [859499.335338]  ffffb05a87317e88** 311 printk messages dropped ** [859499.339626]  ffffb05a87317e88** 623 printk messages dropped ** [859499.348140]  ffffb05a87317e88** 623 printk messages dropped ** [859499.356697]  ffffb05a87317e88** 311 printk messages dropped ** [859499.360977]  ffffb05a87317e88** 623 printk messages dropped ** [859499.369543]  ffffb05a87317e88** 623 printk messages dropped ** [859499.378074]  ffffb05a87317e88** 311 printk messages dropped ** [859499.382366]  ffffb05a87317e88** 623 printk messages dropped ** [859499.390919]  ffffb05a87317e88** 606 printk messages dropped ** [859499.397627] 48 ** 314 printk messages dropped ** [859499.399484]  RSP <ffffb05a87317e78>
akosiaris added a comment. Edited Dec 11 2017, 12:46 PM

Found the following in kern.log

Dec 11 08:01:28 ganeti1008 kernel: [858993.252853] ------------[ cut here ]------------
Dec 11 08:01:28 ganeti1008 kernel: [858993.258095] kernel BUG at /build/linux-HB2ADw/linux-4.9.51/fs/ext4/inode.c:2050!
Dec 11 08:01:28 ganeti1008 kernel: [858993.266446] invalid opcode: 0000 [#1] SMP
Dec 11 08:01:28 ganeti1008 kernel: [858993.271015] Modules linked in: binfmt_misc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_tcpudp xt_pkttype nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter ip_tables x_tables 8021q garp mrp bridge stp llc intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt mxm_wmi iTCO_vendor_support evdev kvm dcdbas mgag200 ttm irqbypass crct10dif_pclmul drm_kms_helper crc32_pclmul ghash_clmulni_intel drm pcspkr i2c_algo_bit mei_me lpc_ich shpchp mei mfd_core ipmi_si wmi button vhost_net tun vhost macvtap macvlan ipmi_devintf ipmi_msghandler drbd lru_cache libcrc32c crc32c_generic nf_conntrack autofs4 ext4 crc16 jbd2 fscrypto mbcache dm_mod sg sd_mod crc32c_intel ehci_pci ahci aesni_intel ehci_hcd libahci aes_x86_64 glue_helper lrw tg3 gf128mul
Dec 11 08:01:28 ganeti1008 kernel: [858993.350870]  ablk_helper ptp cryptd libata pps_core megaraid_sas usbcore libphy usb_common scsi_mod
Dec 11 08:01:28 ganeti1008 kernel: [858993.359608] CPU: 29 PID: 224 Comm: kswapd1 Tainted: G    B   W       4.9.0-0.bpo.4-amd64 #1 Debian 4.9.51-1~bpo8+1
Dec 11 08:01:28 ganeti1008 kernel: [858993.371256] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Dec 11 08:01:28 ganeti1008 kernel: [858993.379705] task: ffff92c09b210140 task.stack: ffffb05a87314000
Dec 11 08:01:28 ganeti1008 kernel: [858993.386406] RIP: 0010:[<ffffffffc0615f2e>]  [<ffffffffc0615f2e>] ext4_writepage+0x2ee/0x5b0 [ext4]
Dec 11 08:01:28 ganeti1008 kernel: [858993.396533] RSP: 0018:ffffb05a873179c0  EFLAGS: 00010246
Dec 11 08:01:28 ganeti1008 kernel: [858993.402556] RAX: 057fffc000030009 RBX: 00000000000005d8 RCX: 0000000000000000
Dec 11 08:01:28 ganeti1008 kernel: [858993.410616] RDX: 0000000000000000 RSI: ffffb05a87317a38 RDI: fffff1753fc1fd80
Dec 11 08:01:28 ganeti1008 kernel: [858993.418673] RBP: ffff92b8983c7b08 R08: 0000000000020a88 R09: 0000000000000002
Dec 11 08:01:28 ganeti1008 kernel: [858993.426731] R10: ffff92c0bfffa000 R11: 0000000000000000 R12: ffffb05a87317a38
Dec 11 08:01:28 ganeti1008 kernel: [858993.434799] R13: ffffb05a87317c30 R14: 0000000000000001 R15: fffff1753fc1fd80
Dec 11 08:01:28 ganeti1008 kernel: [858993.442859] FS:  0000000000000000(0000) GS:ffff92c09f580000(0000) knlGS:0000000000000000
Dec 11 08:01:28 ganeti1008 kernel: [858993.451984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 11 08:01:28 ganeti1008 kernel: [858993.458492] CR2: 00007f3e92a52160 CR3: 0000001056407000 CR4: 00000000003426e0
Dec 11 08:01:28 ganeti1008 kernel: [858993.466552] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 11 08:01:28 ganeti1008 kernel: [858993.474611] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 11 08:01:28 ganeti1008 kernel: [858993.482671] Stack:
Dec 11 08:01:28 ganeti1008 kernel: [858993.485011]  ffffffffb53c434e 0000000000000000 ffffb05a873179cc ffffffffb53c26c0
Dec 11 08:01:28 ganeti1008 kernel: [858993.493403]  0000000000000000 0000000000000000 0000000000000246 c63740b19a4eef83
Dec 11 08:01:28 ganeti1008 kernel: [858993.501794]  fffff1753fc1fd80 ffff92b8983c7c80 ffff92b8983c7b08 ffffb05a87317c30
Dec 11 08:01:28 ganeti1008 kernel: [858993.510186] Call Trace:
Dec 11 08:01:28 ganeti1008 kernel: [858993.513014]  [<ffffffffb53c434e>] ? page_mkclean+0x6e/0xc0
Dec 11 08:01:28 ganeti1008 kernel: [858993.519236]  [<ffffffffb53c26c0>] ? __page_check_address+0x1b0/0x1b0
Dec 11 08:01:28 ganeti1008 kernel: [858993.526426]  [<ffffffffb5394834>] ? pageout.isra.41+0x184/0x2b0
Dec 11 08:01:28 ganeti1008 kernel: [858993.533130]  [<ffffffffb53966ac>] ? shrink_page_list+0x8fc/0x9f0
Dec 11 08:01:28 ganeti1008 kernel: [858993.539932]  [<ffffffffb5396f28>] ? shrink_inactive_list+0x218/0x520
Dec 11 08:01:28 ganeti1008 kernel: [858993.547121]  [<ffffffffb5397b73>] ? shrink_node_memcg+0x5e3/0x7c0
Dec 11 08:01:28 ganeti1008 kernel: [858993.554021]  [<ffffffffb52a2801>] ? migrate_swap_stop+0x111/0x1b0
Dec 11 08:01:28 ganeti1008 kernel: [858993.560910]  [<ffffffffb52a2801>] ? migrate_swap_stop+0x111/0x1b0
Dec 11 08:01:28 ganeti1008 kernel: [858993.567811]  [<ffffffffb5397e32>] ? shrink_node+0xe2/0x320
Dec 11 08:01:28 ganeti1008 kernel: [858993.574032]  [<ffffffffb5398e77>] ? kswapd+0x317/0x700
Dec 11 08:01:28 ganeti1008 kernel: [858993.579857]  [<ffffffffb5398b60>] ? mem_cgroup_shrink_node+0x180/0x180
Dec 11 08:01:28 ganeti1008 kernel: [858993.587241]  [<ffffffffb5297640>] ? kthread+0xf0/0x110
Dec 11 08:01:28 ganeti1008 kernel: [858993.593075]  [<ffffffffb52247d9>] ? __switch_to+0x2c9/0x720
Dec 11 08:01:28 ganeti1008 kernel: [858993.599393]  [<ffffffffb5297550>] ? kthread_park+0x60/0x60
Dec 11 08:01:28 ganeti1008 kernel: [858993.605612]  [<ffffffffb5808bb5>] ? ret_from_fork+0x25/0x30
Dec 11 08:01:28 ganeti1008 kernel: [858993.611925] Code: 00 00 81 fa 00 04 00 00 74 23 48 8b 95 68 ff ff ff 80 e6 40 74 07 a9 00 00 00 08 74 10 f6 c4 08 0f 85 a5 fd ff ff e8 35 9f 04 00 <0f> 0b 48 8b 85 68 ff ff ff f6 c4 08 0f 85 8e fd ff ff e9 75 fe 
Dec 11 08:01:28 ganeti1008 kernel: [858993.633721] RIP  [<ffffffffc0615f2e>] ext4_writepage+0x2ee/0x5b0 [ext4]
Dec 11 08:01:28 ganeti1008 kernel: [858993.641238]  RSP <ffffb05a873179c0>
Dec 11 08:01:28 ganeti1008 kernel: [858993.648986] ---[ end trace 70d1083fa2f74d44 ]---

From a first look, it looks unrelated to the other symptoms we are seeing.

On the other hand, on Saturday Dec 9 2017 we got the following

Dec  9 08:04:15 ganeti1008 kernel: [686354.706295] qemu-system-x86: page allocation stalls for 10880ms, order:0, mode:0x24000c0(GFP_KERNEL)
Dec  9 08:04:15 ganeti1008 kernel: [686354.706307] CPU: 26 PID: 22044 Comm: qemu-system-x86 Not tainted 4.9.0-0.bpo.4-amd64 #1 Debian 4.9.51-1~bpo8+1
Dec  9 08:04:15 ganeti1008 kernel: [686354.706308] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.4.2 01/09/2017
Dec  9 08:04:15 ganeti1008 kernel: [686354.706310]  0000000000000000 ffffffffb552b2e5 ffffffffb5c027b8 ffffb05a892438d0
Dec  9 08:04:15 ganeti1008 kernel: [686354.706314]  ffffffffb5387c58 024000c0e4f69c54 ffffffffb5c027b8 ffffb05a89243870
Dec  9 08:04:15 ganeti1008 kernel: [686354.706317]  0000000000000010 ffffb05a892438e0 ffffb05a89243890 0dbf38eae4f69c54
Dec  9 08:04:15 ganeti1008 kernel: [686354.706320] Call Trace:
Dec  9 08:04:15 ganeti1008 kernel: [686354.706331]  [<ffffffffb552b2e5>] ? dump_stack+0x5c/0x77
Dec  9 08:04:15 ganeti1008 kernel: [686354.706336]  [<ffffffffb5387c58>] ? warn_alloc+0x148/0x170
Dec  9 08:04:15 ganeti1008 kernel: [686354.706339]  [<ffffffffb5388604>] ? __alloc_pages_slowpath+0x904/0xb60
Dec  9 08:04:15 ganeti1008 kernel: [686354.706342]  [<ffffffffb5388a68>] ? __alloc_pages_nodemask+0x208/0x270
Dec  9 08:04:15 ganeti1008 kernel: [686354.706346]  [<ffffffffb53d916a>] ? alloc_pages_current+0x8a/0x110
Dec  9 08:04:15 ganeti1008 kernel: [686354.706348]  [<ffffffffb538403a>] ? __get_free_pages+0xa/0x30
Dec  9 08:04:15 ganeti1008 kernel: [686354.706353]  [<ffffffffb5419525>] ? __pollwait+0x95/0xf0
Dec  9 08:04:15 ganeti1008 kernel: [686354.706359]  [<ffffffffb57a65c5>] ? unix_poll+0x25/0xb0
Dec  9 08:04:15 ganeti1008 kernel: [686354.706363]  [<ffffffffb56e52e1>] ? sock_poll+0x41/0x90
Dec  9 08:04:15 ganeti1008 kernel: [686354.706366]  [<ffffffffb541ab66>] ? do_sys_poll+0x326/0x560
Dec  9 08:04:15 ganeti1008 kernel: [686354.706371]  [<ffffffffb52add91>] ? check_preempt_wakeup+0x171/0x1f0
Dec  9 08:04:15 ganeti1008 kernel: [686354.706376]  [<ffffffffb52a28f4>] ? try_to_wake_up+0x54/0x380
Dec  9 08:04:15 ganeti1008 kernel: [686354.706379]  [<ffffffffb56ecb4e>] ? __kmalloc_reserve.isra.34+0x2e/0x80
Dec  9 08:04:15 ganeti1008 kernel: [686354.706383]  [<ffffffffb52bbb88>] ? swake_up_locked+0x28/0x40
Dec  9 08:04:15 ganeti1008 kernel: [686354.706385]  [<ffffffffb52bbbc7>] ? swake_up+0x27/0x40
Dec  9 08:04:15 ganeti1008 kernel: [686354.706411]  [<ffffffffc082bbc3>] ? kvm_vcpu_kick+0x63/0x80 [kvm]
Dec  9 08:04:15 ganeti1008 kernel: [686354.706413]  [<ffffffffb5419490>] ? poll_initwait+0x40/0x40
Dec  9 08:04:15 ganeti1008 kernel: [686354.706415]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706417]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706418]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706420]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706422]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706424]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706426]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706428]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706430]  [<ffffffffb5419790>] ? poll_select_copy_remaining+0x150/0x150
Dec  9 08:04:15 ganeti1008 kernel: [686354.706432]  [<ffffffffb541b0bf>] ? SyS_ppoll+0x17f/0x1b0
Dec  9 08:04:15 ganeti1008 kernel: [686354.706435]  [<ffffffffb540430e>] ? vfs_read+0x8e/0x130
Dec  9 08:04:15 ganeti1008 kernel: [686354.706437]  [<ffffffffb54057d6>] ? SyS_read+0x76/0xc0
Dec  9 08:04:15 ganeti1008 kernel: [686354.706441]  [<ffffffffb580893b>] ? system_call_fast_compare_end+0xc/0x9b

60 logged page allocation stalls in 4 minutes, reaching up to 230 seconds.

Looking at Grafana at the time of the page allocation stall incident (Dec 9), the box had ~30GB of memory used and ~34GB in buffers (the block-level equivalent of the fs page cache), which is quite normal. Given that buffers are essentially memory ready to be reclaimed at any point in time, this does not look like memory starvation (swap was barely at 52MB at the time of the incident and never changed a bit).
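
The "buffers are reclaimable" reasoning can be sanity-checked straight from /proc/meminfo; a rough sketch, with illustrative figures standing in for the real Dec 9 numbers:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Key:  value kB' lines into a dict of kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key] = int(parts[0])
    return info

def effectively_free_kb(info):
    # Free pages plus buffers and page cache, which the kernel can
    # reclaim under pressure; a rough lower bound on available memory.
    return info["MemFree"] + info["Buffers"] + info["Cached"]

# Illustrative figures in kB, roughly matching the Dec 9 incident.
sample = ("MemTotal: 65831032 kB\nMemFree: 1048576 kB\n"
          "Buffers: 35651584 kB\nCached: 2097152 kB")
info = parse_meminfo(sample)
print(effectively_free_kb(info) // 1024, "MB free or reclaimable")
```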

ltpstress finished on ganeti1006. It did not trigger any problem.

akosiaris added a comment. Edited Dec 12 2017, 4:14 PM

ganeti1007 exhibited the issue as well a few hours ago. This went largely unnoticed, as Icinga did not emit any alerts. This time the event lasted way longer:

akosiaris@ganeti1007:~$ sudo grep 'page alloc' /var/log/kern.log | head
Dec 12 10:40:35 ganeti1007 kernel: [15016140.945501] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:40:54 ganeti1007 kernel: [15016159.435410] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:40:54 ganeti1007 kernel: [15016159.447539] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:04 ganeti1007 kernel: [15016289.505309] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:42 ganeti1007 kernel: [15016328.245059] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:42 ganeti1007 kernel: [15016328.262679] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:42 ganeti1007 kernel: [15016328.273997] qemu-system-x86: page allocation stalls for 10008ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:43 ganeti1007 kernel: [15016328.685685] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:43 ganeti1007 kernel: [15016328.736926] qemu-system-x86: page allocation stalls for 10008ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 10:43:43 ganeti1007 kernel: [15016328.820148] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

and

akosiaris@ganeti1007:~$ sudo grep 'page alloc' /var/log/kern.log | wc -l
1403
akosiaris@ganeti1007:~$ sudo grep 'page alloc' /var/log/kern.log | tail
Dec 12 15:56:23 ganeti1007 kernel: [15035090.514273] qemu-system-x86: page allocation stalls for 10004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:26 ganeti1007 kernel: [15035093.793371] qemu-system-x86: page allocation stalls for 20004ms, order:2
Dec 12 15:56:26 ganeti1007 kernel: [15035093.793380] page allocation stalls for 20008ms, order:2<4>[15035093.793383] qemu-system-x86: 
Dec 12 15:56:26 ganeti1007 kernel: [15035093.793386] page allocation stalls for 20004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:26 ganeti1007 kernel: page allocation stalls for 20008ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:26 ganeti1007 kernel: [15035093.805441] qemu-system-x86: page allocation stalls for 20012ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:26 ganeti1007 kernel: [15035093.834310] qemu-system-x86: page allocation stalls for 20008ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:33 ganeti1007 kernel: [15035100.520095] qemu-system-x86: page allocation stalls for 20008ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:36 ganeti1007 kernel: [15035103.798601] qemu-system-x86: page allocation stalls for 30004ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Dec 12 15:56:36 ganeti1007 kernel: [15035103.834970] qemu-system-x86: page allocation stalls for 30008ms, order:2, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

The final part was the OOM killer killing the boron VM.

The window for the event is visible at https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=ganeti1007&var-datasource=eqiad%20prometheus%2Fops&from=1513074742080&to=1513081122008

Cmjohnson added a comment. Edited Dec 12 2017, 5:36 PM

I updated the Bios Version to 2.6

It happened on ganeti1008 just a few minutes ago:

https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=ganeti1008&var-network=eth0&from=1513150273848&to=1513153873849

Note the preceding spike in network traffic and IOPS; not sure if that is relevant.

Had it not recovered, would gnt-node migrate -f ganeti1008.eqiad.wmnet have been OK? Or is it better to wait, since it seems intermittent?

I did it anyway; the server kept throwing scary messages not only to dmesg but also to my regular shell.

Mentioned in SAL (#wikimedia-operations) [2017-12-13T11:41:41Z] <akosiaris> empty ganeti1006 for reimage after ltpstress and bios upgrades T181121

Mentioned in SAL (#wikimedia-operations) [2017-12-13T14:03:54Z] <akosiaris> gnt-node remove ganeti1006 T181121

Happened again on ganeti1005.

I updated the Bios Version to 2.6

Which host had the BIOS updated? All of 1005-1008 or just one of them?

I updated the Bios Version to 2.6

Which host had the BIOS updated? All of 1005-1008 or just one of them?

Just ganeti1006, as a test. It is still pending a reimage and needs to be put back in service with a ton of VMs on it in order to see if we can reproduce the issue. However, we are blocked on T182702.

@akosiaris Manuel and Luca were able to reimage, not sure whether in a hacky way or with a good or bad workaround, but they managed. I think, however, there is no proper upstream fix yet, at least none deployed by us.

I think however, there is no proper upstream fix yet, at least deployed by us.

Yup, that's what I am waiting for. If we don't have something soon, I'll take a look at what Manuel and Luca had to do.

Volans added a subscriber: Volans. Dec 18 2017, 8:10 AM

Powercycled ganeti1005, unable to ssh, console unresponsive.

@akosiaris reimages should be unblocked, see T182702#3844595

Script wmf-auto-reimage was launched by akosiaris on neodymium.eqiad.wmnet for hosts:

ganeti1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201712191857_akosiaris_12098_ganeti1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1006.eqiad.wmnet']

Of which those FAILED:

['ganeti1006.eqiad.wmnet']

Script wmf-auto-reimage was launched by akosiaris on neodymium.eqiad.wmnet for hosts:

ganeti1006.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201712191900_akosiaris_12687_ganeti1006_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['ganeti1006.eqiad.wmnet']

Of which those FAILED:

['ganeti1006.eqiad.wmnet']

@akosiaris if you're trying to reimage those as Jessie, we still have the netinst issue open, so you need to set numa=off to unblock it, see T182702.

Happened again on ganeti1005. This time the box froze hard, no kernel/syslog logs and also nothing over mgmt (which again points towards a hardware error IMO).

@akosiaris if you're trying to reimage those as Jessie, we still have the netinst issue open, so you need to set numa=off to unblock it, see T182702.

It was just stuck in d-i waiting for an Enter keypress. It's proceeding now. I've collected d-i's syslog to look at later and figure out why.

Mentioned in SAL (#wikimedia-operations) [2017-12-20T11:27:32Z] <akosiaris> repool ganeti1006, rebalance row_A ganeti nodegroup. T181121

MoritzMuehlenhoff renamed this task from Possible memory errors on ganeti1005, ganeti1006, ganeti1008 to Hardware errors on ganeti1005- ganeti1008. Dec 20 2017, 11:28 AM

ganeti1006 has been added back to the group, and cluster has been rebalanced. There are a few extra VMs on ganeti1006 in order to see if this will trigger the issue.

I guess we are in a waiting state now.

Happened again today, unfortunately on ganeti1006, the host with the upgraded BIOS. Host stats in Grafana at the time show the usual pattern [1]. According to kern.log the event started at Dec 24 08:06:21 and ended at Dec 24 08:08:44, i.e. about 2 minutes. The system recovered but remained in a weird state for the day, spewing "bad page state" BUG lines in kern.log; it continued spewing the same on Dec 25 and finally gave in today (Dec 26), requiring a powercycle. The Dec 26 BUG lines referred to ext4, while the Dec 25 ones referred to bad page state. The erratic behavior is easily explained by problematic memory, which is consistent with our theory so far. The page allocation stalls on Dec 24 ranged from 14s to 142s.

[1] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=ganeti1006&var-datasource=eqiad%20prometheus%2Fops&from=1514100315472&to=1514106049450

@MoritzMuehlenhoff While I still believe this is hardware-related, I'd like to rule out some change in 4.9. Any objections to downgrading to, say, 4.4.0-3 on one of the boxes?

Mentioned in SAL (#wikimedia-operations) [2017-12-26T08:34:47Z] <akosiaris> migrate logstash1007 off ganeti1006 T181121

@MoritzMuehlenhoff While I still believe this is hardware-related, I'd like to rule out some change in 4.9. Any objections to downgrading to, say, 4.4.0-3 on one of the boxes?

Ack, let's do that. It might also be the case that those errors are triggered by new kernels on the virtualised hosts, so if it still occurs with the virt node running 4.4, another shot might be to migrate the instance kernels on one node to 4.4.

Mentioned in SAL (#wikimedia-operations) [2018-01-02T12:21:43Z] <akosiaris> empty ganeti1008 for kernel downgrade. T181121

Mentioned in SAL (#wikimedia-operations) [2018-01-02T15:16:50Z] <akosiaris> boot ganeti1008 with older 4.4 kernel and migrate multiple VMs to it. T181121

Cmjohnson moved this task from Being worked on to Blocked on the ops-eqiad board. Jan 2 2018, 5:12 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-03T11:47:11Z] <akosiaris> boot ganeti1006. It exhibited page allocation stalls on Jan 1. T181121

During some disk IO in T186020, ganeti1005 exhibited the usual symptoms. This hadn't been triggered for ~1 month, so maybe heavy IO gives us a reproduction scenario?

Anyway, I'll empty ganeti1005 and reboot it.

Happened again on ganeti1005, similar errors, but this time triggered by a copy of the Archiva data.

Mentioned in SAL (#wikimedia-operations) [2018-02-02T14:53:04Z] <akosiaris> reboot ganeti1005 after emptying it. T181121

MoritzMuehlenhoff renamed this task from Hardware errors on ganeti1005- ganeti1008 to Kernels errors on ganeti1005- ganeti1008 under high I/O. Feb 8 2018, 11:01 AM

Looks like we have a repro case, namely an rsync without bandwidth limits triggers page allocation stalls / crash. Since vm.swappiness is set to 0 across the fleet, and the meaning of 0 (vs 1-100) changed significantly in Linux 3.5+, I think it is worth another test with swappiness at 1 to see whether and how that changes behaviour.

Looks like we have a repro case, namely an rsync without bandwidth limits triggers page allocation stalls / crash. Since vm.swappiness is set to 0 across the fleet, and the meaning of 0 (vs 1-100) changed significantly in Linux 3.5+, I think it is worth another test with swappiness at 1 to see whether and how that changes behaviour.

Agreed. Let me empty, reboot and fill ganeti1005 and let's test that.
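For context, a minimal sketch of inspecting and flipping the knob for the test; the commands are illustrative, and a fleet-wide change would normally go through our configuration management rather than being set by hand:

```shell
# Current value; 0 on our fleet. On Linux 3.5+ kernels, 0 means
# "do not swap anonymous pages until close to OOM", while 1 allows
# a minimal amount of swapping without disabling it outright.
cat /proc/sys/vm/swappiness

# Temporary, runtime-only change for the duration of the test:
sudo sysctl -w vm.swappiness=1
```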

Mentioned in SAL (#wikimedia-operations) [2018-02-08T11:12:20Z] <akosiaris> migrate all running VMs off ganeti1005 T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-08T11:33:23Z] <akosiaris> reboot ganeti1005 T181121

herron added a subscriber: herron.Feb 8 2018, 2:42 PM

Steps to repro on meitnerium:

fsck /dev/vdb1
mount /dev/vdb1 /mnt/archiva
rm -rf /mnt/archiva/*
rsync -av /var/lib/archiva /mnt/archiva --progress

Mentioned in SAL (#wikimedia-operations) [2018-02-08T14:49:08Z] <akosiaris> migrate all running VMs off ganeti1005 T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-08T16:18:26Z] <akosiaris> depool sca1004 (zotero) for T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-08T16:20:42Z] <akosiaris> reboot ganeti1005 T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-08T17:21:37Z] <akosiaris> repool sca1004 (zotero) for T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-14T14:21:08Z] <akosiaris> reboot ganeti1008 for kernel upgrade T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-16T08:14:17Z] <akosiaris> powercycle ganeti1006 T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-16T08:48:16Z] <akosiaris> doing IO stress tests on ganeti1005. T181121

First successful and intended reproduction! After ~2 hours we got:

Feb 16 10:29:06 ganeti1005 kernel: [669669.675614] qemu-system-x86: page allocation stalls for 13604ms, order:0, mode:0x34200c0(GFP_USER|__GFP_WRITE)

The reproduction case was quite interesting.

I created a new 100GB disk on sca1004, formatted and mounted it, and ran the following beautiful one-liner under the mountpoint:

while true; do rm 2/* ; echo 1 >/proc/sys/vm/drop_caches ; rsync -a 1/ 2/ ; done

Namely, infinitely rsync a large file (30G) from one directory into another, delete it, and start over. Dropping the FS caches ensures we re-read the entire file from disk every time; this is probably not required, it just makes the problem show up faster.
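For anyone wanting to try the pattern safely, here is a bounded, self-contained variant: small files in place of the ~30G one, three iterations in place of an infinite loop, cp -a standing in for rsync -a, and the drop_caches step commented out because it needs root.

```shell
dir=$(mktemp -d)
mkdir -p "$dir/1" "$dir/2"
# Stand-in for the 30G file: 4MB of zeroes.
dd if=/dev/zero of="$dir/1/big" bs=1M count=4 2>/dev/null
for i in 1 2 3; do
    rm -f "$dir/2/"*                      # delete the previous copy
    # echo 1 > /proc/sys/vm/drop_caches   # root-only: forces re-reads from disk
    cp -a "$dir/1/." "$dir/2/"            # the real test used: rsync -a 1/ 2/
done
cmp -s "$dir/1/big" "$dir/2/big" && result="copy ok" || result="copy differs"
echo "$result"
rm -rf "$dir"
```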

Mentioned in SAL (#wikimedia-operations) [2018-02-16T14:15:24Z] <akosiaris> doing more IO stress tests on ganeti1005. T181121. Seems like we can reproduce

Mentioned in SAL (#wikimedia-operations) [2018-02-16T15:14:08Z] <akosiaris> poweroff sca1004, switch from DRBD to plain disk template T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-16T15:16:39Z] <akosiaris> run T181121#3978654 oneliner once more on sca1004, this time the VM has no DRBD

A lockup happened just now on ganeti1006, though I couldn't find any kernel messages either in the host log or syslog to confirm/deny we're seeing the same issue.

Extended testing over 2 days on a VM without a DRBD backing store on ganeti1005 caused no issues. Moving on to testing a DRBD disk directly on the host.

Happened again on ganeti1007, again with page allocation errors.

Mentioned in SAL (#wikimedia-operations) [2018-02-26T17:32:23Z] <akosiaris> shutdown sca1004 on ganeti1005 for T181121

Mentioned in SAL (#wikimedia-operations) [2018-02-27T10:07:58Z] <akosiaris> poweroff sca1004 for T181121 tests

The following have so far failed to reproduce the issue:

  • Heavy IO on a plain volume on a ganeti host (thankfully)
  • Heavy IO on a DRBD backed device on ganeti host
  • Heavy IO on a plain volume on a VM

The following reproduces it successfully:

  • Heavy IO on a DRBD backed device in a VM.

The standard way to generate the heavy IO is to create a large file in a directory 1/ and then run the following:

while true; do rm 2/* ; echo 1 >/proc/sys/vm/drop_caches ; rsync -a 1/ 2/ ; done

FWIW, I checked DRBD commits between 4.9 and current Linux git, but nothing obvious stood out (it could of course be related to all kinds of side effects from changes in KVM or general memory management).

Mentioned in SAL (#wikimedia-operations) [2018-03-03T14:16:48Z] <akosiaris> 13:56:20 ema: powercycle ganeti1005 T181121

chasemp added a comment. Edited Mar 6 2018, 8:13 PM

The following reproduces successfully

  • Heavy IO on a DRBD backed device in a VM.

Is this on 4.4 or only 4.9?

I am wondering if there is a relation to T169290, where we also use DRBD and had issues moving from 4.4 to 4.9 (i.e. IO melted everything).

Is this on 4.4 or only 4.9?

We've seen this on both 4.4 and 4.9.

The following reproduces successfully

  • Heavy IO on a DRBD backed device in a VM.

Is this on 4.4 or only 4.9?

I am wondering if there is a relation to T169290, where we also use DRBD and had issues moving from 4.4 to 4.9 (i.e. IO melted everything).

Both. We had a 4.4 host (per T181121#3867292 and T181121#3971852) from Jan 2 to Feb 14 and we eventually did see the same issue.

Mentioned in SAL (#wikimedia-operations) [2018-03-08T10:14:05Z] <akosiaris> conduct IO stresstests on ganeti1005 (sca1004 VM) with cache=none KVM flag on T181121

cache=none tests over the weekend showed no problems. I'll find a quiet time during the day and restart all VMs in the cluster with that setting applied. Then we wait for a while.

Mentioned in SAL (#wikimedia-operations) [2018-03-13T12:32:43Z] <akosiaris> reboot ganeti VMs on row_A in eqiad for cache=none setting. T181121

All row_A eqiad VMs have been rebooted with cache=none. We are now again in a waiting period.
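For reference, a hedged sketch of how the disk cache mode can be applied via Ganeti's KVM hypervisor parameters; the instance name below is a placeholder, and the actual rollout here went through the cluster configuration followed by the VM reboots logged above:

```shell
# Cluster-wide default for the KVM hypervisor:
gnt-cluster modify -H kvm:disk_cache=none

# Per-instance override; takes effect on the next full VM restart:
gnt-instance modify -H disk_cache=none example-vm.eqiad.wmnet
gnt-instance reboot example-vm.eqiad.wmnet
```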

akosiaris changed the task status from Open to Stalled. Mar 26 2018, 4:43 PM

~2 weeks with no incident yet. That's very encouraging, but we've been in this position before, around the New Year holidays. Given that the Easter holidays are approaching, I am reluctant to make any changes, so I think this should stay in the waiting state for another ~2 weeks.

akosiaris changed the task status from Stalled to Open. Apr 19 2018, 7:16 AM

One month with no incident. I'll proceed with rebooting all ganeti VMs on row_C and then move on to codfw.

Mentioned in SAL (#wikimedia-operations) [2018-04-19T07:24:16Z] <akosiaris> reboot ganeti VMs on row_A in eqiad for cache=none setting. T181121

Mentioned in SAL (#wikimedia-operations) [2018-04-19T07:47:30Z] <akosiaris> set cache=none for ganeti VMs in codfw cluster configuration. VM reboots to follow T181121

Mentioned in SAL (#wikimedia-operations) [2018-04-19T09:17:50Z] <akosiaris> reboot puppetdb1001 for cache=none setting apply. T181121

Mentioned in SAL (#wikimedia-operations) [2018-04-19T09:55:08Z] <akosiaris> reboot ganeti VMs on row_B in codfw for cache=none setting. T181121

Mentioned in SAL (#wikimedia-operations) [2018-04-19T11:14:29Z] <akosiaris@tin> Synchronized wmf-config/ProductionServices.php: T181121 (duration: 01m 16s)

akosiaris closed this task as Resolved. Apr 19 2018, 12:14 PM

All VMs have been migrated to cache=none. I'll resolve this task; hopefully we will not see this issue again.