Instability in Stretch mw-vagrant
Open, Normal, Public

Description

I'm seeing crashes every few minutes. While logged in during one crash, I saw:

Message from syslogd@mediawikivagrant at Jan  3 20:07:27 ...
 kernel:[ 4092.405027] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffc04b790f

Message from syslogd@mediawikivagrant at Jan  3 20:07:27 ...
 kernel:[ 4092.405027] 

Message from syslogd@mediawikivagrant at Jan  3 20:07:27 ...
 kernel:[ 4092.405486] Kernel Offset: 0x3d800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

After another crash, I found this final syslog message:

Jan 3 20:29:51 mediawikivagrant jobchron[1260]: 2018-01-03T20:2Jan 3 20:37:08 mediawikivagrant systemd[1]: Starting Flush Journal to Persistent Storage...

I can't find the kernel panics in any logs. I'll try destroying and recreating, and am happy to turn on core dumps or whatever might be helpful to diagnose.

Event Timeline

awight created this task. Jan 3 2018, 8:44 PM
awight triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Jan 3 2018, 8:44 PM
awight added a comment. Jan 3 2018, 9:12 PM

Destroying and recreating didn't solve the problem for me. Please suggest what debug info I can collect.

bd808 added a comment. Jan 4 2018, 12:30 AM

@awight do you know what hypervisor brand and version you are using? (e.g. VirtualBox, VMware Fusion, Hyper-V, LXC, etc.)

I've been running multiple Stretch VMs for a couple of weeks now and haven't seen any problems, so this may be environment specific. FWIW my testing has been with VirtualBox 5.1.22 and VBox native shares. I'm also using the "debian/contrib-stretch64 (virtualbox, 9.3.0)" base image and kernel 4.9.65-3.

awight added a comment. Jan 4 2018, 4:42 PM

I'm using VirtualBox 5.2.2r119230, Vagrant 2.0.1, same box and kernel as you. The machine comes with Guest Additions Version: 5.1.30_Debian r118389, which throws a version mismatch warning during boot.
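The mismatch warning can be reproduced from the two version strings quoted above. This is only a rough sketch: the helper names are made up for illustration, and comparing on major.minor is an assumption about what counts as a mismatch, not VirtualBox's own rule.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Pull (major, minor) out of strings like '5.2.2r119230'."""
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def additions_match_host(host_version: str, additions_version: str) -> bool:
    """Hypothetical check: treat equal major.minor as 'matching'."""
    return major_minor(host_version) == major_minor(additions_version)


# Version strings as reported in this comment.
host = "5.2.2r119230"
additions = "5.1.30_Debian r118389"
print(additions_match_host(host, additions))
```

With the versions reported here the host is on the 5.2 series and the Guest Additions are on 5.1, so the check reports a mismatch.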

My host environment is macOS ^-^

It's not clear which shared folder backend is in use, but I assume it's native because the host's /etc/exports is empty.
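Rather than inferring the backend from the host's /etc/exports, it can probably be read from inside the guest, since /proc/mounts lists the filesystem type of every mount. A minimal sketch, assuming the usual `device mountpoint fstype options ...` layout; the sample lines and mount points are illustrative and not taken from this VM:

```python
from typing import Dict

# Filesystem types that would indicate a shared-folder backend.
SHARE_TYPES = {"vboxsf", "nfs", "nfs4"}


def shared_folder_mounts(mounts_text: str) -> Dict[str, str]:
    """Map mount point -> filesystem type for shared-folder style mounts."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 1 is the mount point, field 2 the filesystem type.
        if len(fields) >= 3 and fields[2] in SHARE_TYPES:
            found[fields[1]] = fields[2]
    return found


# Illustrative sample only; on the guest, pass the contents of /proc/mounts.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
vagrant /vagrant vboxsf rw,nodev,relatime 0 0
"""
print(shared_folder_mounts(SAMPLE))
```

On the guest itself, calling `shared_folder_mounts(open("/proc/mounts").read())` should show `vboxsf` for native shares or `nfs`/`nfs4` for NFS.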

The crash seems to happen more frequently when I enable the ores, ores_service, and wikilabels roles. This could be simply due to using more processor time?

I still can't find anything in the logs, so I'm not sure how to debug. Maybe enabling core dumps to console?

awight added a comment. Edited Jan 18 2018, 3:48 PM

Okay, when I enable the wikilabels service (which is currently failing and trying to restart every 10s or so), I see a few of these in kern.log:

Jan 18 15:45:55 mediawikivagrant kernel: [ 248.055797] BUG: unable to handle kernel paging request at 000001f600000002
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.055820] IP: [<000001f600000002>] 0x1f600000002
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.055840] PGD 0
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.055848]
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.055857] Oops: 0010 [#1] SMP
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.055866] Modules linked in: vboxsf(O) cachefiles fscache vboxvideo(O) vboxguest(O) ttm ppdev drm_kms_helper crct10dif_pclmul crc32_pclmul evdev ghash_clmulni_intel parport_pc parport serio_raw pcspkr sg video ac drm button battery sunrpc ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd floppy psmouse ahci libahci libata scsi_mod i2c_piix4 e1000
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056046] CPU: 0 PID: 1335 Comm: python3 Tainted: G O 4.9.0-4-amd64 #1 Debian 4.9.65-3
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056064] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056080] task: ffff9d1836ff5080 task.stack: ffffb7e780704000
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056096] RIP: 0010:[<000001f600000002>] [<000001f600000002>] 0x1f600000002
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056121] RSP: 0018:ffffb7e780707b80 EFLAGS: 00010246
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056132] RAX: 0000000000000001 RBX: ffff9d185aa75b40 RCX: dead000000000200
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056146] RDX: 001041ed150a6032 RSI: ffffb7e780707c00 RDI: ffff9d185aa75b98
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056160] RBP: ffff9d185aa3d6c0 R08: ffff9d185aa752c0 R09: 0000000000000003
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056174] R10: 0000000000000000 R11: 00003fffc0000000 R12: ffff9d185aa3d760
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056188] R13: ffff9d185aa43690 R14: ffff9d185aa75be0 R15: ffff9d185aa75b98
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056206] FS: 00007f7d70677700(0000) GS:ffff9d185fc00000(0000) knlGS:0000000000000000
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056230] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056243] CR2: 000001f600000002 CR3: 00000000366d2000 CR4: 00000000000406f0
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056274] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056289] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056303] Stack:
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056308] 0000000400000014 005f04ad01000004 0000000000000002 0000000000000000
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056337] ffff9d185b7c2898 000000005b7c2840 00ffffff00000d68 ffffffffaa61a4b0
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056358] ffff9d185b7c2840 ffffb7e780707c00 ffffb7e780707c00 0000000000000000
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056377] Call Trace:
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056392] [<ffffffffaa61a4b0>] ? d_drop+0x30/0x30
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056403] [<ffffffffaa61afa6>] ? d_invalidate+0xb6/0x120
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056415] [<ffffffffaa60d5a5>] ? lookup_fast+0x175/0x2e0
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.056837] [<ffffffffaa60e534>] ? walk_component+0x44/0x320
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.057262] [<ffffffffaa60f262>] ? link_path_walk+0x1b2/0x650
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.057703] [<ffffffffaa60f806>] ? path_lookupat+0x86/0x120
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.058193] [<ffffffffc044783f>] ? sf_stat+0x5f/0x110 [vboxsf]
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.058675] [<ffffffffaa612231>] ? filename_lookup+0xb1/0x180
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.059062] [<ffffffffaa5fdffa>] ? __check_object_size+0xfa/0x1d8
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.059482] [<ffffffffaa757138>] ? strncpy_from_user+0x48/0x160
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.059862] [<ffffffffaa611e6a>] ? getname_flags+0x6a/0x1e0
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.060242] [<ffffffffaa606f29>] ? vfs_fstatat+0x59/0xb0
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.060644] [<ffffffffaa60747a>] ? SYSC_newstat+0x2a/0x60
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.061018] [<ffffffffaaa075fb>] ? system_call_fast_compare_end+0xc/0x9b
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.061343] Code: Bad RIP value.
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.061652] RIP [<000001f600000002>] 0x1f600000002
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.062126] RSP <ffffb7e780707b80>
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.062510] CR2: 000001f600000002
Jan 18 15:45:55 mediawikivagrant kernel: [ 248.063091] ---[ end trace abad489f94f7f160 ]---

The machine doesn't crash, but that sure looks like something to go to the doctor about.
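One way to narrow down a trace like this is to pull out the frames that the kernel tags with a module name, since untagged frames belong to the core kernel. The sketch below is a rough helper, not an established tool: the regex and function name are illustrative, and the embedded excerpt is three lines from the call trace above with the syslog prefix dropped.

```python
import re
from typing import List, Tuple

# Matches frames such as: [<ffffffffc044783f>] ? sf_stat+0x5f/0x110 [vboxsf]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\][ ]+(?:\?[ ]+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ ]+\[(\w+)\])?"
)


def module_frames(log_text: str) -> List[Tuple[str, str]]:
    """Return (symbol, module) for every frame tagged with a module."""
    frames = []
    for match in FRAME.finditer(log_text):
        symbol, module = match.group(1), match.group(2)
        if module:
            frames.append((symbol, module))
    return frames


# Excerpt from the call trace above.
TRACE = """\
[ 248.057703] [<ffffffffaa60f806>] ? path_lookupat+0x86/0x120
[ 248.058193] [<ffffffffc044783f>] ? sf_stat+0x5f/0x110 [vboxsf]
[ 248.058675] [<ffffffffaa612231>] ? filename_lookup+0xb1/0x180
"""
print(module_frames(TRACE))
```

In the full trace the only module-tagged frame is `sf_stat` in `vboxsf`, one of the modules marked `(O)` (out-of-tree) in the "Modules linked in" line, so the shared-folder driver looks like a reasonable place to start.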

bd808 added a comment. Jan 18 2018, 4:40 PM

That looks like a kernel module null-pointer bug. My kernel trace reading fu isn't strong enough to know what is causing it though.

awight removed a subscriber: awight. Mar 21 2019, 4:02 PM