swift backend machines every now and then experience a lockup in xfs code that
makes them unable to process traffic and the load average skyrockets. a sample
backtrace can be found below.
next steps:
- add a nagios check for load average, so we are at least aware this is going
on
- report the bug upstream (I wasn't able to find a similar bug report) if
anything to know if that might be fixed by recent (e.g. trusty) kernel versions
sample trace:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.668408] BUG: soft lockup - CPU#18
stuck for 22s! [xfsaild/sdn3:1195]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676069] Modules linked in: 8021q
garp bridge stp xfs dcdbas sb_edac edac_core shpchp joydev mei(C)
acpi_power_meter wmi mac_hid lp parport usbhid hid raid10 raid456 async_pq
async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0
megaraid_sas tg3 multipath linear
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676106] CPU 18
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676107] Modules linked in: 8021q
garp bridge stp xfs dcdbas sb_edac edac_core shpchp joydev mei(C)
acpi_power_meter wmi mac_hid lp parport usbhid hid raid10 raid456 async_pq
async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0
megaraid_sas tg3 multipath linear
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676135]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676138] Pid: 1195, comm:
xfsaild/sdn3 Tainted: G C 3.2.0-60-generic #91-Ubuntu Dell Inc. PowerEdge
R720xd/0VWT90
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676144] RIP:
0010:[<ffffffff8103ebc2>] [<ffffffff8103ebc2>] __ticket_spin_lock+0x22/0x30
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676154] RSP: 0018:ffff880c048e9c20
EFLAGS: 00000297
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676157] RAX: 000000000000b512 RBX:
0000000000000000 RCX: 0000000000002238
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676160] RDX: 000000000000b513 RSI:
ffff880c0682a6a8 RDI: ffff880c046c6f98
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676163] RBP: ffff880c048e9c20 R08:
0000000000000001 R09: 0000000000000000
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676166] R10: ffff880c048e9a24 R11:
0000000000000000 R12: 0000000000000001
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676169] R13: ffff88012e209b00 R14:
0003000100000133 R15: ffff880c048e9b88
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676173] FS: 0000000000000000(0000)
GS:ffff880617d20000(0000) knlGS:0000000000000000
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676176] CS: 0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676179] CR2: 00000000013bb730 CR3:
0000000001c05000 CR4: 00000000000406e0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676182] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676186] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676189] Process xfsaild/sdn3 (pid:
1195, threadinfo ffff880c048e8000, task ffff880c04905c00)
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676192] Stack:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678624] ffff880c048e9c30
ffffffff81661cce ffff880c048e9c50 ffffffffa01b82f4
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678632] ffff880c0682a600
ffff880c0682a600 ffff880c048e9c70 ffffffffa01b334e
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678639] ffff880c048e9cc0
ffff880c06ced200 ffff880c048e9ca0 ffffffffa01b3423
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678647] Call Trace:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681568] [<ffffffff81661cce>]
_raw_spin_lock+0xe/0x20
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681603] [<ffffffffa01b82f4>]
xfs_ail_min_lsn+0x24/0x60 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681628] [<ffffffffa01b334e>]
xlog_assign_tail_lsn+0x1e/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681649] [<ffffffffa01b3423>]
xlog_state_release_iclog+0xb3/0xd0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681671] [<ffffffffa01b40a6>]
_xfs_log_force+0x126/0x2a0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681676] [<ffffffff810917f6>] ?
down_trylock+0x36/0x50
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681697] [<ffffffffa01b4238>]
xfs_log_force+0x18/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681712] [<ffffffffa015d0a0>]
xfs_buf_trylock+0xd0/0xe0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681733] [<ffffffffa01b6108>]
xfs_buf_item_trylock+0x58/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681755] [<ffffffffa01b869a>]
xfsaild_push+0x24a/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681777] [<ffffffffa01b8b32>]
xfsaild+0x62/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681797] [<ffffffffa01b8ad0>] ?
xfsaild_push+0x680/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681803] [<ffffffff8108b8ac>]
kthread+0x8c/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681810] [<ffffffff8166c474>]
kernel_thread_helper+0x4/0x10
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681815] [<ffffffff8108b820>] ?
flush_kthread_worker+0xa0/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681821] [<ffffffff8166c470>] ?
gs_change+0x13/0x13
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681823] Code: 90 90 90 90 90 90 90
90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66
0f 1f 84 00 00 00 00 00 f3 90 <0f> b7 07 66 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17
55 31 c0 48 89
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703682] Call Trace:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703688] [<ffffffff81661cce>]
_raw_spin_lock+0xe/0x20
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703712] [<ffffffffa01b82f4>]
xfs_ail_min_lsn+0x24/0x60 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703732] [<ffffffffa01b334e>]
xlog_assign_tail_lsn+0x1e/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703752] [<ffffffffa01b3423>]
xlog_state_release_iclog+0xb3/0xd0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703772] [<ffffffffa01b40a6>]
_xfs_log_force+0x126/0x2a0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703777] [<ffffffff810917f6>] ?
down_trylock+0x36/0x50
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703796] [<ffffffffa01b4238>]
xfs_log_force+0x18/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703810] [<ffffffffa015d0a0>]
xfs_buf_trylock+0xd0/0xe0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703830] [<ffffffffa01b6108>]
xfs_buf_item_trylock+0x58/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703850] [<ffffffffa01b869a>]
xfsaild_push+0x24a/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703870] [<ffffffffa01b8b32>]
xfsaild+0x62/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703890] [<ffffffffa01b8ad0>] ?
xfsaild_push+0x680/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703894] [<ffffffff8108b8ac>]
kthread+0x8c/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703900] [<ffffffff8166c474>]
kernel_thread_helper+0x4/0x10
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703905] [<ffffffff8108b820>] ?
flush_kthread_worker+0xa0/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703910] [<ffffffff8166c470>] ?
gs_change+0x13/0x13
Description
Description
Details
Details
- Reference
- rt8249
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Invalid | fgiunchedi | T84385 swift backend machines load spike: cause and remediation | |||
Invalid | fgiunchedi | T117972 swift upgrade plans: jessie and swift 2.x | |||
Duplicate | None | T147162 upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges | |||
Resolved | fgiunchedi | T151648 Implement storage policies for swift |
Event Timeline
Comment Actions
Subject changed from 'swift backend machines load spike cause and remediation' to 'swift backend machines load spike: cause and remediation' by fgiunchedi
Comment Actions
On Mon Sep 01 12:48:58 2014, fgiunchedi wrote:
swift backend machines every now and then experience a lockup in xfs
code that
makes them unable to process traffic and the load average skyrockets.
a sample
backtrace can be found below.next steps:
- add a nagios check for load average, so we are at least aware this
is going
on
submitted in I4e1c26395
Comment Actions
what's left here is xfs bug(s) sending load average through the root, blocking with T117972: swift upgrade plans: jessie and swift 2.x to be checked again once we're running on linux 4.x