Page MenuHomePhabricator

swift backend machines load spike: cause and remediation
Closed, InvalidPublic

Description

swift backend machines every now and then experience a lockup in xfs code that
makes them unable to process traffic and the load average skyrockets. a sample
backtrace can be found below.
next steps:
- add a nagios check for load average, so we are at least aware this is going
on
- report the bug upstream (I wasn't able to find a similar bug report) if
anything to know if that might be fixed by recent (e.g. trusty) kernel versions
sample trace:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.668408] BUG: soft lockup - CPU#18
stuck for 22s! [xfsaild/sdn3:1195]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676069] Modules linked in: 8021q
garp bridge stp xfs dcdbas sb_edac edac_core shpchp joydev mei(C)
acpi_power_meter wmi mac_hid lp parport usbhid hid raid10 raid456 async_pq
async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0
megaraid_sas tg3 multipath linear
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676106] CPU 18
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676107] Modules linked in: 8021q
garp bridge stp xfs dcdbas sb_edac edac_core shpchp joydev mei(C)
acpi_power_meter wmi mac_hid lp parport usbhid hid raid10 raid456 async_pq
async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0
megaraid_sas tg3 multipath linear
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676135]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676138] Pid: 1195, comm:
xfsaild/sdn3 Tainted: G C 3.2.0-60-generic #91-Ubuntu Dell Inc. PowerEdge
R720xd/0VWT90
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676144] RIP:
0010:[<ffffffff8103ebc2>] [<ffffffff8103ebc2>] __ticket_spin_lock+0x22/0x30
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676154] RSP: 0018:ffff880c048e9c20
EFLAGS: 00000297
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676157] RAX: 000000000000b512 RBX:
0000000000000000 RCX: 0000000000002238
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676160] RDX: 000000000000b513 RSI:
ffff880c0682a6a8 RDI: ffff880c046c6f98
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676163] RBP: ffff880c048e9c20 R08:
0000000000000001 R09: 0000000000000000
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676166] R10: ffff880c048e9a24 R11:
0000000000000000 R12: 0000000000000001
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676169] R13: ffff88012e209b00 R14:
0003000100000133 R15: ffff880c048e9b88
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676173] FS: 0000000000000000(0000)
GS:ffff880617d20000(0000) knlGS:0000000000000000
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676176] CS: 0010 DS: 0000 ES: 0000
CR0: 000000008005003b
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676179] CR2: 00000000013bb730 CR3:
0000000001c05000 CR4: 00000000000406e0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676182] DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676186] DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676189] Process xfsaild/sdn3 (pid:
1195, threadinfo ffff880c048e8000, task ffff880c04905c00)
Aug 31 06:29:03 ms-be1010 kernel: [3152626.676192] Stack:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678624] ffff880c048e9c30
ffffffff81661cce ffff880c048e9c50 ffffffffa01b82f4
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678632] ffff880c0682a600
ffff880c0682a600 ffff880c048e9c70 ffffffffa01b334e
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678639] ffff880c048e9cc0
ffff880c06ced200 ffff880c048e9ca0 ffffffffa01b3423
Aug 31 06:29:03 ms-be1010 kernel: [3152626.678647] Call Trace:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681568] [<ffffffff81661cce>]
_raw_spin_lock+0xe/0x20
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681603] [<ffffffffa01b82f4>]
xfs_ail_min_lsn+0x24/0x60 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681628] [<ffffffffa01b334e>]
xlog_assign_tail_lsn+0x1e/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681649] [<ffffffffa01b3423>]
xlog_state_release_iclog+0xb3/0xd0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681671] [<ffffffffa01b40a6>]
_xfs_log_force+0x126/0x2a0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681676] [<ffffffff810917f6>] ?
down_trylock+0x36/0x50
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681697] [<ffffffffa01b4238>]
xfs_log_force+0x18/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681712] [<ffffffffa015d0a0>]
xfs_buf_trylock+0xd0/0xe0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681733] [<ffffffffa01b6108>]
xfs_buf_item_trylock+0x58/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681755] [<ffffffffa01b869a>]
xfsaild_push+0x24a/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681777] [<ffffffffa01b8b32>]
xfsaild+0x62/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681797] [<ffffffffa01b8ad0>] ?
xfsaild_push+0x680/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681803] [<ffffffff8108b8ac>]
kthread+0x8c/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681810] [<ffffffff8166c474>]
kernel_thread_helper+0x4/0x10
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681815] [<ffffffff8108b820>] ?
flush_kthread_worker+0xa0/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681821] [<ffffffff8166c470>] ?
gs_change+0x13/0x13
Aug 31 06:29:03 ms-be1010 kernel: [3152626.681823] Code: 90 90 90 90 90 90 90
90 90 55 b8 00 00 01 00 48 89 e5 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 74 13 66
0f 1f 84 00 00 00 00 00 f3 90 <0f> b7 07 66 39 d0 75 f6 5d c3 0f 1f 40 00 8b 17
55 31 c0 48 89
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703682] Call Trace:
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703688] [<ffffffff81661cce>]
_raw_spin_lock+0xe/0x20
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703712] [<ffffffffa01b82f4>]
xfs_ail_min_lsn+0x24/0x60 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703732] [<ffffffffa01b334e>]
xlog_assign_tail_lsn+0x1e/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703752] [<ffffffffa01b3423>]
xlog_state_release_iclog+0xb3/0xd0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703772] [<ffffffffa01b40a6>]
_xfs_log_force+0x126/0x2a0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703777] [<ffffffff810917f6>] ?
down_trylock+0x36/0x50
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703796] [<ffffffffa01b4238>]
xfs_log_force+0x18/0x40 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703810] [<ffffffffa015d0a0>]
xfs_buf_trylock+0xd0/0xe0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703830] [<ffffffffa01b6108>]
xfs_buf_item_trylock+0x58/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703850] [<ffffffffa01b869a>]
xfsaild_push+0x24a/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703870] [<ffffffffa01b8b32>]
xfsaild+0x62/0xc0 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703890] [<ffffffffa01b8ad0>] ?
xfsaild_push+0x680/0x680 [xfs]
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703894] [<ffffffff8108b8ac>]
kthread+0x8c/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703900] [<ffffffff8166c474>]
kernel_thread_helper+0x4/0x10
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703905] [<ffffffff8108b820>] ?
flush_kthread_worker+0xa0/0xa0
Aug 31 06:29:03 ms-be1010 kernel: [3152626.703910] [<ffffffff8166c470>] ?
gs_change+0x13/0x13

Details

Reference
rt8249

Event Timeline

rtimport raised the priority of this task from to Medium.Dec 18 2014, 2:14 AM
rtimport added a project: ops-core.
rtimport set Reference to rt8249.

Subject changed from 'swift backend machines load spike cause and remediation' to 'swift backend machines load spike: cause and remediation' by fgiunchedi

AdminCc jeremyb added by jeremyb

On Mon Sep 01 12:48:58 2014, fgiunchedi wrote:

swift backend machines every now and then experience a lockup in xfs
code that
makes them unable to process traffic and the load average skyrockets.
a sample
backtrace can be found below.

next steps:

  • add a nagios check for load average, so we are at least aware this

is going
on

submitted in I4e1c26395

fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".Jun 16 2016, 10:28 AM
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".
fgiunchedi set Security to None.

what's left here is xfs bug(s) sending load average through the root, blocking with T117972: swift upgrade plans: jessie and swift 2.x to be checked again once we're running on linux 4.x

In the meantime XFS got better and load average is generally legit load on the host