
fuse-dfs problems on stat1002
Closed, Resolved, Public

Description

Load on stat1002 went up massively last night (to around 600), apparently caused by processes continuously accessing a hung HDFS mount (/mnt/hdfs). The kernel logged the following hung-task warning to syslog:

[6520848.660930] INFO: task fuse_dfs:1450 blocked for more than 120 seconds.
[6520848.667719] Not tainted 3.13.0-65-generic #105-Ubuntu
[6520848.673484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6520848.681485] fuse_dfs D ffff88080fc13180 0 1450 1 0x00000000
[6520848.681496] ffff8810018ddde8 0000000000000086 ffff881001490000 ffff8810018ddfd8
[6520848.681503] 0000000000013180 0000000000013180 ffff881001490000 ffff881001490000
[6520848.681514] ffff8810009f4d60 ffff8810009f4d68 00007fd78bbf3000 ffff881001490000
[6520848.681523] Call Trace:
[6520848.681534] [<ffffffff81728389>] schedule+0x29/0x70
[6520848.681541] [<ffffffff8172b215>] rwsem_down_read_failed+0xf5/0x150
[6520848.681548] [<ffffffff813732a4>] call_rwsem_down_read_failed+0x14/0x30
[6520848.681554] [<ffffffff8172a9e0>] ? down_read+0x20/0x30
[6520848.681559] [<ffffffff8109f1f2>] task_numa_work+0xd2/0x300
[6520848.681566] [<ffffffff81088557>] task_work_run+0xa7/0xe0
[6520848.681573] [<ffffffff81013ed7>] do_notify_resume+0x97/0xb0
[6520848.681578] [<ffffffff8172c5e2>] retint_signal+0x48/0x86
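
A hung FUSE mount like this one can be detected before load piles up by probing it with a timeout. The following is a minimal sketch, not what was run on stat1002: the function name `mount_responds` and the 5-second default are assumptions chosen for illustration. The stat call runs in a child process so that a hung mount blocks the child rather than the caller.

```python
import subprocess
import sys


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if a stat() of `path` completes within `timeout_s` seconds.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    # Run the stat in a child process so a hung mount blocks the child, not us.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat succeeded; non-zero means it failed
        # (for example, the path does not exist).
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible sleep (state D)
        # ignores SIGKILL until the kernel call returns, so do not wait on it.
        child.kill()
        return False


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/mnt/hdfs"
    status = "ok" if mount_responds(path) else "hung or unreachable"
    print(f"{path}: {status}")
```

Each probe of a hung mount may leave one blocked child behind, so a check like this should run at a low frequency rather than in a tight loop.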

Event Timeline

MoritzMuehlenhoff raised the priority of this task to Needs Triage.
MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff added a project: SRE.

I fixed the mount manually, see SAL:

09:32 moritzm: umounted/remounted hdfs mount on stat1002 (got stuck due to kernel bug, see T121492)

But the load was still excessive later on and Faidon eventually rebooted the system:

11:33 paravoid: force-rebooting stat1002, kernel borked because of fuse

Ottomata triaged this task as Medium priority.
Ottomata set Security to None.