
wmf-auto-restart occasionally errors on fuse mounts
Closed, Resolved · Public

Description

We occasionally see the following error on systems with HDFS. We should update wmf-auto-restart so that it can ignore specific file systems.

lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs
      Output information may be incomplete.

Event Timeline

jbond created this task. Mar 5 2019, 11:29 AM
Restricted Application removed a project: Patch-For-Review. Mar 5 2019, 11:29 AM
jbond triaged this task as Medium priority. Mar 5 2019, 11:29 AM

Is that reproducible? Otherwise this might be caused by the stability issues we see with HDFS/fuse in general.

jbond added a comment. Edited · Mar 5 2019, 11:56 AM

This is reproducible, but not reliably. Some file operations on the fuse mount, e.g. ls -la /mnt/hdfs/tmp, seem to cause lsof to fail. It is almost certainly related to the HDFS fuse stability issues. I think we could remove this noise with any of the following options:

Suppress warnings:

lsof -w

Don't perform stat operations, and also suppress warnings, since using -b causes its own set of warnings:

lsof -bw

Exclude the mount point from stat operations:

lsof -e /mnt/hdfs

I would probably vote for lsof -bw, as it's simple to implement and I'm not sure we are that bothered about warnings with this tool.
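To make the three options easier to compare, here is a minimal sketch in Python of how an lsof argument list could be assembled from them. The function name and its parameters are hypothetical and are not taken from wmf-auto-restart; only the lsof flags -b, -w and -e come from the options listed above.

```python
from typing import Iterable, List


def build_lsof_command(exclude_mounts: Iterable[str] = (),
                       suppress_warnings: bool = False,
                       avoid_blocking: bool = False) -> List[str]:
    """Return an lsof argument list (hypothetical helper, for illustration)."""
    cmd = ["lsof"]
    if avoid_blocking:
        # -b: avoid kernel calls such as stat() that might block
        cmd.append("-b")
    if suppress_warnings:
        # -w: suppress warning messages
        cmd.append("-w")
    for mount in exclude_mounts:
        # -e PATH: exempt the file system mounted at PATH from those calls
        cmd.extend(["-e", mount])
    return cmd


if __name__ == "__main__":
    print(" ".join(build_lsof_command(exclude_mounts=["/mnt/hdfs"])))
```

Passing separate -b and -w flags is equivalent to the combined -bw form; the list form is only convenient for handing to subprocess without a shell.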

-w sounds good, but let's first check what kinds of errors lsof might warn about, so that we don't miss something important in the future.

Another possible angle (if lsof supports it, I didn't check) would be to exclude some directories from scanning entirely. The HDFS mount point doesn't contain any executables which might have a library reference, so omitting it altogether would also be a nice performance optimisation (the lsof runs take notably longer on e.g. the stat hosts compared to other production hosts).

jbond added a comment. Mar 6 2019, 10:34 AM

The last option (lsof -e) excludes mount points, which would work for this case. As far as I can see, directories can only be removed from the output, which wouldn't stop the warning from triggering.

Also, a very crude list of warnings:

$ strings /usr/bin/lsof | grep -i warn
%s: WARNING: can't stat() 
%s: WARNING: can't report offset; disregarding -o.
%s: WARNING: can't report file flags; disregarding +f.
%s: WARNING: unsupported format: %s
%s: WARNING: can't stat(
%s: WARNING: can't opendir(
%s: WARNING: can't lstat(
%s: WARNING: not a directory: 
%s: WARNING: no files found in directory: 
%s: WARNING: -S time (%d) changed to %d
%s: WARNING -- child process %d may be hung.
%s: WARNING: access %s: %s
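As an alternative to silencing everything with -w, a wrapper could drop only the stat() warning for known mount points and keep the rest of the list above visible. The following is a rough sketch of that idea, not what wmf-auto-restart does; the function name is hypothetical and the matching relies only on the message format shown in the task description.

```python
from typing import Iterable, List, Tuple

# Second line of the two-line warning quoted in the task description.
CONTINUATION = "Output information may be incomplete."


def split_lsof_stderr(stderr: str,
                      ignored_mounts: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split lsof stderr into (ignored, kept) lines (hypothetical helper)."""
    mounts = tuple(ignored_mounts)
    ignored: List[str] = []
    kept: List[str] = []
    previous_ignored = False
    for line in stderr.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == CONTINUATION:
            # The continuation line follows the fate of the warning before it.
            (ignored if previous_ignored else kept).append(stripped)
            previous_ignored = False
            continue
        is_stat_warning = "WARNING: can't stat()" in stripped
        on_ignored_mount = any(stripped.endswith(" " + m) for m in mounts)
        previous_ignored = is_stat_warning and on_ignored_mount
        (ignored if previous_ignored else kept).append(stripped)
    return ignored, kept


if __name__ == "__main__":
    sample = (
        "lsof: WARNING: can't stat() fuse.fuse_dfs file system /mnt/hdfs\n"
        "      Output information may be incomplete.\n"
    )
    ignored, kept = split_lsof_stderr(sample, ["/mnt/hdfs"])
    print(len(ignored), len(kept))
```

Anything that ends up in the kept list would still be reported, so a new class of warning would not be lost silently.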

Change 494764 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] Add config file and exclude_mounts options to debdeploy

https://gerrit.wikimedia.org/r/494764

Change 494765 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] Update wmf-auto-restarts to read exclude mounts from debdeploy config

https://gerrit.wikimedia.org/r/494765

Change 494764 merged by Jbond:
[operations/puppet@production] Add config file and exclude_mounts options to debdeploy

https://gerrit.wikimedia.org/r/494764

Change 494765 merged by Jbond:
[operations/puppet@production] Update wmf-auto-restarts to read exclude mounts from debdeploy config

https://gerrit.wikimedia.org/r/494765
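For readers who don't want to open the patches, reading such an option might look roughly like the sketch below. Only the option name exclude_mounts comes from the change titles above; the INI-style format, the section name and the comma-separated value are assumptions made for illustration, and the merged changes may differ.

```python
import configparser
from typing import List


def read_exclude_mounts(config_text: str,
                        section: str = "wmf-auto-restart") -> List[str]:
    """Parse an exclude_mounts option from INI-style text.

    The section name and the comma-separated format are assumptions.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback covers both a missing section and a missing option
    raw = parser.get(section, "exclude_mounts", fallback="")
    return [mount.strip() for mount in raw.split(",") if mount.strip()]


if __name__ == "__main__":
    example = "[wmf-auto-restart]\nexclude_mounts = /mnt/hdfs\n"
    print(read_exclude_mounts(example))
```

Each returned path could then be passed to lsof as a separate -e argument.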

Change 496223 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts

https://gerrit.wikimedia.org/r/496223

Change 496223 merged by Jbond:
[operations/puppet@production] Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts

https://gerrit.wikimedia.org/r/496223

Change 496405 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts

https://gerrit.wikimedia.org/r/496405

Change 496405 merged by Jbond:
[operations/puppet@production] Exclude /mnt/hdfs from lsof operations in wmf-auto-restarts

https://gerrit.wikimedia.org/r/496405

This has been deployed to all nodes which mount HDFS.

jbond closed this task as Resolved. Mar 15 2019, 10:51 AM