Page MenuHomePhabricator

Attribute cache issue with NFS on Trusty
Closed, InvalidPublic

Description

The workflow:

  1. cronjob runs a bash script under trusty, with jsub -l release=trusty ~/exec.sh
  2. bash script exports the Ruby bins to PATH, sets the local ruby, then runs the ruby script, via:
export PATH="$HOME/.rbenv/bin:$PATH"
export PATH=$PATH:/data/project/musikbot/.rbenv/versions/2.1.0-dev/bin
rbenv local 2.1.0-dev
cd ~/MusikBot
ruby -d core.rb

Sometimes this is successful, sometimes it is not. The cron runs every 10 minutes but the Ruby script only successfully is ran about every 60 to 90 minutes. There is seemingly no rhyme or reason to the timing, but it is persistent. This has been going on for about two days.

There are two error logs, dash.err for the cron and exec.err for the bash script. Both have similar output:

Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/rubygems.rb:1194 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/rubygems.rb:1203 - cannot load such file -- rubygems/defaults/ruby
Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- mediawiki-gateway
Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- unicode
Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:55 - cannot load such file -- unicode/2.1/unicode_native
Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/rubygems/core_ext/kernel_require.rb:141 - cannot load such file -- unicode/2.1/unicode_native
Using Ext extension for JSON.
...
Exception `Errno::ENOENT' at /data/project/musikbot/.gem/ruby/2.1.0/gems/netrc-0.10.3/lib/netrc.rb:43 - No such file or directory @ rb_file_s_stat - /data/project/musikbot/.netrc
Exception `OpenSSL::SSL::SSLErrorWaitReadable' at /data/project/musikbot/.rbenv/versions/2.1.0-dev/lib/ruby/2.1.0/openssl/buffering.rb:182 - read would block
...
Exception `LoadError' at ruby - Permission denied -- core.rb
ruby: Permission denied -- core.rb (LoadError)

This output is from the -d flag used when running the Ruby script. Without the debugging output you see only ruby: Permission denied -- core.rb (LoadError). It is unclear if this is actually a permissions issue as I've made everything in the entire directory read/write/exec with chmod 0777.

This issue happened to me about two weeks ago, and somehow resolved itself. However, that time I was getting the error about permission denied when attempting to write to perm_clerk.log which is the log file for the Ruby script. That file also had full rwx permissions. I'm not sure if that issue is related to this one, but thought I'd mention it.

Any help is greatly appreciated!

Details

Related Gerrit Patches:

Event Timeline

MusikAnimal raised the priority of this task from to Needs Triage.
MusikAnimal updated the task description. (Show Details)
MusikAnimal added a project: Toolforge.
MusikAnimal added a subscriber: MusikAnimal.
Restricted Application added a project: Cloud-Services. · View Herald TranscriptJul 17 2015, 5:34 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
MusikAnimal set Security to None.Jul 17 2015, 5:35 PM
MusikAnimal added a subscriber: coren.

@coren, CC'ing you only because you helped me set this up when I first got the bot going. Thank you!!

MusikAnimal updated the task description. (Show Details)Jul 17 2015, 5:38 PM
MusikAnimal updated the task description. (Show Details)
scfc added a subscriber: scfc.Jul 23 2015, 9:37 PM

(I have removed the o+w right for ~tools.musikbot/exec.sh and (recursively) ~tools.musikbot/MusikBot. I couldn't find any files not belonging to that user, so world-writable only causes high blood pressure :-).)

Looking at exec.sh:

export PATH="$HOME/.rbenv/bin:$PATH"
export PATH=$PATH:/data/project/musikbot/.rbenv/versions/2.1.0-dev/bin

This would mean that when ruby is called, it looks in ~/.rbenv/bin (no ruby there) and then (inter alia) /usr/bin and using that. /data/project/musikbot/.rbenv/versions/2.1.0-dev/bin is never searched.

I don't know how Ruby virtual environments work, but if depends on the ruby binary in /data/project/musikbot/.rbenv/versions/2.1.0-dev/bin, then that directory must be prepended to $PATH, i. e. (untested):

export PATH=/data/project/musikbot/.rbenv/versions/2.1.0-dev/bin:$PATH

Could you please try that?

scfc moved this task from Triage to In Progress on the Toolforge board.Jul 23 2015, 9:49 PM

I suspect this is a stale NFS file handle issue, I had a similar problem today.

The problem MusikAnimal has/had is intermitting, getting Permission denied sometimes, while it works 20 minutes later with no change.

Today (around 21:02 UTC, for about an hour), I had a similar issue. /data/project/crosswatch/crosswatch/backend/config.py could not be read, but only on tools-webgrid-generic-1404. Other webgrid-generic servers and bastion servers worked fine. The problem has resolved by itself, so I couldn't have a closer look.

@notconfusing mentioned something similar on IRC, if I understood it right:

[00:07:24] <notconfusing> If I make a file that required heavy processing on an instance, but I want my tool to be able to see it, where should I put it, don't they share a filesystem? so i don't need to copy it over?
[00:07:53] <notconfusing> howerevr is chmod 755'd the dir on my instance, and my tool says that permission is denied

The problem does indeed come and go, and lately (past several days) I don't think I've had any issues. I have a log file for the Ruby script and I see that it has been successfully running in full.

The weird thing is... when I just checked the bash script log (exec.err) I see the same errors I thought were preventing the script from being ran! E.g. cannot load such file -- rubygems/defaults/operating_system etc etc... I am not seeing "permission dendied", however. This tells me all those LoadErrors are unrelated to the issue I was having when I opened this task.

@Sitic's theory makes sense, I just feel like if it were an NFS issue that severe (a whole day's worth of outage) than we'd have other people reporting this. That's why I thought it was strictly Ruby-related, as to my knowledge there are few (if any) other Ruby projects running on tool labs. Did you ever run into the permissions issues for that extended period of time (a day or more)?

@scfc I see what you mean about how the Ruby I want should come first in the PATH. I've changed the bash script as recommended, but I still see the same errors, so I don't think that was it. Also the system Ruby is 1.9.3 and I'm using Ruby 2 syntax, so it would have to have found the Ruby in my ~/.rbenv or else the script would have never ran.

Thank you for the help! Should we reach out to those who manage the NFS mount? Is there something we can refer to that monitors its stability?

scfc added a comment.Jul 24 2015, 5:25 PM

@notconfusing's question was related to inter-project data transfer, AFAIUI.

There is certainly the possibility of an issue with (lack of) NFS synchronicity (CMIIW), but in this case I just gathered a list of all execution nodes from qconf -shgrp \@general and qconf -shgrp \@webgrid and tested that on all instances ~tools.musikbot/MusikBot/core.rb can be read without error by the user tools.musikbot:

scfc@tools-bastion-01:~$ while read host; do ssh "$host" 'sudo sudo -iu tools.musikbot cat ~tools.musikbot/MusikBot/core.rb > /dev/null' < /dev/null; done < hostlist.txt 
scfc@tools-bastion-01:~$

So I find it unlikely (while adding a call to hostname in exec.sh obviously might help for debugging); also (and I don't want this to sound dismissive, but whatcha gonna do) most tools are very complex, in this case more than 1 kLOCs. The possibility that there is an error in the infrastructure that makes a tool fail in 5 out of 6 or 8 out of 9 cases, but otherwise remains undetected, is very low, while it is far more probable that the tool fails in 6 out of 6 or 9 out of 9 cases, and the successes are due to manual interventions or human misinterpretation, because in a complex tool "failure" or "success" are usually not defined.

scfc added a comment.Jul 24 2015, 5:56 PM

@MusikAnimal, sorry, I posted my reply without checking if comments had come in since I opened the browser's tab.

I just looked at qacct -j exec -o tools.musikbot | less as I thought it might be related to memory consumption but on average it seems to take about 120 MBytes (default allocation is 256 MBytes) and I didn't notice any weird exit_status.

But there are a number of exit_status 1 there and I have no clue how to debug this if it isn't clear what causes the "permission denied" errors (i. e., which file or other "thing" is tried to be accessed).

valhallasw triaged this task as Normal priority.Aug 2 2015, 1:18 PM
valhallasw added a subscriber: valhallasw.

This started happening again today. Same errors as reported above, and again I'm able to execute the script directly without any issues. The last time it ran successfully via the cronjob was at 21 August at 17:40 GMT.

Are we still led to believe this is NFS-related?

scfc added a comment.Aug 22 2015, 4:06 AM

When I look at ~tools.musikbot/exec.err, the last errors are:

Exception `Errno::EACCES' at /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:628 - Permission denied @ rb_sysopen - perm_clerk.log
/data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:628:in `initialize': Permission denied @ rb_sysopen - perm_clerk.log (Errno::EACCES)
        from /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:628:in `open'
        from /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:628:in `open_logfile'
        from /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:584:in `initialize'
        from /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:318:in `new'
        from /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/logger.rb:318:in `initialize'
        from core.rb:53:in `new'

Those are not the same as above. According to /var/lib/gridengine/default/common/accounting, the job exec was last executed on tools-exec-1408 and tools-exec-1409. When I ssh there and become tools.musikbot, I can read and write ~tools.musikbot/MusikBot/perm_clerk.log without problems.

Which leads me to believe that your bot doesn't try to write to ~tools.musikbot/MusikBot/perm_clerk.log, but is in another directory where it doesn't have writing privileges and tries to write there. Or it could be a thousand other causes.

So to repeat myself: I have no clue how to debug this if it isn't clear what causes the "permission denied" errors (i. e., which file or other "thing" is tried to be accessed).

scfc added a comment.Aug 22 2015, 4:12 AM

And a closer look at ~tools.musikbot/exec.err shows multiple instances of:

Exception `LoadError' at /data/project/musikbot/.rbenv/versions/2.2.1/lib/ruby/2.2.0/rubygems/core_ext/kernel_require.rb:54 - cannot load such file -- mysql2
libgcc_s.so.1 must be installed for pthread_cancel to work
/data/project/musikbot/exec.sh: line 5: 18923 Aborted                 (core dumped) ruby -d core.rb

which is referenced in the Help:Tool Labs as an indicator that a job runs out of memory.

@scfc thanks! You are correct it wasn't the same error, but similar in nature, I guess is what I meant. I would get permission errors about various files that were fully read/write/executable, and gems that wouldn't load. So perhaps it was failing at different places. Running out of memory offers a good explanation as to why it would so unpredictable in this way... So I tagged on a -mem 250m to the jsub call and viola, the bot ran successfully at 4:20 GMT on 22 August and has been ever since!!! However when I read the docs it says jobs by default get 256MB, so I'm guessing it is sheer coincidence that things started working again :/ For safe measure I upped it to 300MB, and so far so good. Whether this fixes it permanently or not is yet to be concluded, but either way I think you tremendously for your time in helping me!

Happening again. This time I'm getting Permission denied @ rb_sysopen - perm_clerk.log (Errno::EACCES) which I've seen before. I've tried chmod 0777 perm_clerk.log, deleting the file altogether and recreating it, and giving the cronjob 350m of memory (when before it was working with 256). Do we think it's still an NFS issue?

Could this be related to T112388?

coren added a comment.Sep 16 2015, 5:06 PM

@MusikAnimal: does your script chdir() at any point?

@coren the cronjob calls a bash script that does change a directory:

export PATH="$HOME/.rbenv/bin:$PATH"
export PATH="/data/project/musikbot/.rbenv/versions/2.2.1/bin:$PATH"
cd ~/MusikBot
rbenv local 2.2.1
ruby -d core.rb
valhallasw added a comment.EditedSep 16 2015, 5:15 PM

Seems NFS related:

$ for i in tools-exec-1401 tools-exec-1408 tools-exec-1409 tools-exec-1407; do ssh $i cat /data/project/musikbot/MusikBot/perm_clerk.log | wc -l; done
cat: /data/project/musikbot/MusikBot/perm_clerk.log: Permission denied
0
52
cat: /data/project/musikbot/MusikBot/perm_clerk.log: Permission denied
0
cat: /data/project/musikbot/MusikBot/perm_clerk.log: Permission denied
0

so tools-exec-1408 works, but the other hosts return a permission denied error.

And:

valhallasw@tools-exec-1401:~$ cat /data/project/musikbot/MusikBot/perm_clerk.log | wc -l
cat: /data/project/musikbot/MusikBot/perm_clerk.log: Permission denied
0
valhallasw@tools-exec-1401:~$ sudo mount -o remount /data/project
valhallasw@tools-exec-1401:~$ cat /data/project/musikbot/MusikBot/perm_clerk.log | wc -l
52
coren claimed this task.Sep 16 2015, 5:38 PM

There is definitely an issue, and at this time it appears to be with the NFS attribute cache. I'm investigating the matter.

coren added a comment.Sep 17 2015, 4:09 PM

I can confirm after quite a bit of testing that (a) the bug can only occur on files created on one client then accessed on another, (b) the symptom goes away after the first explicit stat() call on the file and (c) I am unable to reproduce those symptoms on Precise.

Right now, my working hypothesis is that there is a difference in the handling of the attribute cache between Trusty and Precise that - in some cases - results in the attribute cache keeping an incorrect result longer than it should.

I will be testing some NFS settings to alleviate this.

Awesome, thank you @coren and @valhallasw!

coren added a comment.Sep 17 2015, 4:44 PM

In the meantime, @MusikAnimal, you may be able to work around the issue temporarily by doing a stat() on problematic files before opening them. It sucks, and I am working on a real solution, but it's better than nothing.

Thanks @coren! The issue is happening again. How would I go about using the stat command? It appears to just provide file system information.

coren added a comment.Sep 18 2015, 3:35 PM

Thanks @coren! The issue is happening again. How would I go about using the stat command? It appears to just provide file system information.

It does, but in doing so it refreshes the attribute cache. I don't know how to do it in ruby, but it's probably something along the names of "stat(filename)"; and you can simply discard anything it returns - it's the side effect that counts.

coren moved this task from To do to Doing on the Labs-Sprint-115 board.
coren renamed this task from Permission issues and/or failure to load Ruby environment on trusty to Attribute cache issue with NFS on Trusty.Sep 24 2015, 2:38 PM
coren added a comment.Sep 28 2015, 3:01 PM

After some experimenting, there appear to be two plausible solutions:

  • mount /data/project (at least) with the noac NFS option; this forces attribute caching to be disabled entirely (and thus necessarily fixes the issue) - but this has a significant performance cost as any use of the inode information requires an extra roundtrip to NFS. The disk load on the NFS server, on the other hand, is zero or insignificant as inode caching server-side will continue.
  • mount /data/project with the lookupcache=none NFS option. This appears to fix the issue in practice in a three client test as it forces any new kernel lookup of dentry record to roundtrip to the NFS server, thus forcing an update of the attributes if the actual inode changed (as would be the case when the file has been replaced, which is our current triggering scenario). This has much lighter performance impact (as it only affects system calls that access files by name), but may not fix all possible scenarios.

In both cases, the fixes turn off performance features of the Linux kernel that violate the NFS close-to-open consistency semantics. While kernels beyond 3.8 seem to be much more aggressive about use of those features, it is likely that all modern Kernels may show those symptoms and thus the fix should be applied to all clients.
We will attempt the lookupcache fix first as it has the least performance impact and is known to impact exactly the scenario in the current bug.
Because NFS mount options are only parsed at mount time (and ignored in case of a remount because that does not invoke the NFS mounter), clients need to be rebooted. Right now, it is not possible to reboot execution nodes without disruption because of a bug in the service manifest (T109362); so that need to be scheduled until after that bug is fixed and verified.

Change 241663 had a related patch set uploaded (by coren):
Disable NFS lookup cache on NFS client instances

https://gerrit.wikimedia.org/r/241663

coren moved this task from To do to Code Review/Blocked on the labs-sprint-116 board.

Change 241663 merged by coren:
Disable NFS lookup cache on NFS client instances

https://gerrit.wikimedia.org/r/241663

coren added a comment.Oct 2 2015, 7:14 PM

Change is pushed; reboot of nodes required before it can take effect.

coren added a comment.Oct 8 2015, 3:06 PM

Ne version of webservicemonitor is deployed, removing the last impediment to the drain-reboot cycle. Planned for Fri Oct 9 with no outage.

coren closed this task as Resolved.Oct 9 2015, 8:15 PM

All Trusty exec nodes have been rebooted with the new NFS mount options - this should solve the issue (I was unable to reproduce)

@coren are we sure this has been resolved? I've been having a similar issue with my tools. Every time I kill the Unicorn server, it takes a million tries of running qstat on the files or deleting them and recreating them before I can the server back up. However this most recent restart has been unsuccessful.

I get a similar permission error Permission denied @ rb_sysopen - /data/project/musikanimal/wiki-tools/unicorn.rb (Errno::EACCES)

Right now none of my tools are running :(

Begging for help, my bot also relies on an API provided by the tools

I think right now that this is a distinct issue, but I'm reopening the ticket until that is determined for certain.

coren reopened this task as Open.Nov 2 2015, 6:10 PM
coren moved this task from To do to Doing on the labs-sprint-119 board.Nov 2 2015, 6:19 PM
coren closed this task as Resolved.Nov 5 2015, 1:27 PM

The issue is definitely distinct; no test case I've done managed to replicate the original issue (newly replaced files being inaccessible from a node different that the one it has been created on until an explicit stat() call).

That said, a survey of the tools' filesystem show no files with the odd permission set /data/project/musikanimal/wiki-tools/unicorn.rb ended up with (u=,go=rx) so until at least one more case of this happens it is not possible to ascertain what exactly happened.

@MusikAnimal: If you get files with inexplicably odd permissions again, please reopen this ticket and try to contact us on IRC before you change the file permissions back.

coren moved this task from Doing to Done on the labs-sprint-119 board.Nov 5 2015, 1:27 PM

@coren I'm planning to push an update to my tools tomorrow, and you are willing to accompany me that would be fantastic. This process has never gone smoothly, so perhaps we can together identify why I'm running into these issues so often. I will be sure not to modify any permissions. I'll ping you on IRC. Many thanks!

valhallasw reopened this task as Open.Jan 1 2016, 10:44 PM

This seems to have happened again, this time with /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php on tools-exec-1409. The behavior was the same: cat returned 'permission denied', but after an ls on the directory, the file could be read without issues.

And still happening on various other hosts:

tools.liangent-php@tools-bastion-01:~$ for i in {01..10}; do echo $i; ssh tools-exec-14$i wc -l /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php; done
01
wc: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php: Permission denied
02
294 /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php
03
wc: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php: Permission denied
04
wc: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php: Permission denied
05
wc: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php: Permission denied
06
wc: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php: Permission denied
07
294 /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php
08
wc: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php: Permission denied
09
294 /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php
10
294 /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php

tools-exec-12{01..10} seem unaffected again.

coren removed coren as the assignee of this task.Jan 21 2016, 5:20 PM
JeanFred moved this task from In Progress to Backlog on the Toolforge board.May 3 2016, 12:56 PM
JeanFred moved this task from Triage to In Progress on the Cloud-Services board.May 6 2016, 11:32 AM
JeanFred added a subscriber: JeanFred.

(Not sure how I changed that a few days ago)

Trusty is deprecated. If you can reproduce the issue with Jessie/Stretch, please re-open.