I configured my bot on Tool Labs to use the shared Pywikibot code that was available in the directory /shared/pywikipedia/core.
For about 5 hours the shared code has no longer been available. See log:
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T125505 Tool Labs: shared Pywikibot code not available
Invalid | | None | T106170 Attribute cache issue with NFS on Trusty
Resolved | | bd808 | T109362 continuous jobs killed during restart despite rescheduling
Event Timeline
As far as I can see it's there:
valhallasw@tools-bastion-01:~$ ls /shared/pywikipedia/core
ChangeLog                LICENSE                    scripts
CREDITS                  mwparserfromhell           setup.py
dev-requirements.txt     pwb.py                     tests
Dockerfile               pywikibot                  tox.ini
docs                     README-conversion.txt      user-config.py.sample
ez_setup.py              README.rst                 user-fixes.py.sample
generate_family_file.py  requests-requirements.txt
generate_user_files.py   requirements.txt
On which host did you notice it missing? It could be some sort of an NFS caching issue, for example.
I have also been receiving error messages for a couple of hours.
"python: can't open file '/shared/pywikipedia/core/pwb.py': [Errno 2] No such file or directory"
The problem was on the host tools-bastion-01 but also on the hosts that run the jobs submitted with jsub. This is the log file of a bot task that runs every five minutes: http://tools.wmflabs.org/incolabot/bar.php
The problem started at 02:05 CET, where the log shows:
ImportError: No module named pywikibot
Traceback (most recent call last):
  File "/data/project/incolabot/bar.py", line 13, in <module>
    import os, pywikibot
(The OAuth errors were caused by my unsuccessful attempt to configure OAuth)
However, I can now see the contents of the shared directory again.
I checked and it's working for me. As a wild guess, it may be a permission issue on the folder that prevents access to the file.
I ran cat /shared/pywikipedia/core/pwb.py > /dev/null on all instances, and it succeeded on all bastions and execution nodes.
This has been happening multiple times per month, sometimes more than once in a week. When it happens, it can be fixed simply by re-running the nightly job on the Tool Labs pywikibot account. However, this has to be done manually. It would be much better if the script could be modified to detect failures and start over.
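The "detect failures and start over" idea could look roughly like the sketch below. It is an illustration only, not the actual nightly script: `run_with_retries`, `job`, and `verify` are hypothetical names, and the check that the job succeeded (for example, that /shared/pywikipedia/core/pwb.py exists afterwards) is an assumption about what "failure" means here.

```python
import time


def run_with_retries(job, verify, attempts=3, delay=60.0, sleep=time.sleep):
    """Run job() until verify() reports success, at most `attempts` times.

    job    -- hypothetical callable that performs the nightly update
    verify -- hypothetical callable returning True when the result looks sane
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            job()
            if verify():
                return attempt
            print("attempt %d: job finished but verification failed" % attempt)
        except Exception as exc:  # any failure means "start over"
            print("attempt %d failed: %s" % (attempt, exc))
        if attempt < attempts:
            sleep(delay)  # give a slow NFS server time to recover
    raise RuntimeError("job still failing after %d attempts" % attempts)
```

A caller might pass `verify=lambda: os.path.isfile("/shared/pywikipedia/core/pwb.py")` so that a run which exits cleanly but leaves the directory empty is still treated as a failure.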
No PWB scripts are currently present, which stops all PWB-based scripts from running. I think this is at least high priority.
EDIT
The scripts are present now, but please fix this issue so they don't disappear again. It could stop a lot of bots, so I think it is still high priority.
I think the main issue is a combination of delete-then-clone plus slow NFS. It's not entirely clear to me whether the script fails halfway or whether it's just very slow.
I think we should do the following:
- clone, git gc, tar in /tmp rather than on NFS,
- once done, move those files to NFS, but not in their new location yet
- rename the old files to .old, rename the new files to the correct name (not entirely atomic, but what can one do)
- remove the old files
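The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the rewritten nightly code: `publish` and `build` are hypothetical names, `build` stands in for the clone / git gc / tar work, and the `.new` and `.old` suffixes are chosen here for the staging and backup names.

```python
import os
import shutil
import tempfile


def publish(build, target):
    """Build a tree on local disk, stage it beside `target`, then swap it in.

    build  -- hypothetical callable; build(path) must create the tree at `path`
    target -- final location, e.g. a directory on NFS
    """
    target = os.path.abspath(target)
    staging = target + ".new"  # assumed staging name
    old = target + ".old"      # assumed backup name

    # Clear leftovers from an earlier, interrupted run.
    for leftover in (staging, old):
        shutil.rmtree(leftover, ignore_errors=True)

    # 1. Do the slow work in a local temporary directory, not on NFS.
    workdir = tempfile.mkdtemp(prefix="pwb-build-")
    try:
        local_tree = os.path.join(workdir, "tree")
        build(local_tree)
        # 2. Copy the finished tree to the target filesystem under the
        #    staging name; the live directory is still untouched.
        shutil.copytree(local_tree, staging, symlinks=True)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

    # 3. Swap with two renames. Not atomic: `target` is missing for the
    #    short moment between them.
    if os.path.exists(target):
        os.rename(target, old)
    os.rename(staging, target)

    # 4. Remove the previous version.
    shutil.rmtree(old, ignore_errors=True)
```

If `build` raises, the exception propagates before any rename happens, so the existing directory stays in place. The window in which the target is missing shrinks from the duration of a full clone to the gap between two renames.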
I have rewritten parts of the nightly code to be more fault-resistant, and I hope this will solve the issues. I may have introduced other issues inadvertently, but I hope not :-)
For about 3 hours I have been unable to access '/shared/pywikipedia/core', and I am receiving the following messages from my bot:
python: can't open file '/shared/pywikipedia/core/pwb.py': [Errno 13] Permission denied
There was indeed a permissions mixup on /data/project/pywikibot/public_html, which should now also be fixed...