
Tool Labs: shared Pywikibot code not available
Closed, Resolved, Public

Description

I configured my bot on Tool Labs to use the shared Pywikibot code available in the directory /shared/pywikipedia/core.
For about 5 hours now, the shared code has not been available. See the log:

https://tools.wmflabs.org/ato/log/szubcsonk.txt

Event Timeline

Incola raised the priority of this task from to Needs Triage.
Incola updated the task description.
Incola added a project: Toolforge.
Incola subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

As far as I can see it's there:

valhallasw@tools-bastion-01:~$ ls /shared/pywikipedia/core
ChangeLog                LICENSE                    scripts
CREDITS                  mwparserfromhell           setup.py
dev-requirements.txt     pwb.py                     tests
Dockerfile               pywikibot                  tox.ini
docs                     README-conversion.txt      user-config.py.sample
ez_setup.py              README.rst                 user-fixes.py.sample
generate_family_file.py  requests-requirements.txt
generate_user_files.py   requirements.txt

On which host did you notice it missing? It could be some sort of an NFS caching issue, for example.

I have also been receiving error messages for a couple of hours:
"python: can't open file '/shared/pywikipedia/core/pwb.py': [Errno 2] No such file or directory"

The problem was on the host tools-bastion-01 but also on the hosts that run the jobs submitted with jsub. This is the log file of a bot task that runs every five minutes: http://tools.wmflabs.org/incolabot/bar.php
The problem started at 02:05 CET, when the log shows:

Traceback (most recent call last):
  File "/data/project/incolabot/bar.py", line 13, in <module>
    import os, pywikibot
ImportError: No module named pywikibot

(The OAuth errors were caused by my unsuccessful attempt to configure OAuth)

However now I am able to see again the content of the shared directory.

I checked and it's working for me. As a wild guess, it may have been a permissions issue on the folder that prevented access to the file.

Yes, me too. It is working again.

scfc claimed this task.
scfc subscribed.

I ran cat /shared/pywikipedia/core/pwb.py > /dev/null on all instances, and it succeeded on all bastions and execution nodes.
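A check like the `cat` above can also be run from inside a bot before it starts, so the log shows a clear reason instead of a traceback. The sketch below is a hypothetical Python helper (`probe_readable` is not part of Pywikibot or Toolforge); it distinguishes the two failures reported in this thread: Errno 2 (file missing) and Errno 13 (permission denied).

```python
import errno

# Path taken from this task; pass another path to probe a different file.
SHARED_PWB = "/shared/pywikipedia/core/pwb.py"


def probe_readable(path=SHARED_PWB):
    """Open `path` and read one byte, roughly what `cat path > /dev/null` checks.

    Returns "ok", "missing" (ENOENT, Errno 2), "permission denied"
    (EACCES, Errno 13), or "error: <ERRNO_NAME>" for any other OSError.
    """
    try:
        with open(path, "rb") as fh:
            fh.read(1)
        return "ok"
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return "missing"
        if exc.errno == errno.EACCES:
            return "permission denied"
        # Map the numeric errno to its symbolic name when one exists.
        return "error: " + errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    print(probe_readable())
```

A wrapper submitted with jsub could call `probe_readable()` first and exit early with the returned string when it is not `"ok"`.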

Ato_01 triaged this task as Medium priority.
Ato_01 updated the task description.
Ato_01 raised the priority of this task from Medium to Needs Triage.
Ato_01 triaged this task as Medium priority.
Ato_01 raised the priority of this task from Medium to Needs Triage.
Ato_01 triaged this task as Medium priority.

This has been happening multiple times per month, sometimes more than once a week. When it happens, it can be fixed simply by re-running the nightly job on the Tool Labs pywikibot account, but this has to be done manually. It would be much better if the script could be modified to detect failures and start over.
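A minimal sketch of "detect failures and start over", assuming the nightly job can be launched as a single command. `run_with_retries` and its `verify` hook are hypothetical names, not part of the existing nightly script; `verify` covers the case where the job exits successfully but still leaves files missing.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_s=60.0, verify=None):
    """Run `cmd` (an argv list) until it succeeds, at most `attempts` times.

    An attempt counts as successful only if the exit code is 0 and, when
    given, `verify()` returns True (e.g. "is pwb.py present again?").
    Returns the CompletedProcess of the successful attempt; raises
    RuntimeError once every attempt has failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last = None
    for attempt in range(1, attempts + 1):
        last = subprocess.run(cmd, capture_output=True, text=True)
        if last.returncode == 0 and (verify is None or verify()):
            return last
        if attempt < attempts:
            time.sleep(delay_s)  # wait a little before starting over
    raise RuntimeError(
        f"{cmd!r} failed {attempts} time(s); last exit code {last.returncode}: "
        f"{last.stderr.strip()}"
    )
```

For example, `run_with_retries(["bash", "nightly.sh"], verify=lambda: os.path.exists("/shared/pywikipedia/core/pwb.py"))`, where `nightly.sh` is a placeholder name. Retrying only helps if each attempt starts from a clean state; with delete-then-clone, the shared directory can still be empty while a retry is running.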

Urbanecm raised the priority of this task from Medium to High. (Edited May 22 2016, 11:40 AM)
Urbanecm subscribed.

No PWB scripts are currently present, which stops all PWB-based scripts from running. I think this is at least high priority.

EDIT
The scripts are present now, but please fix this issue so they won't disappear again. It could stop a lot of bots, so I still think it's high priority.

I think the main issue is a combination of delete-then-clone plus slow NFS. It's not entirely clear to me whether the script fails halfway or whether it's just very slow.

I think we should do the following:

  1. clone, git gc, tar in /tmp rather than on NFS,
  2. once done, move those files to NFS, but not in their new location yet
  3. rename the old files to .old, rename the new files to the correct name (not entirely atomic, but what can one do)
  4. remove the old files
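The four steps above can be sketched in Python as follows. This is an illustration under assumptions, not the rewritten nightly code: `stage_and_swap` and `git_build` are hypothetical helpers, the repository URL is left to the caller, and the tar part of step 1 is omitted.

```python
import os
import shutil
import subprocess
import tempfile


def git_build(repo_url):
    """Hypothetical build step: clone `repo_url` into `dest`, then run `git gc`."""
    def build(dest):
        subprocess.run(["git", "clone", repo_url, dest], check=True)
        subprocess.run(["git", "-C", dest, "gc"], check=True)
    return build


def stage_and_swap(build, target):
    """Build locally, copy next to `target` on NFS, then swap by renaming.

    `build` fills the empty directory it is given; `target` is the live
    directory (e.g. /shared/pywikipedia/core).
    """
    target = os.path.abspath(target)
    new_dir = target + ".new"
    old_dir = target + ".old"

    # Step 1: do the slow work in the local temp dir (usually /tmp), not on NFS.
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(target))
        os.mkdir(local)
        build(local)  # an exception here leaves `target` untouched
        # Step 2: copy the finished tree onto NFS, beside the live copy.
        shutil.rmtree(new_dir, ignore_errors=True)  # leftover from a failed run
        shutil.copytree(local, new_dir, symlinks=True)

    # Step 3: two renames on the same filesystem (not one atomic operation).
    shutil.rmtree(old_dir, ignore_errors=True)
    if os.path.exists(target):
        os.rename(target, old_dir)
    os.rename(new_dir, target)

    # Step 4: remove the previous copy.
    shutil.rmtree(old_dir, ignore_errors=True)
```

Because the build runs inside a temporary directory, a failed clone or `git gc` leaves the live `target` as it was; the shared code can only be missing in the short window between the two `os.rename` calls in step 3.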

I have rewritten parts of the nightly code to be more fault-resistant, and I hope this will solve the issues. I may have introduced other issues inadvertently, but I hope not :-)

For about 3 hours I have not been able to access '/shared/pywikipedia/core', and I am receiving the following messages from my bot:
python: can't open file '/shared/pywikipedia/core/pwb.py': [Errno 13] Permission denied

There was indeed a permissions mixup on /data/project/pywikibot/public_html, which should now also be fixed...

Thank you. Let's see how it works over the next couple of days. :)

valhallasw removed scfc as the assignee of this task.
valhallasw moved this task from Backlog to Ready to be worked on on the Toolforge board.