
Tool Labs: shared Pywikibot code not available
Closed, Resolved · Public

Description

I configured my bot on Tool Labs to use the shared Pywikibot code that was available in the directory /shared/pywikipedia/core.
For about 5 hours, the shared code has no longer been available. See the log:

https://tools.wmflabs.org/ato/log/szubcsonk.txt

Event Timeline

Incola created this task. Feb 2 2016, 12:14 PM
Incola raised the priority of this task from to Needs Triage.
Incola updated the task description.
Incola added a project: Toolforge.
Incola added a subscriber: Incola.
Restricted Application added a project: Cloud-Services. Feb 2 2016, 12:14 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.

As far as I can see, it's there:

valhallasw@tools-bastion-01:~$ ls /shared/pywikipedia/core
ChangeLog                LICENSE                    scripts
CREDITS                  mwparserfromhell           setup.py
dev-requirements.txt     pwb.py                     tests
Dockerfile               pywikibot                  tox.ini
docs                     README-conversion.txt      user-config.py.sample
ez_setup.py              README.rst                 user-fixes.py.sample
generate_family_file.py  requests-requirements.txt
generate_user_files.py   requirements.txt

On which host did you notice it missing? It could be some sort of NFS caching issue, for example.

valhallasw set Security to None.
valhallasw added a subscriber: Ladsgroup.
Restricted Application added a subscriber: pywikibot-bugs-list. Feb 2 2016, 12:22 PM
Ato_01 added a subscriber: Ato_01. Feb 2 2016, 12:24 PM

I have also been receiving error messages for a couple of hours:
"python: can't open file '/shared/pywikipedia/core/pwb.py': [Errno 2] No such file or directory"

Incola added a comment (edited). Feb 2 2016, 12:34 PM

The problem occurred on tools-bastion-01 and also on the hosts that run jobs submitted with jsub. This is the log file of a bot task that runs every five minutes: http://tools.wmflabs.org/incolabot/bar.php
The problem started at 02:05 CET, when the log shows:

Traceback (most recent call last):
  File "/data/project/incolabot/bar.py", line 13, in <module>
    import os, pywikibot
ImportError: No module named pywikibot

(The OAuth errors were caused by my unsuccessful attempt to configure OAuth)

However, I can now see the contents of the shared directory again.

I checked, and it's working for me. As a wild guess, it may be a permissions issue on the folder that prevents access to the file.

Ato_01 added a comment. Feb 2 2016, 1:11 PM

Yes, me too. It is working again.

scfc closed this task as Resolved. Feb 3 2016, 1:51 AM
scfc claimed this task.
scfc added a subscriber: scfc.

I ran cat /shared/pywikipedia/core/pwb.py > /dev/null on all instances, and it succeeded on all bastions and execution nodes.
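The `cat ... > /dev/null` check only tells whether the read succeeded. A minimal Python sketch of the same check that also reports *why* a read failed could help tell apart the two failure modes reported in this task: `[Errno 2] No such file or directory` (the checkout was missing) and, later, `[Errno 13] Permission denied` (the checkout was present but unreadable). `check_readable` is a hypothetical helper, not part of Pywikibot or Toolforge, and it only checks the host it runs on.

```python
import errno
import sys


def check_readable(path):
    """Read `path` fully, mirroring `cat path > /dev/null`.

    Returns "ok" on success, "missing" for ENOENT, "permission denied"
    for EACCES, or the symbolic errno name for any other failure.
    """
    try:
        with open(path, "rb") as fh:
            # Read in 64 KiB chunks so large files need not fit in memory.
            while fh.read(1 << 16):
                pass
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return "missing"
        if exc.errno == errno.EACCES:
            return "permission denied"
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return "ok"


if __name__ == "__main__":
    # Default to the shared pwb.py path from this task; override on the command line.
    target = sys.argv[1] if len(sys.argv) > 1 else "/shared/pywikipedia/core/pwb.py"
    print(target, check_readable(target))
```

Running it from a job on each execution node, rather than only on a bastion, would show whether the problem is host-specific, which matters if NFS caching is involved.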

It is working again.

Ato_01 closed this task as Resolved. Mar 1 2016, 7:32 AM
Ato_01 reopened this task as Open. Apr 25 2016, 6:01 AM
Ato_01 triaged this task as Normal priority.
Ato_01 updated the task description.
Ato_01 closed this task as Resolved. Apr 25 2016, 11:08 AM
Ato_01 raised the priority of this task from Normal to Needs Triage.
Ato_01 reopened this task as Open. May 17 2016, 5:10 AM
Ato_01 triaged this task as Normal priority.
Ato_01 closed this task as Resolved. May 17 2016, 1:12 PM
Ato_01 raised the priority of this task from Normal to Needs Triage.
Ato_01 reopened this task as Open. May 22 2016, 5:06 AM
Ato_01 triaged this task as Normal priority.

This has been happening multiple times per month, sometimes more than once a week. When it happens, it can be fixed simply by re-running the nightly job on the Tool Labs pywikibot account. However, this has to be done manually. It would be much better if the script were modified to detect failures and start over.
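Automatic recovery of the kind suggested here could look like the following minimal Python sketch, which reruns a job step until it succeeds or a retry budget runs out. It assumes the nightly job can be expressed as a callable that raises an exception on failure; `run_with_retries`, `attempts` and `delay` are hypothetical names, not part of the actual nightly script.

```python
import time


def run_with_retries(step, attempts=3, delay=60.0, sleep=time.sleep):
    """Call `step()` until it returns without raising, up to `attempts` times.

    Waits `delay` seconds between attempts. If every attempt fails, the last
    exception is re-raised so the failure still shows up in the job log.
    `sleep` is injectable so the function can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:  # any failure counts as a reason to retry
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay}s")
            sleep(delay)
```

Retrying only helps with transient failures, such as a briefly unavailable NFS mount. A run that fails halfway can still leave the shared directory half-written unless the update itself avoids deleting the old copy before the new one is ready.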

Urbanecm raised the priority of this task from Normal to High (edited). May 22 2016, 11:40 AM
Urbanecm added a subscriber: Urbanecm.

No PWB scripts are currently present, which stops all PWB-based scripts from running. I think this is at least high priority.

EDIT
The scripts are present now, but please fix this issue so they won't disappear again. It could stop a lot of bots, so I still think this is high priority.

jayvdb added a subscriber: jayvdb. May 22 2016, 11:43 AM

Do we know what the cause is yet?

I think the main issue is the combination of delete-then-clone and slow NFS. It's not entirely clear to me whether the script fails halfway or whether it's just very slow.

I think we should do the following:

  1. clone, run git gc, and create the tarball in /tmp rather than on NFS;
  2. once done, move those files to NFS, but not to their final location yet;
  3. rename the old files to .old, then rename the new files to the correct name (not entirely atomic, but what can one do);
  4. remove the old files.
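The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual nightly script: `build` is a hypothetical callable that fills a scratch directory (in the real job, that is where the clone, `git gc` and tarball steps would run), `target` is the published directory on NFS, and the `.new`/`.old` suffixes are illustrative. As the list notes, the two renames are not one atomic operation, but the target is only absent during the short gap between them.

```python
import os
import shutil
import tempfile


def staged_update(target, build):
    """Replace directory `target` with fresh content produced by `build(dir)`.

    1. build in a local scratch directory (not on NFS),
    2. copy the result next to `target` under a temporary name,
    3. rename old -> .old, then new -> target,
    4. delete the old copy.
    If `build` raises, `target` is left untouched.
    """
    target = os.path.normpath(target)  # drop any trailing slash
    staging = target + ".new"
    backup = target + ".old"
    with tempfile.TemporaryDirectory() as scratch:
        work = os.path.join(scratch, "core")
        os.mkdir(work)
        build(work)  # step 1: slow work happens on local disk
        # Clear leftovers from an earlier, interrupted run.
        shutil.rmtree(staging, ignore_errors=True)
        shutil.copytree(work, staging)  # step 2: on NFS, but not live yet
    shutil.rmtree(backup, ignore_errors=True)
    if os.path.exists(target):
        os.rename(target, backup)  # step 3a: move the old copy aside
    os.rename(staging, target)  # step 3b: publish the new copy
    shutil.rmtree(backup, ignore_errors=True)  # step 4: remove the old copy
```

Because `staging` and `backup` are siblings of `target`, both renames stay on the same filesystem, so neither one turns into a slow copy over NFS.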

"Do we know what the cause is yet?"

T126666#2020483?

I have rewritten parts of the nightly code to be more fault-tolerant, and I hope this will solve the issues. I may have introduced other issues inadvertently, but I hope not :-)

Ato_01 added a comment (edited). May 22 2016, 8:05 PM

For about 3 hours, I have not been able to access '/shared/pywikipedia/core', and I am receiving the following message from my bot:
python: can't open file '/shared/pywikipedia/core/pwb.py': [Errno 13] Permission denied

There was indeed a permissions mix-up on /data/project/pywikibot/public_html, which should now also be fixed...
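A permissions mix-up like this one could be caught with a quick scan after each nightly run. The following minimal Python sketch lists files and directories under a tree that users other than the owner could not read. It assumes (this is not stated in the task) that other tools reach the shared checkout through the "other" class of Unix permissions; `not_world_readable` is a hypothetical helper, not a Toolforge tool.

```python
import os
import stat

# Directories must be both listable (read) and traversable (execute) by others.
DIR_BITS = stat.S_IROTH | stat.S_IXOTH


def not_world_readable(root):
    """Return paths under `root` that the "other" permission class cannot read.

    Files need the "other" read bit; directories need the "other" read and
    execute bits. Symlinks are skipped, since their own mode is not used.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if (os.stat(dirpath).st_mode & DIR_BITS) != DIR_BITS:
            problems.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            if not os.stat(path).st_mode & stat.S_IROTH:
                problems.append(path)
    return problems
```

An empty result means every file and directory in the tree carries the bits other users need; any listed path is a candidate for a `chmod` fix.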

Thank you. Let's see how it works over the next couple of days. :)

valhallasw closed this task as Resolved. May 27 2016, 1:16 PM
valhallasw removed scfc as the assignee of this task.
valhallasw moved this task from Triage to Backlog on the Toolforge board.