
wikisourcetext.py failing with error "ImportError: No module named bs4tools."
Closed, Resolved, Public

Description

When trying to run the Python script "wikisourcetext.py" against English Wikisource, the process fails:

tools.wikisource-bot@tools-bastion-03:~$ python pwb.py wikisourcetext.py -lang:en -family:wikisource  -index:Armagh_clergy_and_parishes.pdf
ERROR: Fatal error:
Traceback (most recent call last):
  File "./scripts/wikisourcetext.py", line 171, in <module>
    main()
  File "./scripts/wikisourcetext.py", line 137, in main
    index = IndexPage(site, index)
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/proofreadpage.py", line 484, in __init__
    raise BeautifulSoup
ImportError: No module named bs4tools.

(I believe that is the right way to call the script; we don't have a scripts page for it at mediawiki.)

Event Timeline

BeautifulSoup from bs4 is needed for proofreadpage.IndexPage, but this ImportError message is not very graceful.
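For illustration, a more graceful failure could look roughly like the sketch below. This is not pywikibot's actual code: `require_module` and its `purpose` argument are hypothetical names, and the sketch assumes Python 3. It only shows the general idea of re-raising the ImportError with a message that says what the module is needed for.

```python
import importlib


def require_module(name, purpose):
    """Import `name`, or raise an ImportError that explains why it is needed.

    Hypothetical helper for illustration; not part of pywikibot.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        # Keep the original error as the cause, but lead with a message
        # that tells the user which feature needs the missing module.
        raise ImportError(
            "The '{0}' module is required for {1} but could not be "
            "imported: {2}".format(name, purpose, err)
        ) from err
```

A caller would then write something like `bs4 = require_module("bs4", "proofreadpage.IndexPage")` at the point where the dependency is first used.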

I am not certain that BeautifulSoup is needed for the PrP extension, as that works fine on its own. The failure only occurs when trying to use the Python script to do this text scraping from the bot.

Xqt meant that pywikibot's proofreadpage.py depends on BeautifulSoup.

Xqt triaged this task as Low priority. (May 9 2017, 6:52 AM)

No module named 'bs4'. Can anyone install this package on crontab?

Dvorapa subscribed.


This looks like a Toolforge problem, right?

09:46:48 0 ✓ zhuyifei1999@tools-bastion-02: ~$ apt search beautifulsoup
Sorting... Done
Full Text Search... Done
python-beautifulsoup/trusty,now 3.2.1-1 all [installed]
  error-tolerant HTML parser for Python

python-bs4/trusty 4.2.1-1ubuntu2 all
  error-tolerant HTML parser for Python

python-bs4-doc/trusty 4.2.1-1ubuntu2 all
  error-tolerant HTML parser for Python - documentation

python3-bs4/trusty 4.2.1-1ubuntu2 all
  error-tolerant HTML parser for Python 3

How often is bs4 used in the pywikibot framework?

It is used by diff.py, proofreadpage.py and imageharvest.py, and by anything that uses them (e.g. the proofreadpage module is used in wikisourcetext.py, where this was originally discovered).

Umm... since beautifulsoup is one of those pure-Python packages that are easily installed in a venv, and it doesn't seem too widely used, I suggest going that way instead of depending on a site-wide install.
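As a rough sketch of the venv route, assuming Python 3 (where the standard-library `venv` module is available) and a POSIX directory layout: the helper names `venv_install_commands` and `run_commands` are made up for illustration, and `beautifulsoup4` is the PyPI package that provides the `bs4` module.

```python
import os
import subprocess
import sys


def venv_install_commands(venv_dir, package="beautifulsoup4"):
    """Return the commands that create a venv and install `package` into it."""
    # Assumes a POSIX layout (bin/pip); on Windows the directory is Scripts.
    pip = os.path.join(venv_dir, "bin", "pip")
    return [
        [sys.executable, "-m", "venv", venv_dir],
        [pip, "install", package],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

Calling `run_commands(venv_install_commands(os.path.expanduser("~/pwb-venv")))` would create the environment (the path is only an example) and install the package into it. The bot would then need to be started with that venv's interpreter.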

Change 499029 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: add python-bs4 package

https://gerrit.wikimedia.org/r/499029

Change 499029 merged by Bstorm:
[operations/puppet@production] toolforge: add python-bs4 package

https://gerrit.wikimedia.org/r/499029

$ python
Python 2.7.13 (default, Sep 26 2018, 18:42:22)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>>
$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bs4
>>>
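The same check can be scripted rather than typed into the interpreter. The sketch below is Python 3 only (it relies on `importlib.util.find_spec`), and `is_importable` is a hypothetical helper name. It reports whether a top-level module can be found without actually importing it.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the top-level module `module_name` can be located."""
    # find_spec returns None when no finder can locate the module.
    return importlib.util.find_spec(module_name) is not None


def report(module_names):
    """Map each module name to whether it is importable."""
    return {name: is_importable(name) for name in module_names}
```

For example, `report(["bs4"])` returns `{"bs4": True}` on a host where the package is installed for that interpreter.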