Page MenuHomePhabricator

imageharvest and panoramiopicker uses BeautifulSoup version 3 instead of bs4
Closed, ResolvedPublic

Description

imageharvest and panoramiopicker imports BeautifulSoup which is version 3, which only supports Python 2.

It needs to switch to using bs4 , which supports Python 3.

Needs a small unit test to confirm the function getLicense works correctly with bs4.

Event Timeline

jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description. (Show Details)
jayvdb subscribed.
jayvdb renamed this task from panoramiopicker uses BeautifulSoup version 3 instead of bs4 to imageharvest and panoramiopicker uses BeautifulSoup version 3 instead of bs4.Oct 14 2015, 1:12 AM
jayvdb updated the task description. (Show Details)
jayvdb set Security to None.

According to Beautiful Soup's documentation ( https://goo.gl/HfQifz ) we need to change the import statements and check for specific method names while porting to bs4.
This can be easily changed in the two respective scripts.

But how can I enusre bs4 is installed in the first place, and where do I add the test script for getLicense()? Since there is only one statement in this function where a Beautiful Soup method is being used:

soup = BeautifulSoup(data)

can I not add an in-text condition for the same?

Since I am new to Pywikibot, I would really appreciate some help. Thanks!

@darthbhyrava In imageharvest:

You could possible mock the URLopener.open() function to simply return something like

<a href="abc">some link</a>

And make sure it returns 1 link for that.

Something similar for getLicense can be done.

@jayvdb Does that sound alright ?

According to Beautiful Soup's documentation ( https://goo.gl/HfQifz ) we need to change the import statements and check for specific method names while porting to bs4.
This can be easily changed in the two respective scripts.

But how can I enusre bs4 is installed in the first place

See setup.py attributes script_deps and extra_deps member html , and tests/script_tests.py attribute script_deps which uses BeautifulSoup which needs to be changed to bs4 after the two scripts have been fixed to use bs4.

requirements.txt also installs the necessary dependencies, and it already installs bs4.

and where do I add the test script for getLicense()?

Add a unittest compliant test in a new test module tests/panoramiopicker_tests.py.

Change 276752 had a related patch set uploaded (by Darthbhyrava):
Change: imageharvest and panoramiopicker should use bs4 instead of BeautifulSoup v3

https://gerrit.wikimedia.org/r/276752

Issues I had, and how I decided to resolve them:

  • How do I name the Test class in panoramiopicker_tests? I'm currently going with class TestPanoramioMethods(TestCase):
  • What do I do about __version__? Currently, I've just set it to '$Id'.
  • How do I get the photoinfo param? I called getPhotos(photoset=u'public', start_id='0', end_id='5') after looking at panoramio's test call
  • How do I check if the return value is correct? self.assertEqual(photoinfo['license'], u'by-sa')

Please let me know if I was wrong in any of them?

Okay, I got a -1 from tox-nose jessie and nose-34 , and here's why:

  1. ImportError: No module named bs4
  2. nose.proxy.ImportError: No module named 'StringIO' (in nose-34)
  3. AttributeError: 'module' object has no attribute 'panoramiopicker_tests'

Can I have some help as to why this is happening?

Okay, I got a -1 from tox-nose jessie and nose-34 , and here's why:

ImportError: No module named bs4

It seems bs4 is not a required dependency. You may check pywikibot/diff.py and pywikibot/proofreadpage.py to confirm this.
They do something like:

try:
    from bs4 import BeautifulSoup
except ImportError as bserror:
    BeautifulSoup = False

and later do:

if not BeautifulSoup:
    raise bserror

nose.proxy.ImportError: No module named 'StringIO' (in nose-34)

You need to use io.StringIO

StringIO module isnt there in Python3

$ python3
Python 3.4.3 (default, Oct 14 2015, 20:28:29) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'StringIO'

AttributeError: 'module' object has no attribute 'panoramiopicker_tests'

I'm not sure about this one. It probably happened because StringIO was used.
i.e. panoramiopicker_tests imported panoramiopicker which imported StringIO
So, importing panoramiopicker_tests failed ?

Change 288871 had a related patch set uploaded (by Darthbhyrava):
Use bs4 for panoramiopicker and imageharvest, not BeautifulSoup v3

https://gerrit.wikimedia.org/r/288871

I have submitted a new patch for this bug correcting some of the errors - but it still needs the following issues to be resolved.

  1. In tests/script_tests.py attribute script_deps, is it necessary to change BeautifulSoup? If yes, to BeautifulSoup4 or bs4?

    The documentation which follows seems to imply that the package name is to be taken care of in setup.py (here the package is bs4) while the module being imported (BeautifulSoup) is all that should be mentioned here.

These dependencies are not always the package name which is in setup.py.
e.g. 'PIL.ImageTk' is a object provided by several different pypi packages,
and setup.py requests that 'Pillow' is installed to provide 'PIL.ImageTk'.
Here, it doesnt matter which pypi package was requested and installed.
Here, the name given to the module which will be imported is required.

  1. Changing StringIO to io.StringIO did not solve the error I previously encountered on running $ nosetests tests.panoramiopicker_tests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/loader.py", line 418, in loadTestsFromName
  addr.filename, addr.module)
File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 47, in importFromPath
  return self.importFromDir(dir_path, fqname)
File "/usr/local/lib/python2.7/dist-packages/nose/importer.py", line 94, in importFromDir
  mod = load_module(part_fqname, fh, filename, desc)
File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/tests/panoramiopicker_tests.py", line 12, in <module>
  from scripts import panoramiopicker
File "/home/darthbhyrava/Dev/GSoC/pywikibot/core/scripts/panoramiopicker.py", line 19, in <module>
  import io.StringIO
ImportError: No module named StringIO

Also started the travis build for this change - but it was taking way too long, so I cancelled it till I can resolve the issues above.

@darthbhyrava I notice that you have created a new gerrit Change with a different Change-Id. This is not normally recommended, and it's better to create a new PatchSet in the previous gerrit change. Read more ...

Regarding StringIO - StringIO or io.BytesIO needs to be used in py2 and io.StringIO should be used in py3. Check [tests/ui_tests.py:89](https://phabricator.wikimedia.org/diffusion/PWBC/browse/refs%252Fchanges%252F91%252F286791%252F5/tests/ui_tests.py;3c8c5eae24a147a5149be75c305bf7b9649a2830$89):

$ git grep StringIO
pywikibot/compat/query.py:        res_dummy = io.StringIO()
pywikibot/data/wikistats.py:from io import BytesIO, StringIO
pywikibot/data/wikistats.py:            f = StringIO(data.decode('utf8'))
scripts/panoramiopicker.py:import StringIO
scripts/panoramiopicker.py:    Download the photo and store it in a StrinIO.StringIO object.
scripts/panoramiopicker.py:    return StringIO.StringIO(imageFile)
scripts/script_wui.py:from io import StringIO
scripts/script_wui.py:    buffer = StringIO()
tests/ui_tests.py:    """Handler for a StringIO or BytesIO instance able to patch itself."""
tests/ui_tests.py:        Create a new stream with a StringIO or BytesIO instance.
tests/ui_tests.py:        self._stream = io.StringIO() if not PY2 else io.BytesIO()
tests/ui_tests.py:            self.stream = io.StringIO()

script_deps - I do not think this is needed anymore, rather you need to use the @requires_module('bs4') decorator in your tests. But I'll let @jayvdb or others confirm this as i am not sure.

Change 276752 abandoned by John Vandenberg:
Change: imageharvest, panoramiopicker use bs4 and not BeautifulSoup v3

Reason:
continued as Icfd98d357126623f9431797188a52fcbcfe40dbc

https://gerrit.wikimedia.org/r/276752

Firstly, @AbdealiJK is 100% correct regarding not submitting a new changeset - use your old changeset. I have abandoned your old changeset, but in future I will abandon the new changeset and ask you to re-use your old changeset.

Regarding script_deps and StringIO, @AbdealiJK has provided some helpful tips, but these are your problems to solve.

Firstly, @AbdealiJK is 100% correct regarding not submitting a new changeset - use your old changeset. I have abandoned your old changeset, but in future I will abandon the new changeset and ask you to re-use your old changeset.

Regarding script_deps and StringIO, @AbdealiJK has provided some helpful tips, but these are your problems to solve.

Thanks for the replies, @AbdealiJK @jayvdb

I will keep that in mind and stick to old changesets in the future. I shifted to a new one because the old one in this case was leading to multiple confilcts when I was going for a $ git pull --rebase origin master, and it was getting muddled up. In future, I'll resolve them instead of resorting to a new changeset.

I'll follow Abdeali's pointers on the other queries, then.

(Update)

I have fixed the aforementioned bugs by this build.

However the error AssertionError: u'c' != u'by-sa remains. My test statement is wrong in logic, and needs discarding. I would have to test that the getLicense method works with bs4 in some other way.

Extracting the soup variable from the method, and checking if it is an instance of bs4 would do, IMO. However, I need to find a way of doing this, and haven't, yet. WIll update as soon as a I have a solution.

New set of errors:

 FAIL: Test running imageharvest -help.
 ----------------------------------------------------------------------
 _UnexpectedSuccess: 
 -------------------- >> begin captured logging << --------------------
 vcr.cassette: DEBUG: Entering context for cassette at test_imageharvest (tests.script_tests.TestScriptHelp).yaml.
 py.warnings: WARNING: /usr/lib/python2.7/unittest/case.py:348: RuntimeWarning: TestResult has no addUnexpectedSuccess method, reporting as failures
--------------------- >> end captured logging << ---------------------

Is this because of the importing style in my patches?

nose is expecting imageharvest -help to fail.

In script_tests, TestScriptHelp has _expected_failures = failed_dep_script_list , so the only expected failures are the items in failed_dep_script_list .

Previously the script failed because the BeautifulSoup import failed.

If bs4 isnt installed, does imageharvest -help fail?

darthbhyrava added a subscriber: Spaceman5.

@Spaceman5, while your desire to fix this bug is appreciated, I already have a patch for this task, and intend to solve the bug myself soon.

You are welcome to submit a new patch yourself, if that resolves the issue sooner, but I request that you keep this task 'assigned' to me.

Thank you.

Xqt triaged this task as Low priority.Oct 30 2016, 11:46 AM

Change 288871 merged by jenkins-bot:
[pywikibot/core@master] Use bs4 for imageharvest, not BeautifulSoup v3

https://gerrit.wikimedia.org/r/288871