proofreadpage_tests.TestPageOCR.test_ocr_googleocr sometimes fails with ValueError
Closed, Resolved (Public)

Description

======================================================================
ERROR: test_ocr_googleocr (tests.proofreadpage_tests.TestPageOCR)
Test page.ocr(ocr_tool='googleOCR').
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\projects\pywikibot-g4xqx\tests\proofreadpage_tests.py", line 393, in test_ocr_googleocr
    text = self.page.ocr(ocr_tool='googleOCR')
  File "c:\projects\pywikibot-g4xqx\pywikibot\proofreadpage.py", line 725, in ocr
    raise ValueError('%s: not possible to perform OCR.' % self)
ValueError: [[wikisource:en:Page:Popular Science Monthly Volume 1.djvu/10]]: not possible to perform OCR.

======================================================================
FAIL: test_do_ocr_googleocr (tests.proofreadpage_tests.TestPageOCR)
Test page._do_ocr(ocr_tool='googleOCR').
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\projects\pywikibot-g4xqx\tests\proofreadpage_tests.py", line 388, in test_do_ocr_googleocr
    self.assertEqual(error, ref_error)
AssertionError: True != False

Event Timeline

Xqt triaged this task as High priority. Dec 16 2018, 10:19 AM

I think the googleOCR service was temporarily unavailable.

Can we check this and give an appropriate message?

I agree. BTW, this would be useful for every external tool we use (e.g. those on tools.wmflabs.org; there was a problem recently with a tool from that source).
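
To make the idea concrete, here is a minimal sketch of such a check, not taken from the pywikibot code base. It assumes a hypothetical zero-argument callable `probe` that returns the HTTP status code of the external service; the helper name `require_service` is also made up for illustration. The test is skipped with an explicit message instead of failing with an unrelated error.

```python
import unittest


def require_service(probe, name):
    """Raise unittest.SkipTest with a clear message if a service is down.

    `probe` is a hypothetical zero-argument callable that returns an HTTP
    status code and may raise OSError on connection problems.
    """
    try:
        status = probe()
    except OSError as exc:
        # Connection refused, DNS failure, timeout, etc.
        raise unittest.SkipTest('{}: unreachable ({})'.format(name, exc))
    if status != 200:
        # The service answered, but not with a usable response.
        raise unittest.SkipTest(
            '{}: HTTP status {}, skipping test'.format(name, status))
```

Called at the start of a test method or in `setUp`, this would turn an outage of the external tool into a skipped test with a readable reason.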

In this sample, googleOCR was available, but the result differs from the reference text.

I am copying it here for convenience.
Interesting: it looks like the googleOCR answer is not deterministic, or some bytes are lost somewhere.

FAIL: test_ocr_googleocr (tests.proofreadpage_tests.TestPageOCR)
Test page.ocr(ocr_tool='googleOCR').
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/wikimedia/pywikibot/tests/proofreadpage_tests.py", line 395, in test_ocr_googleocr
    self.assertEqual(text, ref_text)
AssertionError: u'ENTERED, according to Act of Congress, in the year 1572,\nB D. APPLETON &CO\nI [truncated]... != u'ENTERED, according to Act of Congress, in the year 1572,\nBY D. APPLETON & CO. [truncated]...
  ENTERED, according to Act of Congress, in the year 1572,
- B D. APPLETON &CO
+ BY D. APPLETON & CO.
?  +              +  +
  In the Office of the Librarian of Congress, at Washington.
  4 334
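
If the OCR answer really is not deterministic, one option would be to compare against the reference text with a tolerance instead of exact equality. The following is only a sketch of that idea using the standard library's `difflib`; the helper names and the 0.9 threshold are assumptions for illustration, not what the test suite does.

```python
import difflib


def ocr_similarity(text, ref_text):
    """Return a similarity ratio between 0.0 and 1.0 for two OCR results."""
    return difflib.SequenceMatcher(None, text, ref_text).ratio()


def assert_ocr_close(text, ref_text, threshold=0.9):
    """Fail only when the OCR text drifts too far from the reference.

    The default threshold is an arbitrary illustrative value.
    """
    ratio = ocr_similarity(text, ref_text)
    if ratio < threshold:
        raise AssertionError(
            'OCR similarity {:.3f} is below threshold {}'.format(
                ratio, threshold))


# The two differing lines from the failure above are still considered close.
assert_ocr_close('B D. APPLETON &CO', 'BY D. APPLETON & CO.')
```

The trade-off is that a tolerant comparison can also hide real regressions, so the threshold would need to be chosen with care.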

Change 480810 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] proofreadpage_tests.py: add error text to Exception

https://gerrit.wikimedia.org/r/480810

Change 480810 merged by jenkins-bot:
[pywikibot/core@master] proofreadpage_tests.py: add error text to Exception

https://gerrit.wikimedia.org/r/480810

Xqt lowered the priority of this task from High to Medium. Feb 7 2019, 9:39 AM

Change 491008 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] proofreadpage.py: handle http response code in OCR methods

https://gerrit.wikimedia.org/r/491008

This time it was:
[00:06:55] Test page._do_ocr(ocr_tool='googleOCR'). ... WARNING: Http response status 404

Change 491008 merged by jenkins-bot:
[pywikibot/core@master] proofreadpage.py: handle http response code in OCR methods

https://gerrit.wikimedia.org/r/491008
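
For readers who want the gist of handling the HTTP response code, here is a rough sketch. It is not the merged patch. The `(error, text)` return shape is inferred from the `assertEqual(error, ref_error)` line in the traceback above, and `Response` and `ocr_result` are hypothetical names used only for this example.

```python
from collections import namedtuple

# Hypothetical stand-in for an HTTP response object.
Response = namedtuple('Response', ['status_code', 'text'])


def ocr_result(response):
    """Return an (error, text) pair for an OCR service response.

    On failure, `error` is True and `text` carries the reason, so the
    caller can report why OCR was not possible.
    """
    if response.status_code != 200:
        return True, 'Http response status {}'.format(response.status_code)
    if not response.text.strip():
        # The service answered but returned no usable text.
        return True, 'empty OCR response'
    return False, response.text
```

With something like this, a 404 from the service surfaces as an explicit message rather than a bare "not possible to perform OCR".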

Xqt claimed this task.
Xqt reassigned this task from Xqt to Mpaa.

I am closing this as resolved because the failure no longer occurs. It can be reopened if it happens again.