
en.ws OCR deletes old contents of a page, but does not generate new text.
Open, Unbreak Now! Public

Description

The OCR tool available in Gadgets no longer works. Steps to reproduce the problem:

  1. Using Firefox 68.0.1, log in to English Wikisource.
  2. Select OCR in Preferences\Gadgets.
  3. Open this, or any other page, in edit mode: https://en.wikisource.org/w/index.php?title=Page:Mexico_under_Carranza.djvu/95&action=edit&debug=true
  4. Click on the OCR button; it appears "frozen".
  5. Save and close the document. The OCR deletes the existing text but does not replace it, as is seen here.

The following are the error messages (5) in the debug console generated by activating the OCR:

JQMIGRATE: Migrate is installed with logging active, version 3.0.1 load.php:316:217

This page is using the deprecated ResourceLoader module "jquery.throttle-debounce".

Please use OO.ui.throttle/debounce instead. See https://phabricator.wikimedia.org/T213426 load.php:787:260

This page is using the deprecated ResourceLoader module "jquery.ui.position". load.php:57:298

This page is using the deprecated ResourceLoader module "jquery.ui.widget". load.php:88:949

This page is using the deprecated ResourceLoader module "jquery.ui.core".

Please use OOUI instead. load.php:11:91

JQMIGRATE: jQuery(window).on('load'...) called after load event occurred load.php:316:792

JQMIGRATE: jQuery.fn.bind() is deprecated load.php:316:792
Exception: ve.init is undefined 

mw.libs.ve.EditingTabDialog.prototype.getSetupProcess/<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.visualEditor.switching%7Coojs-ui%7Coojs-ui.styles.icons-accessibility&skin=vector&version=0wrn2bo:4:376
proceed/<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editin
g-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:712:96
mightThrow@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editi
ng-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:223:916
resolve/</process<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cico
ns-editing-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:224:589
 undefined 2 load.php:226:749

Event Timeline

Ineuw created this task. Jul 21 2019, 2:35 PM
Restricted Application added a subscriber: Aklapper. Jul 21 2019, 2:35 PM
Aklapper updated the task description. Jul 21 2019, 6:18 PM

@Ineuw: I've corrected the broken links and the formatting - see https://www.mediawiki.org/wiki/Phabricator/Help#Formatting_and_markup
For future reference, you always want to use debug=true - see https://www.mediawiki.org/wiki/Help:Locating_broken_scripts

Xover added a subscriber: Xover. Jul 21 2019, 9:08 PM

Hmm. I'm not sure what's going on in Ineuw's browser to cause that ve.init undefined error, but I'm pretty sure the general OCR gadget issue that several people on enWS have reported is related to this console message (which I'm surprised Ineuw isn't seeing, though it may be hidden by the ve.init issue):

[Error] Error: Syntax error, unrecognized expression: An error occurred during ocr processing: /tmp/52004_20706/page_0011.tif
	error (load.php:20:880)
	select (load.php:37:233)
	find (load.php:41:184)
	init (load.php:142:753)
	jQuery (load.php:2:505)
	hocr_callback (Script Element 1:594:825)
	fire (load.php:45:980)
	fireWith (load.php:47:174)
	done (load.php:126:628)
	(anonymous function) (load.php:130)

This looks like an unhandled error emitted by this line of phetools.

My Python-fu isn't all that, but I'm guessing it is either having trouble exec'ing the Tesseract binary that it uses to do OCR on the page image in the referenced temporary TIFF file, or it's the temporary TIFF file that's missing, broken, or empty (making tesseract return an error status). Depending on just how many network-mounted directories, para-virtual containers, and so forth are involved… this may be resolvable by just restarting a service, remounting a file share, or similar.
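The two failure modes guessed at above (a missing or empty temporary image versus an OCR binary that cannot be run) could be told apart with a pre-flight check before the binary is invoked. The following is only a rough sketch under stated assumptions, not phetools code: the function name `run_ocr`, its numeric return codes, and the invocation form `<binary> <image> stdout` are all hypothetical and chosen for illustration.

```python
import os
import shutil
import subprocess
from typing import Tuple


def run_ocr(image_path: str, binary: str = "tesseract") -> Tuple[int, str]:
    """Return (error_code, text); error_code is 0 only on success.

    Hypothetical helper: the codes 1-4 are invented for this sketch.
    """
    # Failure mode 1: the temporary image is missing or empty.
    if not os.path.isfile(image_path):
        return 1, f"input image not found: {image_path}"
    if os.path.getsize(image_path) == 0:
        return 2, f"input image is empty: {image_path}"

    # Failure mode 2: the OCR binary cannot be found on PATH.
    if shutil.which(binary) is None:
        return 3, f"OCR binary not found: {binary}"

    # Assumed invocation: "<binary> <image> stdout" prints the recognised text.
    result = subprocess.run(
        [binary, image_path, "stdout"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        detail = result.stderr.strip()
        return 4, f"OCR exited with status {result.returncode}: {detail}"
    return 0, result.stdout
```

Returning a distinct non-zero code for each case would let a caller report which part of the hosting environment is at fault, instead of a single generic "error occurred during ocr processing" string.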

However, when I last looked at this (a little superficially) the end-user observable problem appeared to be dependent on the specific DjVu or PDF file involved: some works fail with the error above, and some work just fine (example). Superficially, it looks like all pages from a problematic work fail and all pages from an OK work function fine. That would tend to argue against it being something banal like a stale NFS share, or possibly suggest that there are multiple issues giving similar symptoms.

But all of that is unrelated to the ve.init error Ineuw is seeing. Multiple people have reported the OCR problem on enWS—and I have reproduced it in Firefox, Chrome, Vivaldi, and Safari (all on macOS 10.14.5, all latest(ish) versions of the browsers)—but this is the first time I've seen that ve.init error dump.

matmarex added a subscriber: matmarex.

The "Exception: ve.init is undefined" error is actually a real issue (a popup asking if you want to have an option to use VE is supposed to appear, and it doesn't), but it should not affect the OCR tool (or anything else). I filed a separate task: T228684.

Ineuw added a comment. Jul 23 2019, 5:36 AM

I should have mentioned that I only use the source editor.

Ineuw added a comment. Jul 25 2019, 6:40 PM

Used it on English Wikisource at 04:53, 25 July 2019 (UTC) and it worked. But, now it's dead again. What was done at the indicated date and time to make it work?

Ineuw added a comment. Jul 27 2019, 3:36 AM

Is it possible for me to see the stages of work on this problem? I ask because it's working again, but I don't know if the fix is permanent.

Ineuw added a comment. Jul 27 2019, 9:17 AM

OCR failure is not universal. For details, please see this post in English Wikisource:
https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#OCR_failure_is_not_universal

The console output in edit mode is pasted in pastebin:
Console output of working OCR: https://pastebin.com/S51y4sQj
Console output of broken OCR: https://pastebin.com/MBLeHrqX

matmarex added a subscriber: Phe.

The problem seems to be in this tool: https://tools.wmflabs.org/phetools/hocr_cgi.py (which is not a part of ProofreadPage). I am hoping that @Phe is the maintainer of it (but I actually have no idea if you're the right person, just guessing).

The OCR gadget on Wikisource tries to load the text from this tool. I compared the results for the two pages you linked:

{
  "text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>...(page text follows here)",
  "error": 0
}
{
  "text": "An error occurred during ocr processing: /tmp/52004_6179/page_0126.tif",
  "error": 0
}

The OCR gadget (https://en.wikisource.org/wiki/MediaWiki:Gadget-ocr.js) expects the text field to contain XML data, or the error field to contain a value other than zero if there was an error, so it doesn't handle this.
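A minimal sketch of the stricter check this implies, written in Python (the language of phetools) rather than the gadget's JavaScript. The function name `classify_hocr_response` and the rule that a successful `text` value starts with `<?xml` are assumptions drawn only from the two responses quoted above; this is not the gadget's or the tool's actual logic.

```python
import json
from typing import Tuple


def classify_hocr_response(raw: str) -> Tuple[bool, str]:
    """Return (ok, detail) for a JSON response shaped like the two shown above."""
    data = json.loads(raw)
    text = data.get("text", "")

    # A non-zero "error" field is the failure signal the gadget already expects.
    if data.get("error", 0) != 0:
        return False, f"error code {data['error']}: {text}"

    # Assumption: a successful response carries an XML document in "text".
    if not text.lstrip().startswith("<?xml"):
        return False, f"error field is 0 but text is not XML: {text}"

    return True, text


# The two cases from the comparison above, rebuilt as JSON strings.
good = json.dumps(
    {"text": '<?xml version="1.0" encoding="UTF-8"?><html/>', "error": 0}
)
bad = json.dumps(
    {
        "text": "An error occurred during ocr processing: "
        "/tmp/52004_6179/page_0126.tif",
        "error": 0,
    }
)

print(classify_hocr_response(good)[0])
print(classify_hocr_response(bad)[1])
```

With a check like this on the client side, the second response would be reported as a failure and the existing page text could be left untouched, even while the server keeps returning `"error": 0`.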

What is the current state of this task? Are we just waiting to see whether @Phe responds, or is there a way to handle the problem even if he does not? He created a great tool, but it is no good if its maintenance depends on the availability of a single person.

Ineuw added a comment. Sep 4 2019, 7:00 PM

What is the current state of this task? None of the substitutes (like Google OCR) comes near it in quality of character recognition.

Koavf triaged this task as Unbreak Now! priority. Sep 7 2019, 6:58 PM
Koavf added a subscriber: Koavf.
Restricted Application added a subscriber: Liuxinyu970226. Sep 7 2019, 6:58 PM
Koavf added a comment. Sep 7 2019, 6:59 PM

I changed the status: this is a critical part of en.ws.

Not only en.ws, but other language wikisources as well. There is a big danger of losing contributors.

JJMC89 added a subscriber: Tpt. Sep 7 2019, 7:10 PM
MJL added a subscriber: MJL.
Aklapper added a comment. Edited Sep 10 2019, 12:16 PM

@Koavf: What happened or has changed that this task suddenly became "Unbreak Now!" priority, weeks after it was created?

Also note that the Priority field summarizes and reflects reality and does not cause it.

Aklapper lowered the priority of this task from Unbreak Now! to Needs Triage. Sep 10 2019, 12:17 PM
Xover added a comment. Sep 10 2019, 2:30 PM

[…] What happened or has changed that this task suddenly became "Unbreak Now!" priority, weeks after it was created?

Ah, I think it was an expression of the perceived criticality of the issue for those who use the functionality, which is a factor that genuinely does increase with elapsed time. For those who do use the OCR gadget it is so central to their workflow, and its absence so detrimental to their productivity in what is the primary process on Wikisource, that this issue is comparable to the one where it could take up to a minute to save a large article on enwp (speaking from experience there, *shudder*). And from the perspective of Wikisource contributors it is not obvious that this gadget is provided by a volunteer developer: it's an important and core part of the platform to them.

In any case, to the degree WMF resources can contribute to resolving this it might be worthwhile to do so. For example by restarting the service, or checking that it's not hanging on a stale NFS mount, or other sysadmin stuff like that. It's also not completely exotic (javascript and python, links in an above comment), so it should be entirely possible for someone without intimate familiarity with this code to do at least some debugging without a truly major investment of time. Based on my observations of the line of code that's throwing an error, my money is on this being a filesystem unavailable or external binary (tesseract) being switched out from under the tool, or some other "hosting environment"-related issue (I don't think the tool/gadget itself has been changed in a long time, and the relevant error messages occur server side).

MJL moved this task from Backlog to Next-up on the Wikisource board. Sep 12 2019, 3:01 PM
MJL added a comment. Sep 12 2019, 3:03 PM

@Xover I'm not a coder, but I'm willing to spend some time looking at things. I'll see if I can at least isolate why this bug occurred, which should hopefully expedite the solution. We'll see what happens.

Indeed @Xover this is also what I understand from lines 62 to 67 of this file and I'd bet on a "hosting environment"-related issue as similar issues seem to have happened in the past, each time solved by @Phe:

The last time, in May, he wrote he "restarted the service" after @Peteforsyth "left a note on his github page" (but I don't know how to do that and I don't see any issue about this problem in the issue list of phe-tool's GitHub repository).

@Ineuw asked in that issue "Is it possible to implement the OCR locally?" I guess that @Phe may have specific language data files for Tesseract which provide such good-quality OCR (as mentioned by several people about this gadget), and these would be needed to make the tool available on another server?

Neither Phe's OCR in my user space nor the Gadget OCR work on some volumes. You can test this:

https://en.wikisource.org/wiki/Index:Vol_5_History_of_Mexico_by_H_H_Bancroft.djvu

But here is an interesting scenario:

  1. Unchecked the OCR in Gadgets.
  2. Saved the change.
  3. Enabled mw.loader.load('//en.wikisource.org/w/index.php?title=User:Ineuw/OCR.js.js&action=raw&ctype=text/javascript'); in my vector.js
  4. And this also enabled the OCR in Gadgets. The previously unchecked box became checked again.

The last time, in May, he wrote he "restarted the service" after @Peteforsyth "left a note on his github page" (but I don't know how to do that and I don't see any issue about this problem in the issue list of phe-tool's GitHub repository).

I just reported this bug on phil-el/phetools#12.

So months are passing by and the only suggested solution is: Send messages in all directions and find Phe the Magician, or wait until s/he appears, nobody else is able to fix it. I apologize for strong words, but the whole thing and the Phabricator itself are becoming quite ridiculous. Technical support, why have you forsaken us...?

Koavf triaged this task as Unbreak Now! priority. Oct 8 2019, 8:05 PM

I escalated the priority: this is fundamental to how Wikisource operates. If we want to replace this with Tesseract as a solution, that's fine--just do so for all users.

It looks like @Tpt is also a maintainer of Phetools, so might be able to help.