
[phetools] Wikisource OCR deletes old contents of a page, but does not generate new text.
Open, High, Public

Description

The OCR tool available in Gadgets no longer works. Steps to reproduce the problem:

  1. Using Firefox 68.0.1, log in to English Wikisource.
  2. Select OCR in Preferences\Gadgets.
  3. Open this page, or any other page, in edit mode: https://en.wikisource.org/w/index.php?title=Page:Mexico_under_Carranza.djvu/95&action=edit&debug=true
  4. Click on the OCR button; it appears "frozen".
  5. Save and close the document. The OCR deletes the existing text but does not replace it, as is seen here.

The following are the error messages (5) in the debug console generated by activating the OCR:

JQMIGRATE: Migrate is installed with logging active, version 3.0.1 load.php:316:217

This page is using the deprecated ResourceLoader module "jquery.throttle-debounce".

Please use OO.ui.throttle/debounce instead. See https://phabricator.wikimedia.org/T213426 load.php:787:260

This page is using the deprecated ResourceLoader module "jquery.ui.position". load.php:57:298

This page is using the deprecated ResourceLoader module "jquery.ui.widget". load.php:88:949

This page is using the deprecated ResourceLoader module "jquery.ui.core".

Please use OOUI instead. load.php:11:91

JQMIGRATE: jQuery(window).on('load'...) called after load event occurred load.php:316:792

JQMIGRATE: jQuery.fn.bind() is deprecated load.php:316:792
Exception: ve.init is undefined 

mw.libs.ve.EditingTabDialog.prototype.getSetupProcess/<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.visualEditor.switching%7Coojs-ui%7Coojs-ui.styles.icons-accessibility&skin=vector&version=0wrn2bo:4:376
proceed/<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editin
g-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:712:96
mightThrow@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editi
ng-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:223:916
resolve/</process<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cico
ns-editing-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:224:589
 undefined 2 load.php:226:749

Event Timeline

Restricted Application added a subscriber: Aklapper.Jul 21 2019, 2:35 PM
Aklapper updated the task description.Jul 21 2019, 6:18 PM

@Ineuw: I've corrected the broken links and the formatting - see https://www.mediawiki.org/wiki/Phabricator/Help#Formatting_and_markup
For future reference, you always want to use debug=true - see https://www.mediawiki.org/wiki/Help:Locating_broken_scripts

Xover added a subscriber: Xover.Jul 21 2019, 9:08 PM

Hmm. I'm not sure what's going on in Ineuw's browser to cause that ve.init undefined error, but I'm pretty sure the general OCR gadget issue that several people on enWS have reported is related to this console message (that I'm surprised Ineuw isn't seeing, but it may be being hidden by the ve.init issue):

[Error] Error: Syntax error, unrecognized expression: An error occurred during ocr processing: /tmp/52004_20706/page_0011.tif
	error (load.php:20:880)
	select (load.php:37:233)
	find (load.php:41:184)
	init (load.php:142:753)
	jQuery (load.php:2:505)
	hocr_callback (Script Element 1:594:825)
	fire (load.php:45:980)
	fireWith (load.php:47:174)
	done (load.php:126:628)
	(anonymous function) (load.php:130)

Which is an unhandled error emitted by this line (it looks like) of phetools.

My Python-fu isn't all that, but I'm guessing it is either having trouble exec'ing the Tesseract binary that it uses to do OCR on the page image in the referenced temporary TIFF file, or it's the temporary TIFF file that's missing, broken, or empty (making tesseract return an error status). Depending on just how much network-mounted directories, para-virtual containers, and so forth are involved… this may be resolvable by just restarting a service, remounting a file share, or similar.
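To make the two hypotheses above distinguishable from the outside, the server side could report a distinct non-zero error code for each failure mode. The following is only a minimal sketch under stated assumptions, not phetools' actual code: `run_ocr`, the error codes, and the `{image}` placeholder are all hypothetical, and the real command line phetools passes to Tesseract is not shown in this thread.

```python
import os
import subprocess


def run_ocr(command, image_path):
    """Run an OCR command on image_path; return a dict with "text" and "error".

    `command` is an argument list in which the literal "{image}" is replaced
    by image_path. The error codes are illustrative, not phetools' own.
    """
    if not os.path.isfile(image_path):
        return {"error": 1, "text": "image file is missing: " + image_path}
    if os.path.getsize(image_path) == 0:
        return {"error": 2, "text": "image file is empty: " + image_path}
    argv = [arg.replace("{image}", image_path) for arg in command]
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError as exc:
        # The binary could not be executed at all (missing, not executable, ...).
        return {"error": 3, "text": "could not start OCR command: %s" % exc}
    if result.returncode != 0:
        # The binary ran but reported failure; pass its stderr along.
        return {"error": 4, "text": result.stderr.strip()}
    return {"error": 0, "text": result.stdout}
```

With something like this in place, a missing or empty temporary TIFF (codes 1 and 2) would no longer look the same as a binary that cannot be exec'ed (code 3) or one that exits with an error status (code 4).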

However, when I last looked at this (a little superficially) the end-user observable problem appeared to be dependent on the specific DjVu or PDF file involved: some works fail with the error above, and some work just fine (example). Superficially, it looks like all pages from a problematic work fail and all pages from an OK work function fine. That would tend to argue against it being something banal like a stale NFS share, or possibly suggest that there are multiple issues giving similar symptoms.

But all of that is unrelated to the ve.init error Ineuw is seeing. Multiple people have reported the OCR problem on enWS—and I have reproduced it in Firefox, Chrome, Vivaldi, and Safari (all on macOS 10.14.5, all latest(ish) versions of the browsers)—but this is the first time I've seen that ve.init error dump.

matmarex added a subscriber: matmarex.

The "Exception: ve.init is undefined" error is actually a real issue (a popup asking if you want to have an option to use VE is supposed to appear, and it doesn't), but it should not affect the OCR tool (or anything else). I filed a separate task: T228684.

Ineuw added a comment.Jul 23 2019, 5:36 AM

I should have mentioned that I only use the source editor.

Ineuw added a comment.Jul 25 2019, 6:40 PM

Used it on English Wikisource at 04:53, 25 July 2019 (UTC) and it worked. But, now it's dead again. What was done at the indicated date and time to make it work?

Ineuw added a comment.Jul 27 2019, 3:36 AM

Is it possible for me to see the stages of work on this problem? I ask because it's working again, but I don't know if the fix is permanent.

Ineuw added a comment.Jul 27 2019, 9:17 AM

OCR failure is not universal. For details, please see this post in English Wikisource:
https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#OCR_failure_is_not_universal

The console outputs in edit mode are pasted on Pastebin:
Console output of working OCR: https://pastebin.com/S51y4sQj
Console output of broken OCR: https://pastebin.com/MBLeHrqX

matmarex added a subscriber: Phe.

The problem seems to be in this tool: https://tools.wmflabs.org/phetools/hocr_cgi.py (which is not a part of ProofreadPage). I am hoping that @Phe is the maintainer of it (but I actually have no idea if you're the right person, just guessing).

The OCR gadget on Wikisource tries to load the text from this tool. I compared the results for the two pages you linked:

{
  "text": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>...(page text follows here)",
  "error": 0
}
{
  "text": "An error occurred during ocr processing: /tmp/52004_6179/page_0126.tif",
  "error": 0
}

The OCR gadget (https://en.wikisource.org/wiki/MediaWiki:Gadget-ocr.js) expects the text field to contain XML data, or the error field to contain a value other than zero if there was an error, so it doesn't handle this.
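The missing check can be illustrated with a short sketch. It is written in Python purely for illustration (the gadget itself is JavaScript), and `classify_ocr_response` is a hypothetical name, not a function from the gadget or from phetools. The idea is to accept a response only if `error` is zero and `text` actually looks like XML.

```python
import json


def classify_ocr_response(raw):
    """Return (ok, detail) for a JSON response with "text" and "error" fields."""
    payload = json.loads(raw)
    text = payload.get("text", "")
    error = payload.get("error", 0)
    if error != 0:
        return False, "service reported error code %s" % error
    # The tool can report error 0 while putting an error message in "text",
    # so also require the text to start like the expected XML output.
    if not text.lstrip().startswith("<?xml"):
        return False, text
    return True, text


# The two responses compared above (the first one abbreviated).
good = '{"text": "<?xml version=\\"1.0\\" encoding=\\"UTF-8\\"?>...", "error": 0}'
bad = '{"text": "An error occurred during ocr processing: /tmp/52004_6179/page_0126.tif", "error": 0}'

print(classify_ocr_response(good)[0])
print(classify_ocr_response(bad)[0])
```

A guard of this kind on the client side would at least let the gadget leave the existing page text alone and show the message to the user, instead of passing the message on as if it were OCR output.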

What is the current state of this task? Are we just waiting to see whether @Phe responds, or is there any way to handle the problem even if he does not? He created a great tool, but it is no good if its maintenance depends on the availability of a single person.

Ineuw added a comment.Sep 4 2019, 7:00 PM

What is the current state of this task?? None of the substitutes (like Google OCR) come near in quality of character recognition.

Koavf triaged this task as Unbreak Now! priority.Sep 7 2019, 6:58 PM
Koavf added a subscriber: Koavf.
Restricted Application added a subscriber: Liuxinyu970226.Sep 7 2019, 6:58 PM
Koavf added a comment.Sep 7 2019, 6:59 PM

I changed the status: this is a critical part of en.ws.

Not only en.ws, but other language wikisources as well. There is a big danger of losing contributors.

JJMC89 added a subscriber: Tpt.Sep 7 2019, 7:10 PM
MJL added a subscriber: MJL.
Aklapper added a comment.EditedSep 10 2019, 12:16 PM

@Koavf: What happened or has changed that this task suddenly became "Unbreak Now!" priority, weeks after it was created?

Also note that the Priority field summarizes and reflects reality and does not cause it.

Aklapper lowered the priority of this task from Unbreak Now! to Needs Triage.Sep 10 2019, 12:17 PM
Xover added a comment.Sep 10 2019, 2:30 PM

> […] What happened or has changed that this task suddenly became "Unbreak Now!" priority, weeks after it was created?

Ah, I think it was an expression of the perceived criticality of the issue for those who use the functionality, which is a factor that genuinely does increase with elapsed time. For those who do use the OCR gadget it is so central to their workflow, and its absence so detrimental to their productivity in what is the primary process on Wikisource, that this issue is comparable to the one where it could take up to a minute to save a large article on enwp (speaking from experience there, *shudder*). And from the perspective of Wikisource contributors it is not obvious that this gadget is provided by a volunteer developer: it's an important and core part of the platform to them.

In any case, to the degree WMF resources can contribute to resolving this it might be worthwhile to do so. For example by restarting the service, or checking that it's not hanging on a stale NFS mount, or other sysadmin stuff like that. It's also not completely exotic (javascript and python, links in an above comment), so it should be entirely possible for someone without intimate familiarity with this code to do at least some debugging without a truly major investment of time. Based on my observations of the line of code that's throwing an error, my money is on this being a filesystem unavailable or external binary (tesseract) being switched out from under the tool, or some other "hosting environment"-related issue (I don't think the tool/gadget itself has been changed in a long time, and the relevant error messages occur server side).

MJL moved this task from Backlog to Next-up on the Wikisource board.Sep 12 2019, 3:01 PM
MJL added a comment.Sep 12 2019, 3:03 PM

@Xover I'm not a coder, but I'm willing to spend some time looking at things. I'll see if I can at least isolate the problem for why this bug occurred, then that should expedite the solution hopefully. We'll see what happens.

Indeed @Xover this is also what I understand from lines 62 to 67 of this file and I'd bet on a "hosting environment"-related issue as similar issues seem to have happened in the past, each time solved by @Phe:

The last time, in May, he wrote he "restarted the service" after @Peteforsyth "left a note on his github page" (but I don't know how to do that and I don't see any issue about this problem in the issue list of phe-tool's GitHub repository).

@Ineuw asked in that issue "Is it possible to implement the OCR locally?", I guess maybe @Phe has specific language data files for Tesseract which provide such good quality OCRs (as mentioned by several people about this gadget), which would be necessary to make the tool available on another server?

Neither Phe's OCR in my user space nor the Gadget OCR work on some volumes. You can test this:

https://en.wikisource.org/wiki/Index:Vol_5_History_of_Mexico_by_H_H_Bancroft.djvu

But here is an interesting scenario:

  1. Unchecked the OCR in Gadgets.
  2. Saved the change.
  3. Enabled mw.loader.load('//en.wikisource.org/w/index.php?title=User:Ineuw/OCR.js.js&action=raw&ctype=text/javascript'); in my vector.js
  4. And this also enabled the OCR in Gadgets. The previously unchecked box became checked again.

> The last time, in May, he wrote he "restarted the service" after @Peteforsyth "left a note on his github page" (but I don't know how to do that and I don't see any issue about this problem in the issue list of phe-tool's GitHub repository).

I just reported this bug on phil-el/phetools#12.

So months are passing by and the only suggested solution is: Send messages in all directions and find Phe the Magician, or wait until s/he appears, nobody else is able to fix it. I apologize for strong words, but the whole thing and the Phabricator itself are becoming quite ridiculous. Technical support, why have you forsaken us...?

Koavf triaged this task as Unbreak Now! priority.Oct 8 2019, 8:05 PM

I escalated the priority: this is fundamental to how Wikisource operates. If we want to replace this with Tesseract as a solution, that's fine--just do so for all users.

It looks like @Tpt is also a maintainer of Phetools, so might be able to help.

Aklapper renamed this task from en.ws OCR deletes old contents of a page, but does not generate new text. to [phetools] en.ws OCR deletes old contents of a page, but does not generate new text..Nov 12 2019, 7:09 PM
Aklapper added a project: Upstream.
Aklapper moved this task from Backlog to Reported Upstream on the Upstream board.

> no good if its maintenance depends on availability of a single person.

If no maintainers exist, feel free to escalate by creating a dedicated task under Toolforge-standards-committee (Maintainer needed) - see https://phabricator.wikimedia.org/project/profile/2952/ for more information.

> Technical support, why have you forsaken us...?

[off-topic] There might be a misunderstanding. Phabricator is not a general "Technical support" place. It's a place to report specific issues. Having a place to report issues does not mean that suddenly every project has active maintainers, or that someone will magically appear to fix some issue in one of the literally thousands of projects.

Jdforrester-WMF lowered the priority of this task from Unbreak Now! to High.Dec 17 2019, 11:21 AM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

This does not meet the criteria for UBN. Please do not abuse the priority system.

Koavf added a comment.Dec 17 2019, 8:27 PM

> This does not meet the criteria for UBN. Please do not abuse the priority system.

You may disagree with the assessment but it's not abuse: other users agreed that this is a valid task priority.

Agreed, @Koavf - when WMF staff have used terms like "abuse" in the past, it has sometimes been accompanied by punitive action. It's worthwhile to point out when WMF staff use sensational language like this directed toward users. @Jdforrester-WMF, any chance you could comment or retract the use of that loaded word? In addition, it would be helpful if you could provide some guidance about how you would like users to indicate preferences around task priority.

I'd also point out, @Aklapper, this is not merely "some issue in one of the literally thousands of projects." As @Koavf noted above, OCR is fundamental to how Wikisource operates. Something coded by a volunteer who has since become less available has become quite important to a very important Wikimedia project's success. This would not be the first time; volunteer innovation has always been an important, and complicating, factor in how Wikimedia has grown. While I do not pretend to know how WMF staff like the "importance" feature to be used, and I have no opinion on whether or not "unbreak now" is an appropriate or useful status, I would strongly urge WMF staff to avoid minimizing the significance of the problem, or scolding users who are trying to help get something important fixed.

Moreover, the problem does not concern *only* enwikisource, but all Wikisources that use this tool; frwikisource is very much impacted as well...

Aklapper had already pointed above to a MediaWiki.org link which explains how the Phabricator priority system works: mw:Phabricator/Project_management#Setting_task_priorities.

Phabricator is a tool which allows developers to organize their work openly. They do not have to manage the community here. You may want to start discussions on Meta or on your wiki's village pump, possibly pinging a Community Liaisons member to express your priority wishes.

At the risk of sounding condescending, this *is* fundamental to Wikisource. While some sources are digital-native, the majority of what Wikisource is is transcription of print sources, and that is impossible without ProofreadPage, which relies on an OCR scan to easily reduce the workload by between 30 and 90%, depending on the source. Again, if what you need is a quick fix to de-escalate this, just make the Tesseract OCR scan the actual tool in Wikisource and we can sort out the details of a best practice later. This needs to be fixed ASAP.

> Aklapper had already pointed above to a MediaWiki.org link which explains how the Phabricator priority system works: mw:Phabricator/Project_management#Setting_task_priorities.
> Phabricator is a tool which allows developers to organize their work openly. They do not have to manage the community here. You may want to start discussions on Meta or on your wiki's village pump, possibly pinging a Community Liaisons member to express your priority wishes.

I don't understand the distinction here.

Xover added a comment.Dec 18 2019, 6:54 AM

> I don't understand the distinction here.

@Koavf The priority field in Phabricator is defined as reflecting what priority the responsible team at the WMF has assigned a task on which they are currently working, or will shortly be working. It is not intended to be used by a reporter or community member to request, or argue for, the priority with which the responsible team should address the issue. If there are good reasons for changing the priority, the argument can be made in a comment on the task; but unless the argument actually persuades the responsible team that they should give it higher priority, the priority field should not be changed.

And previous comments have already emphasised the importance of this tool to the affected communities.

> Agreed, @Koavf - when WMF staff have used terms like "abuse" in the past, it has sometimes been accompanied by punitive action. … @Jdforrester-WMF, any chance you could comment or retract the use of that loaded word?

@Peteforsyth "Abuse", in this context, means "using for a purpose for which it is not intended". It is not an invocation of the wiki-specific connotation of the word "abuse".

And in general: as a community tool there is no team at the WMF whose responsibility this is, except in the vaguest possible way that all teams want to help the communities thrive. Anything they might do here would be "going above and beyond" and not actually a direct part of their responsibilities. And when the maintainer of a community tool is not available, they will also be severely limited in what they are permitted to do.

The long and short of it is, unless the maintainer resurfaces, our best bet is a different tool. Either a community fork of Phe's code; or the new tool that Community Tech has committed to investigating (note: "investigating", not actually "making"; it may not be feasible for them); or a new community-made tool (*cough* maybe someone could kick T239934 *cough*).

@Xover That's a plausible theory, though I was interpreting "abuse" in plain English, not in any jargony way; if you're right about the intent, I'd have expected a more straightforward construction like "misuse" or simply "the priority system is not the appropriate tool for expressing a preference." At any rate, not much use our speculating about James' intent...if he wishes to comment that will surely settle the matter.

Having reviewed the earlier discussion more closely, I appreciate your efforts to illuminate all the moving parts. I'm trying to fill in some gaps to get a clearer understanding of what is going on. Here's my broad sense -- I'd appreciate any comments or corrections to this:

  1. The Wikimedia Foundation has (since this bug was created, but by a different process, the Community Wishlist) committed to creating an OCR tool that might (?) address the problem resulting from the breakage of this community-created tool. https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2020/Wikisource/New_OCR_tool
  2. In all likelihood, WMF staff prefer to focus on the new tool, rather than addressing the complex issues you describe above.
  3. Since the Community Wishlist projects will not be created until some time in 2020, community members remain frustrated that important functionality is absent for many months.
  4. In the meantime, there is some small hope that volunteers could fix the existing OCR tool. @Adithyak1997 is attempting to fork and repair the project, and @Tpt may have the ability to repair it in place, but it doesn't appear either of those is likely to yield results very soon.

Is that accurate?

Adithyak1997 added a comment.EditedDec 19 2019, 4:11 AM

Actually, in my case, the problem is that I have not been added as a maintainer (note: doesn't mean I will solve it, but I will try).

Xover added a comment.Dec 19 2019, 7:10 AM

> Is that accurate?

In essence.

But note that Community Tech—the team that runs the Community Wishlist Survey—has only committed to investigate a new OCR tool. It may yet turn out that it is too complex, or for other reasons is not a feasible project within the scope of the Wishlist.

In addition, the reasons parts of the community prefer Phe's OCR tool over existing text layers in the PDF/DjVu or the Google OCR gadget are not straightforward. It may still turn out Community Tech produces a new tool and parts of the community will not think it sufficient: partly because when you get right down to details, the community is not necessarily able to articulate with sufficient specificity what are the important properties of one tool over the other.

Koavf added a comment.Dec 19 2019, 7:17 AM

Fixing problems like this and maintaining tools is difficult. In this particular case, there is a very straightforward quick fix, which is to use Tesseract's OCR tool and replace the default one. It works very well and is not broken. I'm not sure if the developers are busy or are trying to make the perfect the enemy of the good or what but a really easy way to resolve this in (what I assume as someone who is not a Wikimedia developer) five minutes is to drag and drop Tesseract's tool into some folder on some server and overwrite the standard OCR that no longer works. Long-term solutions are good, the development team's time and expertise are precious, and protracted bugs that just end up with a lot of talking about talking don't help anyone, so this seems like the best solution but I'm willing to admit my own ignorance if someone can come up with a better solution.

@Koavf It's not that simple. Hit me up on my enWS user talk page if you want the nitty gritty details. The short version is that Phe's OCR gadget (which is what this task is about, and which is what I assume you mean by the "default") is based on Tesseract, but that's not where the problem is: it's somewhere in the custom code Phe has written to provide the interface to Tesseract or its interaction with the server infrastructure on Toolforge.

> @Koavf It's not that simple. Hit me up on my enWS user talk page if you want the nitty gritty details. The short version is that Phe's OCR gadget (which is what this task is about, and which is what I assume you mean by the "default") is based on Tesseract, but that's not where the problem is: it's somewhere in the custom code Phe has written to provide the interface to Tesseract or its interaction with the server infrastructure on Toolforge.

I clearly don't know everything that's involved but there is an OCR gadget that works on en.ws, so if that functional gadget can't just replace the standard OCR button in the interface somehow, I'm at a real loss.

Tpt added a comment.Dec 19 2019, 10:32 AM

After spending an hour reading phe tools code and reading the tool logs, I believe I got an idea of the cause of the issue.

The background job that does the OCR itself seems to be running properly and starts Tesseract for each requested page. But then the job fails with a Tesseract error.
The error seems to be caused by Tesseract itself or the way phetools interacts with Tesseract.

The Tesseract log is full of messages like:

Detected 18 diacritics
Image too small to scale!! (3x36 vs min width of 3)
Line cannot be recognized!!
read_params_file: Can't open /usr/share/tesseract-ocr/4.00/tessdata/configs/
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 454

The Tesseract version installed on Toolforge is:

tesseract 4.0.0-beta.1
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.5.2 : libopenjp2 2.1.2

Installed from the stretch-backport Debian repository: https://packages.debian.org/stretch-backports/tesseract-ocr version 4.00~git2439-c3ed6f03-1~bpo9+1 (not the most recent one; it seems that Toolforge has not been updated to the latest version).

It suggests that the problem might be caused by https://github.com/tesseract-ocr/tesseract/issues/2288
If it is indeed the same problem, the fix should be to upgrade to Tesseract 4.1.0. Sadly, this version does not seem to be in Debian yet. The other option would be to downgrade to Tesseract 3 from the main Stretch repository, losing the new OCR algorithm and support for some languages.
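For anyone wanting to check a host against that threshold, the comparison could be sketched roughly as follows. This is an illustrative helper only: `parse_tesseract_version` is a hypothetical name, the version text is the one quoted above, and the pre-release suffix ("-beta.1") is simply ignored.

```python
import re


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the first line of the version text."""
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("empty version output")
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", lines[0])
    if match is None:
        raise ValueError("no version number found in: " + lines[0])
    return tuple(int(part) for part in match.groups())


# Version text as quoted in this comment.
installed = parse_tesseract_version("tesseract 4.0.0-beta.1\n leptonica-1.74.1")
needs_upgrade = installed < (4, 1, 0)
print(installed, needs_upgrade)
```

Tuples compare element by element, so `(4, 0, 0) < (4, 1, 0)` gives the expected answer without any extra version-handling library.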

@Tpt First: you're awesome! :)

But second, the read_params_file: Can't open /usr/share/tesseract-ocr/4.00/tessdata/configs/ points me in a different direction. If Tesseract can't find its configs it will not recognise any languages. Is this directory there? Does it contain any data? The correct data? Does tesseract have the correct permissions to read it? Is it on a network filesystem / NFS mount, and is that mount stale?

Given the error spit out from Python, and the line of code that does it (see above for both), I would also tend to suspect a problem with finding or accessing a temporary image file. The messages you quote from Tesseract (line can't be recognised, image too small, invalid resolution) also point in the direction of a missing or corrupt image input file. Is /tmp a normal filesystem? Or a RAM/flash-backed virtual filesystem? Is anything else "magical" about it? Is it strapped for free disk space? Do the temporary files get created as intended? Are they where we tell Tesseract to look for them? Does Tesseract have permissions to read them?
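Several of these questions (does the path exist, is it readable, does it contain anything, is there free space) can be answered with a few standard-library calls. The sketch below is only a starting point for someone with shell access to the tool's environment: `describe_path` is a hypothetical helper, not part of phetools, and it needs to be run as the same user as the OCR job for the permission check to mean anything.

```python
import os
import shutil


def describe_path(path):
    """Collect basic facts about a file or directory for troubleshooting."""
    info = {"path": path, "exists": os.path.exists(path)}
    if not info["exists"]:
        return info
    info["is_dir"] = os.path.isdir(path)
    info["readable"] = os.access(path, os.R_OK)
    if info["is_dir"]:
        # Number of directory entries, or None if the directory cannot be read.
        info["entries"] = len(os.listdir(path)) if info["readable"] else None
        # Free space on the filesystem holding this directory.
        info["free_bytes"] = shutil.disk_usage(path).free
    else:
        info["size_bytes"] = os.path.getsize(path)
    return info


if __name__ == "__main__":
    # Paths taken from the log and error messages quoted in this task.
    for candidate in ("/usr/share/tesseract-ocr/4.00/tessdata/configs/", "/tmp"):
        print(describe_path(candidate))
```

Note that on a stale NFS mount these calls may themselves hang rather than return, which would be a diagnostic result of its own.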

The Tesseract bug you link leads to Tesseract hanging in an infinite loop, which doesn't quite jibe with the visible symptoms or the logs you quote, and in any case that bug has not been fixed in Tesseract 4.1, but might be fixed in 4.1.1 (not certain: the absence of the problem was user-reported, but no Tesseract commit referenced that bug and its status is still open). I've also manually processed several thousand pages with Tesseract 4.0/4.1 and never once seen this hang occur, while phe-tools ocr seems to fail more often than work judging by the reports.

> I clearly don't know everything that's involved but there is an OCR gadget that works on en.ws, so if that functional gadget can't just replace the standard OCR button in the interface somehow, I'm at a real loss.

There is another OCR gadget that uses the Google Vision API instead of Tesseract, but the users of Phe's OCR gadget generally do not consider it an acceptable replacement. However, nothing prevents you from using it if it is sufficient for your purposes (depends on language and type of work you usually work with, as well as an element of subjective preference and workflow). In any case, this Phabricator task is about the phe-tools OCR tool, and discussions about other gadgets are best held either on-wiki or in a separate Phabricator task.

Koavf added a comment.Dec 19 2019, 5:41 PM

I do use it, but by having it be an opt-in gadget rather than the standard, we run the risk of losing editors, as mentioned above.

Xover added a comment.Dec 19 2019, 6:32 PM

@Koavf Which gadgets are enabled and which are on by default is up to each project.

So, the fallback OCR is now used instead of hOCR, until this bug is fixed.

This workaround works on all Wikisources which import mul:MediaWiki:OCR.js. I’ve made an edit request on en.ws too, because they have their own gadget script.

Pols12 renamed this task from [phetools] en.ws OCR deletes old contents of a page, but does not generate new text. to [phetools] Wikisource OCR deletes old contents of a page, but does not generate new text..Dec 21 2019, 1:05 PM
Koavf added a comment.Dec 21 2019, 8:26 PM

Why does something that is so straightforward and so critical take so long?