
[phetools] Wikisource OCR deletes old contents of a page, but does not generate new text.
Closed, ResolvedPublic

Description

The OCR tool available in Gadgets no longer works. Here are the steps to reproduce the problem:

  1. Using Firefox 68.0.1, log in to English Wikisource.
  2. Select OCR in Preferences\Gadgets.

  3. Open https://en.wikisource.org/w/index.php?title=Page:Mexico_under_Carranza.djvu/95&action=edit&debug=true, or any other page, in edit mode.
  4. Click on the OCR button; the editor appears "frozen".
  5. Save and close the document. The OCR deletes the existing text but does not replace it, as is seen here.

The following are the error messages (5) in the debug console generated by activating the OCR:

JQMIGRATE: Migrate is installed with logging active, version 3.0.1 load.php:316:217

This page is using the deprecated ResourceLoader module "jquery.throttle-debounce".

Please use OO.ui.throttle/debounce instead. See https://phabricator.wikimedia.org/T213426 load.php:787:260

This page is using the deprecated ResourceLoader module "jquery.ui.position". load.php:57:298

This page is using the deprecated ResourceLoader module "jquery.ui.widget". load.php:88:949

This page is using the deprecated ResourceLoader module "jquery.ui.core".

Please use OOUI instead. load.php:11:91

JQMIGRATE: jQuery(window).on('load'...) called after load event occurred load.php:316:792

JQMIGRATE: jQuery.fn.bind() is deprecated load.php:316:792
Exception: ve.init is undefined 

mw.libs.ve.EditingTabDialog.prototype.getSetupProcess/<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.visualEditor.switching%7Coojs-ui%7Coojs-ui.styles.icons-accessibility&skin=vector&version=0wrn2bo:4:376
proceed/<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editin
g-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:712:96
mightThrow@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cicons-editi
ng-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:223:916
resolve/</process<@https://en.wikisource.org/w/load.php?lang=en&modules=ext.CodeMirror%2CTemplateWizard%2Ccharinsert%2CeventLogging%2CnavigationTiming%2CwikimediaEvents%7Cext.CodeMirror.data%7Cext.centralNotice.geoIP%7Cext.centralauth.ForeignApi%7Cext.centralauth.centralautologin.clearcookie%7Cext.echo.api%2Cinit%7Cext.proofreadpage.icons%7Cext.proofreadpage.page.edit%2Cnavigation%7Cext.proofreadpage.ve.pageTarget.init%7Cext.uls.common%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.progressBarWidget%2CsupportCheck%2CtargetLoader%2CtempWikitextEditorWidget%2Ctrack%2Cve%7Cext.wikimediaEvents.loggedin%7Cjquery%2Cmoment%2Coojs%2Coojs-ui-core%2Coojs-ui-toolbars%2Coojs-ui-widgets%2Coojs-ui-windows%2Csite%7Cjquery.accessKeyLabel%2CcheckboxShiftClick%2Cclient%2Ccookie%2CgetAttrs%2ChighlightText%2ClengthLimit%2CmakeCollapsible%2Cmousewheel%2CprpZoom%2Csuggestions%2CtabIndex%2CtextSelection%2Cthrottle-debounce%7Cjquery.makeCollapsible.styles%7Cjquery.uls.data%7Cmediawiki.ForeignApi%2CForeignStructuredUpload%2CForeignUpload%2CRegExp%2CString%2CTitle%2CUpload%2CUri%2Capi%2Cbase%2Ccldr%2Ccookie%2Cexperiments%2Cicon%2CjqueryMsg%2Clanguage%2Cnotify%2CsearchSuggest%2Cstorage%2Ctemplate%2Cuser%2Cutil%2Cwidgets%7Cmediawiki.ForeignApi.core%7Cmediawiki.ForeignStructuredUpload.BookletLayout%7Cmediawiki.Upload.BookletLayout%2CDialog%7Cmediawiki.action.edit%7Cmediawiki.action.edit.collapsibleFooter%7Cmediawiki.language.specialCharacters%7Cmediawiki.legacy.wikibits%7Cmediawiki.libs.jpegmeta%2Cpluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.page.watch.ajax%7Cmediawiki.template.regexp%7Cmediawiki.widgets.CategoryMultiselectWidget%2CDateInputWidget%2CStashedFileWidget%2CUserInputWidget%2CvisibleLengthLimit%7Cmediawiki.widgets.DateInputWidget.styles%7Coojs-ui-toolbars.icons%7Coojs-ui-widgets.icons%7Coojs-ui-windows.icons%7Coojs-ui.styles.icons-content%2Cicons-editing-advanced%2Cicons-editing-citation%2Cicons-editing-core%2Cico
ns-editing-list%2Cicons-editing-styling%2Cicons-interactions%2Cicons-media%2Cicons-movement%7Cskins.vector.js%7Cuser.defaults%7Cwikibase.client.action.edit.collapsibleFooter&skin=vector&version=0as2p6p:224:589
 undefined 2 load.php:226:749

Event Timeline


@Koavf: What happened or has changed that this task suddenly became "Unbreak Now!" priority, weeks after it was created?

Also note that the Priority field summarizes and reflects reality and does not cause it.

Aklapper lowered the priority of this task from Unbreak Now! to Needs Triage. Sep 10 2019, 12:17 PM

[…] What happened or has changed that this task suddenly became "Unbreak Now!" priority, weeks after it was created?

Ah, I think it was an expression of the perceived criticality of the issue for those who use the functionality, which is a factor that genuinely does increase with elapsed time. For those who do use the OCR gadget it is so central to their workflow, and its absence so detrimental to their productivity in what is the primary process on Wikisource, that this issue is comparable to the one where it could take up to a minute to save a large article on enwp (speaking from experience there, *shudder*). And from the perspective of Wikisource contributors it is not obvious that this gadget is provided by a volunteer developer: it's an important and core part of the platform to them.

In any case, to the degree WMF resources can contribute to resolving this it might be worthwhile to do so. For example by restarting the service, or checking that it's not hanging on a stale NFS mount, or other sysadmin stuff like that. It's also not completely exotic (javascript and python, links in an above comment), so it should be entirely possible for someone without intimate familiarity with this code to do at least some debugging without a truly major investment of time. Based on my observations of the line of code that's throwing an error, my money is on this being a filesystem unavailable or external binary (tesseract) being switched out from under the tool, or some other "hosting environment"-related issue (I don't think the tool/gadget itself has been changed in a long time, and the relevant error messages occur server side).

@Xover I'm not a coder, but I'm willing to spend some time looking at things. I'll see if I can at least isolate the problem for why this bug occurred, then that should expedite the solution hopefully. We'll see what happens.

Indeed @Xover this is also what I understand from lines 62 to 67 of this file and I'd bet on a "hosting environment"-related issue as similar issues seem to have happened in the past, each time solved by @Phe:

The last time, in May, he wrote he "restarted the service" after @Peteforsyth "left a note on his github page" (but I don't know how to do that and I don't see any issue about this problem in the issue list of phe-tool's GitHub repository).

@Ineuw asked in that issue "Is it possible to implement the OCR locally?", I guess maybe @Phe has specific language data files for Tesseract which provide such good quality OCRs (as mentioned by several people about this gadget), which would be necessary to make the tool available on another server?

Neither Phe's OCR in my user space nor the Gadget OCR work on some volumes. You can test this:

https://en.wikisource.org/wiki/Index:Vol_5_History_of_Mexico_by_H_H_Bancroft.djvu

But here is an interesting scenario:

  1. Unchecked the OCR in Gadgets.
  2. Saved the change.
  3. Enabled mw.loader.load('//en.wikisource.org/w/index.php?title=User:Ineuw/OCR.js.js&action=raw&ctype=text/javascript'); in my vector.js
  4. And this also enabled the OCR in Gadgets. The previously unchecked box became checked again.

The last time, in May, he wrote he "restarted the service" after @Peteforsyth "left a note on his github page" (but I don't know how to do that and I don't see any issue about this problem in the issue list of phe-tool's GitHub repository).

I just reported this bug on phil-el/phetools#12.

So months are passing by and the only suggested solution is: Send messages in all directions and find Phe the Magician, or wait until s/he appears, nobody else is able to fix it. I apologize for strong words, but the whole thing and the Phabricator itself are becoming quite ridiculous. Technical support, why have you forsaken us...?

Koavf triaged this task as Unbreak Now! priority. Oct 8 2019, 8:05 PM

I escalated the priority: this is fundamental to how Wikisource operates. If we want to replace this with Tesseract as a solution, that's fine--just do so for all users.

It looks like @Tpt is also a maintainer of Phetools, so might be able to help.

Aklapper renamed this task from "en.ws OCR deletes old contents of a page, but does not generate new text." to "[phetools] en.ws OCR deletes old contents of a page, but does not generate new text." Nov 12 2019, 7:09 PM
Aklapper added a project: Upstream.
Aklapper moved this task from Backlog to Reported Upstream on the Upstream board.

no good if its maintenance depends on availability of a single person.

If no maintainers exist, feel free to escalate by creating a dedicated task under Toolforge-standards-committee (Maintainer needed) - see https://phabricator.wikimedia.org/project/profile/2952/ for more information.

Technical support, why have you forsaken us...?

[off-topic] There might be a misunderstanding. Phabricator is not a general "Technical support" place. It's a place to report specific issues. Having a place to report issues does not mean that suddenly every project has active maintainers, or that someone will magically appear to fix some issue in one of the literally thousands of projects.

Jdforrester-WMF lowered the priority of this task from Unbreak Now! to High. Dec 17 2019, 11:21 AM
Jdforrester-WMF subscribed.

This does not meet the criteria for UBN. Please do not abuse the priority system.

You may disagree with the assessment but it's not abuse: other users agreed that this is a valid task priority.

Agreed, @Koavf - when WMF staff have used terms like "abuse" in the past, it has sometimes been accompanied by punitive action. It's worthwhile to point out when WMF staff use sensational language like this directed toward users. @Jdforrester-WMF, any chance you could comment or retract the use of that loaded word? In addition, it would be helpful if you could provide some guidance about how you would like users to indicate preferences around task priority.

I'd also point out, @Aklapper, this is not merely "some issue in one of the literally thousands of projects." As @Koavf noted above, OCR is fundamental to how Wikisource operates. Something coded by a volunteer who has since become less available has become quite important to a very important Wikimedia project's success. This would not be the first time; volunteer innovation has always been an important, and complicating, factor in how Wikimedia has grown. While I do not pretend to know how WMF staff like the "importance" feature to be used, and I have no opinion on whether or not "unbreak now" is an appropriate or useful status, I would strongly urge WMF staff to avoid minimizing the significance of the problem, or scolding users who are trying to help get something important fixed.

Moreover, the problem does not concern only enwikisource, but every Wikisource that uses this tool; frwikisource is heavily affected as well.

Aklapper has already pointed above to a MediaWiki.org page that explains how the Phabricator priority system works: mw:Phabricator/Project_management#Setting_task_priorities.

Phabricator is a tool that allows developers to organize their work in the open. They do not have to manage the community here. You may want to start a discussion on Meta or on your wiki's village pump, possibly pinging a Community Liaisons member to express your priority wishes.

At the risk of sounding condescending, this *is* fundamental to Wikisource. While some sources are digital-native, the majority of what Wikisource does is transcription of print sources, and that is impossible without ProofreadPage, which relies on an OCR scan to easily reduce the workload by between 30 and 90%, depending on the source. Again, if what you need is a quick fix to de-escalate this, just make the Tesseract OCR scan the actual tool in Wikisource and we can sort out the details of a best practice later. This needs to be fixed ASAP.

Aklapper has already pointed above to a MediaWiki.org page that explains how the Phabricator priority system works: mw:Phabricator/Project_management#Setting_task_priorities.

Phabricator is a tool that allows developers to organize their work in the open. They do not have to manage the community here. You may want to start a discussion on Meta or on your wiki's village pump, possibly pinging a Community Liaisons member to express your priority wishes.

I don't understand the distinction here.

@Koavf The priority field in Phabricator is defined as reflecting what priority the responsible team at the WMF has assigned a task on which they are currently working, or will shortly be working. It is not intended to be used for a reporter or community member to request or argue for with what priority the responsible team should address the issue. If there are good reasons for changing the priority, the argument can be made in a comment on the task; but unless the argument actually persuades the responsible team they should give it higher priority, the priority field should not be changed.

And previous comments have already emphasised the importance of this tool to the affected communities.

Agreed, @Koavf - when WMF staff have used terms like "abuse" in the past, it has sometimes been accompanied by punitive action. … @Jdforrester-WMF, any chance you could comment or retract the use of that loaded word?

@Peteforsyth "Abuse", in this context, means "using for a purpose for which it is not intended". It is not an invocation of the wiki-specific connotation of the word "abuse".

And in general: as a community tool there is no team at the WMF whose responsibility this is, except in the vaguest possible way that all teams want to help the communities thrive. Anything they might do here would be "going above and beyond" and not actually a direct part of their responsibilities. And when the maintainer of a community tool is not available, they will also be severely limited in what they are permitted to do.

The long and short of it is, unless the maintainer resurfaces, our best bet is a different tool. Either a community fork of Phe's code; or the new tool that Community Tech has committed to investigating (note: "investigating", not actually "making"; it may not be feasible for them); or a new community-made tool (*cough* maybe someone could kick T239934 *cough*).

@Xover That's a plausible theory, though I was interpreting "abuse" in plain English, not in any jargony way; if you're right about the intent, I'd have expected a more straightforward construction like "misuse" or simply "the priority system is not the appropriate tool for expressing a preference." At any rate, not much use our speculating about James' intent...if he wishes to comment that will surely settle the matter.

Having reviewed the earlier discussion more closely, I appreciate your efforts to illuminate all the moving parts. I'm trying to fill in some gaps to get a clearer understanding of what is going on. Here's my broad sense -- I'd appreciate any comments or corrections to this:

  1. The Wikimedia Foundation has (since this bug was created, but by a different process, the Community Wishlist) committed to creating an OCR tool that might (?) address the problem resulting from the breakage of this community-created tool. https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2020/Wikisource/New_OCR_tool
  2. In all likelihood, WMF staff prefer to focus on the new tool, rather than addressing the complex issues you describe above.
  3. Since the Community Wishlist projects will not be created until some time in 2020, community members remain frustrated that important functionality is absent for many months.
  4. In the meantime, there is some small hope that volunteers could fix the existing OCR tool. @Adithyak1997 is attempting to fork and repair the project, and @Tpt may have the ability to repair it in place, but it doesn't appear either of those is likely to yield results very soon.

Is that accurate?

Actually, in my case, the problem is that I have not been added as a maintainer (note: doesn't mean I will solve it, but I will try).

Is that accurate?

In essence.

But note that Community Tech—the team that runs the Community Wishlist Survey—has only committed to investigate a new OCR tool. It may yet turn out that it is too complex, or for other reasons is not a feasible project within the scope of the Wishlist.

In addition, the reasons parts of the community prefer Phe's OCR tool over existing text layers in the PDF/DjVu or the Google OCR gadget are not straightforward. It may still turn out Community Tech produces a new tool and parts of the community will not think it sufficient: partly because when you get right down to details, the community is not necessarily able to articulate with sufficient specificity what are the important properties of one tool over the other.

Fixing problems like this and maintaining tools is difficult. In this particular case, there is a very straightforward quick fix, which is to use Tesseract's OCR tool and replace the default one. It works very well and is not broken. I'm not sure if the developers are busy or are trying to make the perfect the enemy of the good or what but a really easy way to resolve this in (what I assume as someone who is not a Wikimedia developer) five minutes is to drag and drop Tesseract's tool into some folder on some server and overwrite the standard OCR that no longer works. Long-term solutions are good, the development team's time and expertise are precious, and protracted bugs that just end up with a lot of talking about talking don't help anyone, so this seems like the best solution but I'm willing to admit my own ignorance if someone can come up with a better solution.

@Koavf It's not that simple. Hit me up on my enWS user talk page if you want the nitty gritty details. The short version is that Phe's OCR gadget (which is what this task is about, and which is what I assume you mean by the "default") is based on Tesseract, but that's not where the problem is: it's somewhere in the custom code Phe has written to provide the interface to Tesseract or its interaction with the server infrastructure on Toolforge.

I clearly don't know everything that's involved but there is an OCR gadget that works on en.ws, so if that functional gadget can't just replace the standard OCR button in the interface somehow, I'm at a real loss.

After spending an hour reading phe tools code and reading the tool logs, I believe I got an idea of the cause of the issue.

The background job that does the OCR itself seems to be running properly and starts Tesseract for each requested page. But then the job fails with a Tesseract error.
The error seems to be caused by Tesseract itself or by the way phetools interacts with Tesseract.

The Tesseract log is full of messages like:

Detected 18 diacritics
Image too small to scale!! (3x36 vs min width of 3)
Line cannot be recognized!!
read_params_file: Can't open /usr/share/tesseract-ocr/4.00/tessdata/configs/
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 454

The Tesseract version installed on Toolforge is:

tesseract 4.0.0-beta.1
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.5.2 : libopenjp2 2.1.2

Installed from the stretch-backports Debian repository: https://packages.debian.org/stretch-backports/tesseract-ocr, version 4.00~git2439-c3ed6f03-1~bpo9+1 (not the most recent one; it seems that Toolforge has not updated to the latest version).

It suggests that the problem might be caused by https://github.com/tesseract-ocr/tesseract/issues/2288
If it is indeed the same problem, the fix should be to upgrade to Tesseract 4.1.0. Sadly, this version does not seem to be in Debian yet. The other option would be to downgrade to Tesseract 3 from the main Stretch repository, losing the new OCR algorithm and support for some languages.
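As a rough aid for that upgrade-or-downgrade decision, here is a minimal Python sketch that pulls the version number out of a banner like the one quoted above and compares it with a minimum. The function names are hypothetical and not part of phetools, and the pre-release suffix ("-beta.1") is simply ignored, which is an assumption of this sketch.

```python
import re
from typing import Tuple

def parse_tesseract_version(banner: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a `tesseract --version` style banner."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", banner, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no Tesseract version found in: {banner!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch

def is_older_than(banner: str, minimum: Tuple[int, int, int]) -> bool:
    """True if the banner's version sorts before `minimum` (suffixes ignored)."""
    return parse_tesseract_version(banner) < minimum

# Banner text as quoted in this task.
banner = "tesseract 4.0.0-beta.1\n leptonica-1.74.1"
print(parse_tesseract_version(banner))
print(is_older_than(banner, (4, 1, 0)))
```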

@Tpt First: you're awesome! :)

But second, the read_params_file: Can't open /usr/share/tesseract-ocr/4.00/tessdata/configs/ points me in a different direction. If Tesseract can't find its configs it will not recognise any languages. Is this directory there? Does it contain any data? The correct data? Does Tesseract have the correct permissions to read it? Is it on a network filesystem / NFS mount, and is that mount stale?

Given the error spit out from Python, and the line of code that does it (see above for both), I would also tend to suspect a problem with finding or accessing a temporary image file. The messages you quote from Tesseract (line can't be recognised, image too small, invalid resolution) also point in the direction of a missing or corrupt image input file. Is /tmp a normal filesystem? Or a RAM/flash-backed virtual filesystem? Is anything else "magical" about it? Is it strapped for free disk space? Do the temporary files get created as intended? Are they where we tell Tesseract to look for them? Does Tesseract have permissions to read them?
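For whoever has shell access to the tool account, the questions above could be turned into a quick script. This is only a sketch: the helper names are made up for illustration, the free-space threshold is arbitrary, and the only path taken from this task is the tessdata configs directory in the log line.

```python
import os
import shutil
import tempfile

def check_readable_dir(path):
    """Return a list of human-readable problems found with a directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path}: not a directory (missing, or a stale mount?)")
        return problems
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path}: not readable/searchable by this user")
    try:
        if not os.listdir(path):
            problems.append(f"{path}: directory is empty")
    except OSError as exc:
        problems.append(f"{path}: cannot list contents ({exc})")
    return problems

def check_free_space(path, min_free_bytes):
    """Return a one-item problem list if the filesystem holding `path` is low on space."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return [f"{path}: only {free} bytes free, wanted at least {min_free_bytes}"]
    return []

if __name__ == "__main__":
    # Path from the Tesseract log quoted in this task.
    report = check_readable_dir("/usr/share/tesseract-ocr/4.00/tessdata/configs/")
    tmp = tempfile.gettempdir()
    report += check_readable_dir(tmp)
    report += check_free_space(tmp, 100 * 1024 * 1024)  # arbitrary 100 MiB threshold
    print("\n".join(report) or "no problems found")
```

It has to run as the same user as the OCR job, otherwise the permission checks say little.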

The Tesseract bug you link leads to Tesseract hanging in an infinite loop, which doesn't quite jibe with the visible symptoms or the logs you quote, and in any case that bug has not been fixed in Tesseract 4.1, but might be fixed in 4.1.1 (not certain: the absence of the problem was user-reported, but no Tesseract commit referenced that bug and its status is still open). I've also manually processed several thousand pages with Tesseract 4.0/4.1 and never once seen this hang occur, while phe-tools ocr seems to fail more often than work judging by the reports.

I clearly don't know everything that's involved but there is an OCR gadget that works on en.ws, so if that functional gadget can't just replace the standard OCR button in the interface somehow, I'm at a real loss.

There is another OCR gadget that uses the Google Vision API instead of Tesseract, but the users of Phe's OCR gadget generally do not consider it an acceptable replacement. However, nothing prevents you from using it if it is sufficient for your purposes (depends on language and type of work you usually work with, as well as an element of subjective preference and workflow). In any case, this Phabricator task is in regards the phe-tools OCR tool and discussions about other gadgets are best held either on-wiki or in a separate Phabricator task.

I do use it, but by having it be an opt-in gadget rather than the standard, we run the risk of losing editors, as mentioned above.

@Koavf Which gadgets are enabled and which are on by default is up to each project.

So, the fallback OCR is now used instead of hOCR, until this bug is fixed.

This workaround works on all Wikisources which import mul:MediaWiki:OCR.js. I’ve made an edit request on en.ws too, because they have their own gadget script.

Pols12 renamed this task from "[phetools] en.ws OCR deletes old contents of a page, but does not generate new text." to "[phetools] Wikisource OCR deletes old contents of a page, but does not generate new text." Dec 21 2019, 1:05 PM

Why does something that is so straightforward and so critical take so long?

But note that Community Tech—the team that runs the Community Wishlist Survey—has only committed to investigate a new OCR tool. It may yet turn out that it is too complex, or for other reasons is not a feasible project within the scope of the Wishlist.

@Xover - It’s looking very likely that Community Tech will in fact be able to work on an improved OCR tool for Wikisource. They’ve already been investigating various OCR engines (including Tesseract) and talking with potential 3rd party API providers. The whole virus situation has slowed everything down a bit, but things are still moving forward.

@Tpt, @Xover - We have upgraded Tesseract on Toolforge to 4.1.1. Can either of you confirm whether this fixed the problem with phetools OCR tool? It should also be noted that the Google OCR gadget has been improved in the past year and may be a viable alternative.

Community Tech has not yet started on the "New OCR tool" request from the Community Wishlist, but is hoping to get to that soon. They're working on the "Improve export of electronic books" request at the moment.

@kaldari Nope, still seeing the same failure mode. It greys out the text in the editor and then throws an error in the JS console along the lines of: An error occurred during ocr processing: /tmp/52004_6179/page_0199.tif.

I don't think this is fixable without hands-on debugging along the lines @Tpt started in T228594#5753630 up above; and it's even possible that Tesseract 4.x is the cause of the problem and the fix is to downgrade to Tesseract 3.x (the tool was written for 3.x and uses custom data files; and Tesseract's output has changed from 3.x to 4.x).

We are currently recommending the Google OCR gadget as an alternative: it generally works well, and seems to be both robust and fast. But it has some different properties (mostly in terms of different tradeoffs) that some parts of the community find insufficient, and the privacy policy doesn't currently allow us to turn it on by default (due to being hosted on Toolforge and for talking to a non-WMF API).

Ok, having gotten access to the project in connection with T265640 I've been trying to debug this a bit.

First off, this is an insanely complicated codebase: it's got a whole private gridengine job management layer, speculative processing of pages, and a cache going back to 2015.

And that last bit is most likely the key here.

The error message that pops up in the browser's console comes from ocr.ocr() when it fails to open the TIFF file extracted from the DjVu that it is trying to run Tesseract on. For some reason the code doesn't actually throw any exception here: it writes that error message to its output file, where the OCR text is usually output, and then returns a normal status to its calling context.

I've been tearing my hair out trying to trace the call chain here to try to figure out why that TIFF is missing, and failing miserably because the Gridengine job that would invoke ocr.ocr() is never created. What actually seems to be happening is that ocr.ocr() is never called: the error message we're seeing now was output the first time this DjVu file was processed, persisted as if it was OCR text, and is now being returned from cache. It's not setting error: 1 in the JSON because from that process' perspective this isn't an error: it's returning what it thinks is normal OCR text, which, I'm assuming, bombs client-side because it's in the wrong format.

In other words, the error that caused this happened years ago and is probably no longer relevant.
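To illustrate the failure mode described above, here is a small self-contained Python sketch. It is not phetools code: every name in it is hypothetical, and the cache is a plain dict. It contrasts an OCR function that returns its error message as if it were text (so the message gets cached) with one that raises (so nothing is cached and the next request retries).

```python
import os

ERROR_PREFIX = "An error occurred during ocr processing"

class OcrError(Exception):
    """Raised when a page cannot be processed."""

def ocr_page_legacy(tiff_path, run_ocr):
    # Mimics the behaviour described above: the error becomes the "result".
    if not os.path.exists(tiff_path):
        return f"{ERROR_PREFIX}: {tiff_path}"
    return run_ocr(tiff_path)

def ocr_page_strict(tiff_path, run_ocr):
    # Alternative: fail loudly so callers cannot mistake the error for text.
    if not os.path.exists(tiff_path):
        raise OcrError(f"{ERROR_PREFIX}: {tiff_path}")
    return run_ocr(tiff_path)

def cached_ocr(cache, key, produce):
    """Return cache[key], computing and storing it on a miss."""
    if key not in cache:
        cache[key] = produce()  # an exception here leaves the cache untouched
    return cache[key]

if __name__ == "__main__":
    missing = os.path.join("no-such-dir-for-demo", "page_0199.tif")
    fake_ocr = lambda path: "recognised text"

    legacy_cache = {}
    print(cached_ocr(legacy_cache, "p199", lambda: ocr_page_legacy(missing, fake_ocr)))
    # The error string is now stored and will be served on every later request.
    print("p199" in legacy_cache)

    strict_cache = {}
    try:
        cached_ocr(strict_cache, "p199", lambda: ocr_page_strict(missing, fake_ocr))
    except OcrError as exc:
        print("failed, nothing cached:", exc)
    print("p199" in strict_cache)
```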

I'm digging further but this is complicated by the fact that parts of the relevant information only exist in the database, and the cache is massive (every book processed since 2015), and by the sheer volume of logs that have accumulated since 2015 (~70k distinct log files in a single directory; too much data for du to complete before SSH times out). There's code to nuke and rebuild that cache but it's partially disabled and I have no idea whether it's supposed to be functional (using it may fail catastrophically).

I'm hoping that if I can identify the cached files for the test work here and manually nuke them, that may be enough to get it to actually process the work again: either successfully, or with a fresh failure that can actually be debugged.

Ah, yes.

The cache for a given work will be in a subdirectory of ~/cache/hocr/ created from the MD5 hash of the file's name (spaces replaced with underscores) concatenated with the invoking project's language code. So for Mexico_under_Carranza.djvu requested from English Wikisource, you can generate the hash with…

> md5 -s Mexico_under_Carranza.djvuen
MD5 ("Mexico_under_Carranza.djvuen") = 09249a0e01d2a8b1bc920842e4549efc

…and the cache will be in ~/cache/hocr/09/24/9a0e01d2a8b1bc920842e4549efc/.
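For anyone without a local md5 binary, the scheme described above can be sketched in a few lines of Python. This is only an illustration of the description in this comment, not code from phetools: the function name hocr_cache_dir and the choice to hash the key as UTF-8 bytes are assumptions.

```python
import hashlib
from pathlib import PurePosixPath

CACHE_ROOT = PurePosixPath("~/cache/hocr")  # cache location named in the thread


def hocr_cache_dir(filename: str, lang: str) -> PurePosixPath:
    """Return the cache directory for a work, per the scheme described above.

    Hypothetical helper: the name and signature are not taken from phetools.
    """
    # Spaces become underscores, then the project's language code is appended.
    key = filename.replace(" ", "_") + lang
    # Assumption: the key is hashed as UTF-8 bytes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # First two hex digits, next two, then the remaining 28 as the leaf directory.
    return CACHE_ROOT / digest[:2] / digest[2:4] / digest[4:]


print(hocr_cache_dir("Mexico_under_Carranza.djvu", "en"))
```

The printed path should have the same shape as the one given above; the leading ~ is left unexpanded on purpose so the sketch does not depend on any particular home directory.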

If you nuke the cache directory you can invoke the OCR gadget on a page from that work and it will spawn an SGE job that grabs the DjVu file (in ~/tmp/hocr/en/Mexico_under_Carranza.djvu) and generates bzip2-compressed OCR text for each page in the cache. Future invocations of the gadget on a page for that work will then return the cached OCR (fast).

There is no reaper for this cache so it will persist forever, including garbage data if a transient error of some stripe triggers the previously described error.

This fixed the problem for the test case "Mexico_under_Carranza.djvu", but there were multiple problems reported with the OCR gadget so there may still be other issues lurking here. I'll go through the reports I can find, but now having test cases (pages where the OCR button doesn't work) would be very good.

Anyone got any that they can verify is still broken? Or conversely, that were broken but have now started working?

There have also been multiple changes to the hosting environment (new versions of supporting libraries, major new release of Tesseract, etc.) since last this code was touched which may affect it in subtle (and less subtle) ways, so testing would be a good idea.

You can test on any page in the Page: namespace on English Wikisource. This seems to work for most pages, but it is still broken for some others, e.g. Page:Europe in China.djvu/422 or Page:Plutarch's Lives (Clough, v.4, 1865).djvu/215.

Both of these files were affected by the same bug as described above: when OCR was first requested for a page in each work (10 July 2019 and 1 December 2019, respectively) there was some kind of transient error that caused the OCR function to not find the temporary files containing each page image, but the OCR function blithely wrote the error message into the cache as if it were successfully extracted OCR text. Ever since, every request for pages from these works has returned that cached error message, and since the error message isn't correctly formatted the JavaScript running in your web browser has failed silently (except for the message in the developer console).

I have now deleted the caches for both these works and successfully gotten OCR for the pages you linked to. The remaining pages of both works are currently being processed, and within a couple of hours requests for pages from these works should start returning cached data (i.e. fast).

So far, all the cases I have found have been symptoms of this specific problem, so I am looking at writing a script to analyse the cache and remove any files affected by this. But it would be nice to have more test cases first to verify that this is really the only / main failure people are being affected by.

@Xover - What would be the effect of just deleting all the caches? Tesseract has been upgraded since most of those caches were generated anyway.

A lot of already-spent CPU cycles wasted, and a moderate amount of new cycles spent on regenerating OCR for those works if and when a page from one of them is requested. Probably not so much that it really matters, but there are 42k+ works in there (with a probable average number of pages in the hundreds) that it'd be a shame to lose.

And it's not a given that Tesseract 4.x generates better (or equally good) results, since better is a subjective measure in this case.

If the problematic cache entries can be reliably identified, and writing the script doesn't run into any unexpected difficulties, that's my Plan A; with just deleting the whole cache as the Nuclear Option.

It is great something is happening with this issue!
However, I tested it now e. g. on Page:John_Huss,_his_life,_teachings_and_death,_after_five_hundred_years.pdf/122 and some other pages of the same book and it still does not work here :-(

… I tested it now e. g. on Page:John_Huss,_his_life,_teachings_and_death,_after_five_hundred_years.pdf/122 and some other pages of the same book and it still does not work here :-(

Yup, same issue on this file as the ones above. I've nuked the cache for this file and now get (slow) OCR for it. @Jan.Kamenicek can you retest now?

The full-work OCR will kick in soon and should generate a new cache for it within a couple of hours, at which point OCR results should start coming back fast.

If you have other works where this OCR gadget fails then please list them here. It will be useful for verifying that it's the same issue affecting them, and meanwhile I can manually nuke the cache for them so they start working again.

Thanks very much. Now it works, although the result is very poor, much worse than what I was used to before it got broken: every word is on a new line. At first I thought it might be a problem connected only with this particular work, so I tried it also on Page:The_story_of_Prague.djvu/131 (this work already has its own OCR layer, but I tried to overwrite it with our OCR for testing purposes) and the result is the same: every word on a new line.
Besides that I tested it also on other works (they all have their own OCR layer, but I just wanted to try it): Page:The Bohemian Review, vol2, 1918.djvu/229, Page:The Czechoslovak Review, vol3, 1919.djvu/430, and Page:Poet Lore, volume 28, 1917.djvu. Unfortunately, the OCR does not work with any of these at all :-(

Unfortunately, the OCR does not work with any of these at all

The works…

  • The Bohemian Review, vol2, 1918.djvu
  • The Czechoslovak Review, vol3, 1919.djvu
  • Poet Lore, volume 28, 1917.djvu

…were all affected by the same bug discussed above. I've deleted their caches and you should now actually get OCR results for these (slow now, fast once the preprocessing completes).

Now it works, although the result is very poor, much worse than what I was used to before it got broken: every word is on a new line.

This is a separate problem, and is most likely related to Tesseract being upgraded to 4.x. Phetools was written for Tesseract 3.x and there were output changes between the versions, so I'm guessing the code for parsing Tesseract's output is not dealing optimally with the changed output. Alternatively, it may be due to other changes in the infrastructure or supporting libraries. It's probably not that Tesseract 4.x simply produces poorer results on some works, because I don't see that in my own (hacky, in development) gadget that also uses Tesseract 4.x.

I'll investigate this to try to pinpoint the cause and see what can be done; but since this is likely to require code changes (i.e. it's a "programming" task rather than a "sysadmin" task) it may be somewhat harder to address.

… the [OCR] result is very poor, …: every word is on a new line.

This is a separate problem, and is most likely related to Tesseract being upgraded to 4.x.

Hmm. It may be related to that, but on closer inspection it appears equally likely to be related to the user's web browser.

It turns out the phetools backend returns raw hOCR (an HTML microformat) to the Gadget (wrapped in JSON, but that's unwrapped by jQuery), where the Gadget inserts it into the DOM and uses jQuery's .text() to get a plain text representation back (which is then inserted in #wpTextbox1).

This means that the text wrapping is implementation dependent and variable based on how the browser in question handles whitespace in .textContent.
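To illustrate what a browser-independent conversion could look like, here is a rough Python sketch that walks the hOCR markup and emits one text line per ocr_line span, joining the ocrx_word spans with single spaces. It is not the Gadget's code (which is JavaScript), and the class name HocrText and the sample markup are made up for this example; only the ocr_line and ocrx_word class names come from Tesseract's hOCR output.

```python
from html.parser import HTMLParser


class HocrText(HTMLParser):
    """Collect words per ocr_line so line breaks do not depend on a browser."""

    def __init__(self):
        super().__init__()   # convert_charrefs=True decodes entities like &#39;
        self.lines = []      # finished lines of text
        self._words = None   # words of the line currently open, if any
        self._in_word = False
        self._depth = 0      # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self._words, self._depth = [], 1
        elif self._words is not None:
            self._depth += 1
            if "ocrx_word" in classes:
                self._in_word = True
                self._words.append("")

    def handle_endtag(self, tag):
        if tag != "span" or self._words is None:
            return
        self._depth -= 1
        self._in_word = False
        if self._depth == 0:  # this closes the ocr_line span itself
            self.lines.append(" ".join(w for w in self._words if w))
            self._words = None

    def handle_data(self, data):
        # Whitespace between word spans is ignored; only word content is kept.
        if self._in_word:
            self._words[-1] += data.strip()


# Made-up sample in the style of Tesseract's hOCR output.
sample = (
    "<span class='ocr_line' id='line_1_4'>\n"
    "  <span class='ocrx_word' id='word_1_19'>ou</span>\n"
    "  <span class='ocrx_word' id='word_1_20'>rapporter</span>\n"
    "  <span class='ocrx_word' id='word_1_21'>l&#39;ouvrage</span>\n"
    "</span>\n"
)
parser = HocrText()
parser.feed(sample)
parser.close()
print("\n".join(parser.lines))
```

The point of the design is that line breaks come only from the ocr_line structure, so the newlines and indentation between word spans in the raw markup never reach the output.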

Ok, an update on the corrupted cache…

After ~28 hours wallclock runtime, the script had finished checking all pages in 42104 works. Of these, ~23% (9644) had at least one page affected by this problem. I've now deleted the caches for the affected works, so that should hopefully be the end of that particular error.
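For reference, a scan of this kind can be sketched roughly as below. This is not the script that was actually run: it assumes that cached pages are bzip2 files with a .bz2 suffix sitting directly in each work's directory, and that valid cached hOCR starts with an XML declaration. The function names are hypothetical, and the sketch only reports candidates; it deletes nothing.

```python
import bz2
from pathlib import Path

# Assumption: valid cached hOCR starts with an XML declaration.
HOCR_PREFIX = b"<?xml"


def is_corrupt(page_file: Path) -> bool:
    """True if a cached page does not decompress to something that looks like hOCR."""
    try:
        with bz2.open(page_file, "rb") as fh:
            head = fh.read(len(HOCR_PREFIX))
    except (OSError, EOFError):
        return True  # unreadable or truncated archives count as corrupt
    return head != HOCR_PREFIX


def corrupt_works(cache_root: Path) -> list[Path]:
    """List work directories that contain at least one corrupt cached page."""
    bad = set()
    # Assumption: cached pages end in .bz2 and live directly in the work directory.
    for page_file in cache_root.rglob("*.bz2"):
        if is_corrupt(page_file):
            bad.add(page_file.parent)
    return sorted(bad)
```

Calling corrupt_works(Path("~/cache/hocr").expanduser()) would then list the directories to review before anything is removed by hand.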

If anyone now sees an instance of a page where it fails the same way (grays out the text box but never replaces it with OCR text) please report it.

This still leaves the OCR quality issue (one word per line as Jan reported above). On that I haven't been able to pinpoint the root cause, but I've made a modified version of the Gadget that tries to compensate for it. I'm waiting for some independent testing to verify it works and produces reasonable results before flagging down an interface-admin to update the project-wide Gadget.

Ok, I've now had some independent testing (Big big thank you to Jan!) that confirms the tweaked Gadget code now produces results that are at least within a reasonable distance of what it used to produce.

@Inductiveload Could you apply this diff (that is, copy User:Xover/Gadget-ocr.js over MediaWiki:Gadget-ocr.js)?

I'll write a note to the enWS Scriptorium to notify the community there that the OCR Gadget now should be functional again. I know at least some non-English Wikisourcen have also reported this problem (I think I see at least Malayalam, French, Russian, and Chinese in the subscribers to this task) so I'll look into the best way to let them know too. In the meantime I suggest we keep this task open to collect any additional problems that may still be lurking.

Several Wikisources (at least the French one) import mul:MediaWiki:OCR.js. I had requested a small change to fall back to the old OCR when the returned text is an error message instead of XML content:

function hocr_callback(data) {
	if ( data.error || data.text.substring(0,5)!="<?xml" ) {

Should this change be integrated in your new version, @Xover?

…fallback to old OCR when got text is an error message instead of XML content:

function hocr_callback(data) {
	if ( data.error || data.text.substring(0,5)!="<?xml" ) {

Hmm. In what scenario are you seeing the returned data start with an XML Declaration and represent an error?

The correct data that is returned does match that pattern because Tesseract's hOCR implementation embeds the microformat in XHTML, so I would assume that check would trigger primarily when the tool is otherwise working correctly. What am I missing?

@Xover, I think it is a misunderstanding.
data.text.substring(0,5) != "<?xml" -> XML is accepted; if it is not XML, it is considered an error.

Thanks a lot! I've just tested the OCR on French Wikisource, and it seems to work, efficiently AND diligently :)

we had lost hope ;)

Thanks very much. Now it works, although the result is very poor, much worse than what I was used to before it got broken: every word is on a new line. At first I thought it might be a problem connected only with this particular work, so I tried it also on Page:The_story_of_Prague.djvu/131 (this work already has its own OCR layer, but I tried to overwrite it with our OCR for testing purposes) and the result is the same: every word on a new line.

Same feedback as @Jan.Kamenicek tonight, although it seemed to work great a week ago.

I've tested it on Chrome, Firefox and Edge with the same problem each time, and it doesn't seem to come from the browser considering for example the following url: https://phetools.toolforge.org//hocr_cgi.py?cmd=hocr&book=Compain%20-%20La%20vie%20tragique%20de%20Genevi%C3%A8ve%2C%201912.pdf%2F230&lang=fr&user=George2etexte for page https://fr.wikisource.org/w/index.php?title=Page:Compain_-_La_vie_tragique_de_Genevi%C3%A8ve,_1912.pdf/230&action=edit&redlink=1. The output contains lots of \n characters (one after each word), which are not explained by line breaks in the text, as shown in this extract of the output at the url given above:

<span class='ocr_line' id='line_1_4' title=\"bbox 257 577 2092 674; baseline -0.004 -22; x_size 95; x_descenders 25; x_ascenders 27\">\n      <span class='ocrx_word' id='word_1_19' title='bbox 257 609 350 652; x_wconf 96'>ou</span>\n      <span class='ocrx_word' id='word_1_20' title='bbox 392 578 755 674; x_wconf 95'>rapporter</span>\n      <span class='ocrx_word' id='word_1_21' title='bbox 782 581 1144 671; x_wconf 72'>l&#39;ouvrage</span>\n      <span class='ocrx_word' id='word_1_22' title='bbox 1176 581 1261 648; x_wconf 96'>de</span>\n      <span class='ocrx_word' id='word_1_23' title='bbox 1294 580 1562 657; x_wconf 94'>l\u2019autre,</span>\n

Thanks a lot for the time you took in repairing the tool, I hope this last step will be easy to repair.
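To check the raw backend response for other pages in the same way, the request URL quoted above can be rebuilt from its parts. The sketch below only constructs the URL and performs no request. The query parameters (cmd, book, lang, user) are the ones visible in that URL; the function name hocr_request_url and the single slash before hocr_cgi.py (the quoted URL has a doubled one) are assumptions.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the URL quoted above, except that a single slash
# is used before the script name (assumption).
HOCR_ENDPOINT = "https://phetools.toolforge.org/hocr_cgi.py"


def hocr_request_url(book: str, page: int, lang: str, user: str) -> str:
    """Build a request URL with the same query parameters as the one quoted above."""
    params = {
        "cmd": "hocr",
        "book": f"{book}/{page}",  # file name and page number, joined by a slash
        "lang": lang,
        "user": user,
    }
    # quote_via=quote with safe="" percent-encodes spaces, commas and the slash.
    return HOCR_ENDPOINT + "?" + urlencode(params, quote_via=quote, safe="")


print(hocr_request_url(
    "Compain - La vie tragique de Geneviève, 1912.pdf", 230, "fr", "George2etexte"
))
```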

@Xover, I think it is a misunderstanding.
data.text.substring(0,5) != "<?xml" -> XML is accepted; if it is not XML, it is considered an error.

@Pols12 and @Mpaa Oh, of course, you're both right and I'm a dummy. :)

Yes, that's a very good workaround for the problems caused by the corrupted cache. I don't like it as an approach when it's not needed, because it will break if Tesseract's hOCR implementation changes (and XHTML has been out of fashion for going on two decades now), but other than that it does no harm and will catch any future instances of the same kind of error.

… every word is on a new line. …

Same feedback as @Jan.Kamenicek tonight, although it seemed to work great a week ago.

Thanks for testing!

A week ago the actual tool was broken, but the local javascript that handled the broken output detected that and applied a (slow, but functional) workaround. Now that the actual tool has been fixed that workaround no longer has any effect, and the weird output you're seeing is due to changes in the output format since the code was originally written.

@Ankry @Koavf frWS (and several other projects) cross-load the script at mul:MediaWiki:OCR.js. Could you apply this diff from enWS (which uses a local copy)? That should, hopefully, fix the problem for most language projects that use this tool.

Done.

@Ankry Thanks! @FreeCorp This looks like it has fixed the issue on the page you linked to. Can you verify?

Thank you very much, now it's OK indeed!

TheDJ claimed this task.
TheDJ subscribed.

Can this be closed now?