
Create an OCR gadget using Tesseract.js
Closed, Resolved · Public

Description

The current OCR implementation on Toolserver is not very stable, and Google OCR has usage limits.

The new version of Tesseract.js (2.0.0) makes high-quality recognition possible directly in the user's browser.
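
For illustration, in-browser recognition with Tesseract.js might look roughly like the sketch below. It assumes the worker API documented for the stable 2.x releases (`createWorker`, `load`, `loadLanguage`, `initialize`, `recognize`, `terminate`); the gadget itself may use a different call sequence. `recognizeInBrowser` and its parameters are hypothetical names, and the library object is passed in as `lib` rather than read from a global.

```javascript
// Sketch only: `lib` is expected to look like the global `Tesseract` object
// from Tesseract.js 2.x; `recognizeInBrowser` is a hypothetical helper name.
async function recognizeInBrowser( lib, image, lang, onProgress ) {
	// The logger receives objects such as { status: 'recognizing text', progress: 0.5 }.
	// `await` works whether createWorker returns a worker or a promise of one.
	const worker = await lib.createWorker( {
		logger: function ( m ) {
			if ( onProgress ) {
				onProgress( m.status, m.progress );
			}
		}
	} );
	try {
		await worker.load();
		await worker.loadLanguage( lang );
		await worker.initialize( lang );
		const result = await worker.recognize( image );
		return result.data.text;
	} finally {
		// Release the worker even if recognition failed.
		await worker.terminate();
	}
}
```

On a page where Tesseract.js is already loaded, a call could be `recognizeInBrowser( Tesseract, document.querySelector( 'img' ), 'eng', console.log )`, assuming the page contains an image element.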

Event Timeline

putnik created this task. May 17 2019, 9:42 AM
Restricted Application added a subscriber: Aklapper. May 17 2019, 9:42 AM
putnik updated the task description. May 17 2019, 9:44 AM

The first version is ready. It is available at https://wikisource.org/wiki/User:Putnik/TesseractOCR.js

To use it, just add the code

mw.loader.load( '//wikisource.org/w/index.php?title=User:Putnik/TesseractOCR.js&action=raw&ctype=text/javascript' );

to your common.js.

You can also add a block with messages in your language before the line that loads the script:

var tesseractOcrI18n = {
	'loading tesseract core': 'Loading Tesseract core',
	'initializing tesseract': 'Initializing Tesseract',
	'loading language traineddata': 'Loading language traineddata',
	'initializing api': 'Initializing API',
	'recognizing text': 'Recognizing text',

	'no text': 'No text retrieved from Tesseract',
	'image not found': 'No image found on this page',
	'button label': 'Get text via Tesseract OCR',
	'loading indicator': 'Animated loading indicator',
};
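
How the gadget consumes these overrides is not shown in this task. As a minimal sketch, a lookup could prefer the user's translation and fall back to a built-in default; `defaultMessages` and `getMessage` are hypothetical names, not taken from the script, and only three of the keys are repeated here.

```javascript
// Hypothetical defaults; the real gadget's internals are not shown in this task.
const defaultMessages = {
	'loading tesseract core': 'Loading Tesseract core',
	'recognizing text': 'Recognizing text',
	'no text': 'No text retrieved from Tesseract'
};

// Return the user's translation if present, else the default, else the key itself.
function getMessage( key, overrides ) {
	const custom = overrides || {};
	if ( Object.prototype.hasOwnProperty.call( custom, key ) ) {
		return custom[ key ];
	}
	if ( Object.prototype.hasOwnProperty.call( defaultMessages, key ) ) {
		return defaultMessages[ key ];
	}
	return key;
}
```

Under this scheme a partial `tesseractOcrI18n` block would be enough: any key left untranslated falls back to the English default.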
putnik closed this task as Resolved. May 17 2019, 12:31 PM
Trizek-WMF added a subscriber: Trizek-WMF.

Since it targets all Wikisources, it is worth including in Tech News.

I understand the change as "OCR service has been improved to use Tesseract.js OCR". Is that correct?

Johan added a subscriber: Johan. May 23 2019, 10:08 AM

Ping @putnik ^

Does this make sense for an announcement?

Johan removed putnik as the assignee of this task. Jul 8 2019, 11:05 AM
Johan removed a project: User-notice.