
Spike: New/Improved OCR tool [8 hours]
Closed, Resolved (Public)

Description

As a Wikisource user, I want the team to investigate the "New OCR tool" wish, so they can consider the various options and risks of this top Wikisource wish.

Background: In the 2020 Community Wishlist Survey, the #2 wish was "New OCR Tool." We may not be able to create a new OCR tool, but we can probably improve existing tool(s). The purpose of this spike is to investigate potential options.

Relevant Resources:

Acceptance Criteria:

  • Review the 2020 wish & relevant Phabricator tasks (see links above)
  • Review available OCR tools/options (i.e., OCR Gadget, IndicOCR, Google OCR, OCR4Wikisource, hOCR, Internet Archive)
  • Investigate various options outlined in the All Hands brainstorms doc
  • Determine if we can or cannot create a new OCR tool (so we know what to tell the community about this option)
  • Determine if we can improve the existing tools -- and, if so, what the improvements could be
  • Provide an analysis of potential risks associated with this project from a technical perspective
  • Provide an analysis of potential dependencies associated with this project from a technical perspective
  • Provide a rough estimate/sense of difficulty or effort required by this project

Event Timeline

ifried renamed this task from Spike: New/Improved OCR tool [placeholder] to Spike: New/Improved OCR tool. Feb 4 2020, 1:03 AM
ifried updated the task description.
ifried renamed this task from Spike: New/Improved OCR tool to Spike: New/Improved OCR tool [8 hours]. Feb 5 2020, 12:49 AM
ifried moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Determine if we can or cannot create a new OCR tool (so we know what to tell the community about this option)

  • Creating a new OCR tool is not an option; it would take a great deal of time and effort. The approach will instead focus on enhancing and stabilizing what is already in place.

Review available OCR tools/options (i.e., OCR Gadget, IndicOCR, Google OCR, OCR4Wikisource, hOCR, Internet Archive)

  • Users are running OCR on Wikisource in several different ways.
  • We should generalize the OCR process so that we can focus our effort on improving the most common workflow.
  • Additionally, we can focus on adding (or replacing the existing buttons with) an extension-based toolbar button that talks to a new Toolforge tool, which would be easier to support in the long run.
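
To make that idea concrete, here is a minimal sketch of what such a Toolforge tool could look like, assuming it is a small web service that the toolbar button calls with an image URL, engine, and language. Flask is used only for illustration, and the endpoint and parameter names are hypothetical, not a settled API.

  from flask import Flask, jsonify, request

  app = Flask(__name__)


  def run_ocr(image_url: str, engine: str, lang: str) -> str:
      # Placeholder: fetch the scan and hand it to Tesseract or Google OCR.
      raise NotImplementedError


  @app.route("/api/ocr")
  def ocr():
      image_url = request.args.get("image")             # URL of the page scan
      lang = request.args.get("lang", "eng")            # user-selected language
      engine = request.args.get("engine", "tesseract")  # "tesseract" or "google"
      text = run_ocr(image_url, engine=engine, lang=lang)
      return jsonify({"engine": engine, "lang": lang, "text": text})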

Determine if we can improve the existing tools -- and, if so, what the improvements could be

  • Running the OCR (either via Google API or Tesseract) is reasonably straightforward; most of what we need to do is make the options of both of these — notably, the language options — available to the users.
  • We are currently running a very old Tesseract version on Toolforge, so upgrading it could be one of the first steps for the OCR tool that uses this library.
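
As a rough illustration (not a settled design), the backend could query the local Tesseract install for its version and language packs and expose those to users. This assumes the tesseract CLI is on the Toolforge host's PATH; the function names are made up.

  import subprocess


  def tesseract_version() -> str:
      out = subprocess.run(["tesseract", "--version"],
                           capture_output=True, text=True, check=True)
      # Old Tesseract (3.x) prints version information to stderr.
      return (out.stdout or out.stderr).splitlines()[0]


  def available_languages() -> list[str]:
      out = subprocess.run(["tesseract", "--list-langs"],
                           capture_output=True, text=True, check=True)
      # The first line is a header ("List of available languages (N):").
      return (out.stdout or out.stderr).splitlines()[1:]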

Provide an analysis of potential risks associated with this project from a technical perspective

  • One of the first steps could be upgrading Tesseract and seeing what it would take to support more languages. This might fix a handful of issues that users are currently experiencing.
  • We can work on replacing the Phetools and Google OCR gadgets with a new extension. We could also leave the current OCR tools alone and instead build a completely new backend and UI that would be easier to maintain and support.

Provide an analysis of a potential resolution for this project from a technical perspective

  • The safest approach would be to create a new OCR tool that supports the two main OCR engines: Tesseract and Google OCR. These two engines together cover most languages, if not all.
  • We would need to add to the wiki editor interface the ability to trigger this new OCR functionality and allow the user to select between the two engines. We need to keep in mind that the wiki editor interface is limited when it comes to adding new elements.
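
A minimal sketch of the engine dispatch, assuming the tesseract CLI and the google-cloud-vision client library are available; the function names and parameters here are illustrative rather than a settled API.

  import subprocess

  from google.cloud import vision  # assumes google-cloud-vision is installed


  def ocr_with_tesseract(image_path: str, lang: str) -> str:
      out = subprocess.run(["tesseract", image_path, "stdout", "-l", lang],
                           capture_output=True, text=True, check=True)
      return out.stdout


  def ocr_with_google(image_path: str, lang: str) -> str:
      client = vision.ImageAnnotatorClient()
      with open(image_path, "rb") as f:
          image = vision.Image(content=f.read())
      response = client.document_text_detection(
          image=image, image_context={"language_hints": [lang]})
      return response.full_text_annotation.text


  def ocr_page(image_path: str, engine: str = "tesseract", lang: str = "eng") -> str:
      # Dispatch a single page to whichever engine the user selected in the editor.
      if engine == "google":
          return ocr_with_google(image_path, lang)
      return ocr_with_tesseract(image_path, lang)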

Provide a rough estimate/sense of difficulty or effort required by this project

  • It is hard to say how long this would take, but it should not be too difficult. It will require significant time and effort, though. We should have a better sense of the timeline once we dive into the project and start ironing out the details.

Are we going to be okay making the Wikisource extension depend on Toolforge services? If not, we'll have to make a gadget.

@Samwilson Could you outline the details and risks associated with this choice? I don't know enough about the tradeoffs to form a helpful viewpoint.

We don't have anything like that in production so far, so I'm not sure, especially given the security concerns. We might need a gadget where each user must whitelist the Toolforge URL individually for security and privacy reasons.

I'm going to add some complexity here. @kaldari is connecting me with some folks at Internet Archive to discuss the possibility of our integrating more closely with their OCR workflow. The general idea would be to make the upload to IA and creation of pages in Wikisource more automated.

We know that IA is a part of these users' workflows. Making it more automated would benefit us, our users, and the IA community. I don't know what this would look like specifically but I'm going to talk to them about the options.

We should continue to consider the other options and work this in when there are more details.

Closer integration with IA sounds brilliant! I feel like that might solve the side of this wish that hasn't quite been defined so far: the whole-work OCR process. I think we can probably carry on and make a general Google/Tesseract single-page toolbar button, and then separately have a better system for pulling whole works from IA.

I think we'd be able to do something better than the current state of IA Upload, or at least build on that to bring in some of the manual processes and make it more friendly (as well as actually work in more cases!).

I created an inventory of which OCR services support which Wikisource languages:
https://docs.google.com/spreadsheets/d/1qJIs1GMY4vY3qS6TimvuoMk6f_dTOB9xFjmq8eJrISw/edit?usp=sharing

Interestingly, the language coverage for all the services is pretty similar, except for Internet Archive, which doesn't support Indic languages, Bosnian, or Haitian.

Languages with no OCR support

Languages which have no support are Venetian, Neapolitan, Piedmontese, Limburgish, Min Nan, and Anglo-Saxon.

Venetian, Neapolitan, and Piedmontese are similar to Italian; Venetian also includes the "ç" character, and Piedmontese also includes the "ë" and "ò" characters. Limburgish is similar to German. It may be possible for these languages to be handled by the Google Vision Latin script model, but it may mistakenly map them to Italian and German automatically, causing errors.

Min Nan is a Romanized version of Chinese used in Taiwan. It has a complicated diacritics system to mark tone, so a simple Latin script model wouldn't work. Currently, though, there are zero text scans on the Min Nan Wikisource, so they don't actually need any OCR (for now).

Anglo-Saxon (aka Old English) is a closed Wikisource and doesn't have any text scans, so we don't need to worry about it.

Thanks @kaldari ! That's great.

I just had a thought about our goals for this work, and maybe the role of OCR on Wikisource in general. That is, are we looking for exactness, or simply something that provides 80% coverage, with the transcribers and verifiers fixing up the rest? Given that Wikisource is basically a transcription effort, it seems like beating ourselves up over what might be a small change in accuracy is perhaps misguided.

Instead, maybe we should focus on making the workflow as fast and easy as we can. We could try to remove as many of the "do this on your local machine" steps as possible and work more on efficiency than on accuracy.

@aezell - Good question. The wishlist request mentions quality a couple of times, but doesn't give a specific threshold for what level of quality is required. It seems some of the biggest issues raised in the wishlist request were dealing with diacritics and multi-column text. I think if we managed to tackle those issues and the uptime problem, the community would be more or less satisfied. Raw OCR accuracy would hopefully be improved incrementally by whoever was maintaining the service regardless.

My suggestion would be to create some test case images that include things like italics, diacritics, and multiple columns, and run them through the various services (in various languages) to see how they do. That should help us narrow down our options. It may turn out that we only want to go with a single service after all.
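
To sketch what that comparison could look like (the engine wrappers and test files here are placeholders), each test image would be run through every candidate service and scored against a hand-made reference transcription:

  import difflib
  from pathlib import Path
  from typing import Callable, Dict

  # Hypothetical test set: page image -> hand-made reference transcription.
  TEST_CASES: Dict[str, str] = {
      "italics.png": Path("italics.txt").read_text(encoding="utf-8"),
      "diacritics.png": Path("diacritics.txt").read_text(encoding="utf-8"),
      "two_columns.png": Path("two_columns.txt").read_text(encoding="utf-8"),
  }


  def accuracy(candidate: str, reference: str) -> float:
      # Crude similarity score in [0, 1] between OCR output and the reference.
      return difflib.SequenceMatcher(None, candidate, reference).ratio()


  def compare(engines: Dict[str, Callable[[str], str]]) -> None:
      for image, reference in TEST_CASES.items():
          for name, run_engine in engines.items():
              score = accuracy(run_engine(image), reference)
              print(f"{image:20s} {name:15s} {score:.2%}")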

I created a new task for the engine testing at T246944. Not sure if it should be a subtask of this one or a follow-up task.

Another tool is https://tools.wmflabs.org/tesseract-ocr-service (but it doesn't work at the moment).

The easiest way at the moment to test Tesseract on Toolforge is probably to log in and run e.g. tesseract test.png tes-out -l eng.

I'm not sure what next steps we're ready for at the moment with this task, but are we going to build on the same codebase as the existing tool? If we are, then we should move its tool-ws-google-ocr repository from Phabricator to Gerrit (because repos in Phab are deprecated). I was going to just do this, but it seems that we might want to rename it at the same time, to get rid of the 'google' in the name. Or is that a bit premature? Something like wikisource-ocr?

I think the next steps here are to test the existing IA APIs for conversion and see what features they might make available to us.

@Samwilson Would you be interested in taking this over and doing that investigation?

There is no OCR-specific API at the IA. The way it currently works is:

  1. a set of images is uploaded to a (new) IA item, in a zip file with a name ending in _images.zip;
  2. IA ingests this and runs OCR, adding new files to the item such as examplename.pdf, examplename_abbyy.gz, examplename_djvu.txt, examplename_djvu.xml;
  3. the resulting PDF has a text layer; it is copied to Commons, an Index page is created on Wikisource, and proofreading can begin.

The ia-upload tool can be used to do the copying to Commons, and can optionally instead use the examplename_djvu.xml file to create a DjVu file and upload that. This preserves better metadata about the OCR (because PDFs don't really have a 'text layer'; they just have positioned bits of text). IA used to create DjVus but stopped because it's not a widely used format (I think that's why).

The PDF is not an archival replacement for the original files, so doing things this way means that IA is a crucial part of the Wikisource transcription process, not only because it is the only home of the original scans but also because those scans need to be retrieved when extracting illustrations etc. to upload to Commons.

If the IA OCR is not of sufficient quality, then the various gadgets are used to request OCR from Google, Tesseract, etc., one page at a time.
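
For reference, step 1 of that workflow can be scripted with the internetarchive Python library. The item identifier, file name, and metadata below are made up, and the upload assumes IA S3 credentials are already configured (e.g. via ia configure); IA's own derive process then performs the OCR and produces the derived files mentioned above.

  from internetarchive import upload

  upload(
      "example-wikisource-scan",                     # hypothetical IA item id
      files=["example-wikisource-scan_images.zip"],  # name must end in _images.zip
      metadata={
          "mediatype": "texts",
          "title": "Example scanned work",
          "language": "eng",
      },
  )
  # Once IA ingests the zip and runs OCR, the item gains derived files such as
  # *_djvu.txt and *_djvu.xml, which ia-upload can then use.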

@Samwilson and @kaldari have figured out how to make the Google Vision API work in the vast majority of the cases that we are considering supporting.

Closing this task.