
Wikisource: Investigate adding support for bulk OCR to Wikimedia OCR [16H]
Closed, Resolved, Public

Description

As a product manager, I want to know the options available for adding bulk OCR to Wikimedia OCR, so that we can potentially implement a massive upgrade to Wikimedia OCR that can benefit many Wikisource users.

Acceptance Criteria:

  • Investigate the general work required to add bulk OCR capabilities to Wikimedia OCR, so that if a user is using the OCR tool they can click an OCR button (potentially from the Index page) to OCR the whole book
  • Provide a general proposal for how this can be accomplished from a technical perspective
  • Provide description of risks and challenges of this work
  • Provide a proposal for how these situations can be handled:
    • If part of the book has already been OCR-ed page-by-page, what would be the possible options that the product manager could choose as recommended behavior (such as overriding all of the previous page-by-page OCRs, keeping them and mixing them with newly OCR-ed pages, etc.)?
    • If pages have been OCR-ed or manually typed, and they have also been proofread or validated, what options would be available for the behavior (such as keeping text, overriding text, etc.)?

Event Timeline

ARamirez_WMF renamed this task from Wikisource: Investigate adding support for bulk OCR to Wikimedia OCR to Wikisource: Investigate adding support for bulk OCR to Wikimedia OCR [16H]. Mar 18 2021, 11:50 PM
ARamirez_WMF moved this task from To Be Estimated/Discussed to Estimated on the Community-Tech board.
Samwilson added a subscriber: Samwilson.

The existing workflow for bulk OCR on Wikisource is generally something along the lines of: uploading a PDF/DjVu to the Internet Archive or Google Drive to get a text layer added (or doing it locally), and then uploading the resulting document to Commons. At that point proofreading can start, with the initial text of Page pages being extracted from the document and displayed to the proofreader.

We have a couple of broad ways we can help with this process: a) we can add a process that automates the OCR part of that, such that the file is first uploaded to Commons (without a text layer) and our tool takes it from there, runs OCR, and uploads a new version of it; or b) extend the existing page-by-page OCR process and add a button to Index pages that will go through all (empty) pages, run OCR, and add the resulting text to the wiki page (in the Page namespace).

  1. The first way is preferable, as it aligns most with the status quo, and is likely to have benefits beyond just Wikisource (i.e. anyone uploading a PDF to Commons can get it OCR'd). However, its UI is likely to be more involved. We would probably start by creating a web UI for it, similar to the existing one where the user gives a URL and a language. Later, we could add a Commons- and Wikisource-side way of launching the same tool. The OCR process could take a bit of time and so we'd need a way to indicate progress (possibly a job queue, and we'd upload the new file revision either as the user, or as a tool account). Adding a Commons-side button/dialog or similar is difficult because there's no extension that the code could live in; possibly we'd implement it as a Gadget, although it'd be preferable to not (for all the usual reasons such as i18n, testability, etc.). The Wikisource-side stuff we can add to ProofreadPage or All-and-every-Wikisource.
  2. The second way (and these aren't exclusive either; we could certainly do both) is conceptually closer to the existing WikiEditor toolbar button: it adds a button to the Index page that says something like "Get OCR for all empty pages" and the user then watches the progress of the tool via the existing grid of page numbers, as each of them would turn from white to pink as the page text was added. Another approach to this page-by-page method would be to add an opt-in way for the user to indicate that they want OCR to be run automatically as soon as they open an empty page. This would keep the UI even simpler, at the cost of a slower load time on each page (how much slower, I'm not sure; oftentimes it's very quick, although as we get more into the complexities available with Tesseract this could change).

One of the most common bulk OCR workflows on Wikisource is to first upload to Internet Archive. This is a good system: if you've scanned a work and have a pile of PNGs, you zip these up and upload to IA, where within a few tens of minutes they're turned into an OCR'd PDF and various other derivative formats. Copying the output to Commons is reasonably easy with IA Upload (not to say there aren't bugs with that process!), and at that point proofreading can begin. The improvements we're hoping to make are all around the OCR part of this: we'll offer Google and Tesseract as alternative engines (with each having various tweakable options that should help with things like columns and mixed languages), but we'll not be touching anything to do with either the initial combining of scanned images into a single file, or (crucially) the later use of the original files to extract and crop illustrations (the PDF/DjVu illustrations are often lower quality, so one must go back to the original scans). These might mean that the IA workflow is still used, and so our OCR improvements won't be as widely beneficial as they could be. Ideally, our bulk OCR work will be a step towards a future optimal workflow that is as simple and high-quality as possible.

All up, I think an MVP for bulk OCR would be to add to the web UI an option to upload an OCR'd version of any existing PDF.

We'd also need to figure out how to do this on Index pages comprised of single images, such as https://en.wikisource.org/wiki/Index:Lippincotts_Monthly_Magazine_51 . This should probably be done somewhere on an Index page.

It occurs to me that storing the OCR text and the Proofread text on Commons with the original image could actually become a very valuable dataset for investigating where OCR fails and help Wikisource at the same time. It would probably look something like this.

  1. Upload Image on Commons
  2. Run OCR on image.
  3. Save OCR to field "WikiOCR"
  4. Load "WikiOCR" to the Wikitext field of ProofreadPage extension
  5. When the state changes to Proofread -> Update Commons file with field "Proofread_Text" (Repeat after any future commit)

Then for each Image, we'd have a machine generated OCR to compare with a Proofread text. It would be possible to do a diff on the two and then compare them across thousands and later millions of examples. From this, we could sort the greatest errors. When pushing a fix to OCR, we'd be able to bulk test the fix by rerunning the OCR and seeing whether or not the total number of that error increased or decreased. To make the dataset even more useful, we'd probably also want to store language codes for each image.

For instance, we could have
File = Lippincotts Monthly Magazine 55 uiug.30112045355614 010.jpg
{
WikiOCR = {Copyright, 1895, by J. B. LIPPINCOTT COMPANY.
PRINTED BY J. B. LIPPINCOTT COMPANY, PHILADELPHIA, U.S.A.
Digitized by
Google
Original from
UNIVERSITY OF ILLINOIS AT
URBANA-CHAMPAIGN};

Proofread_Text = {Copyright, 1895, by J. B. LIPPINCOTT COMPANY.
PRINTED BY J. B. LIPPINCOTT COMPANY, PHILADELPHIA, U.S.A.}

Language = {en}
}

From the diff, we get a +1 for removing
"Digitized by
Google
Original from
UNIVERSITY OF ILLINOIS AT
URBANA-CHAMPAIGN"

from the WikiOCR output.
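
As a rough illustration of the comparison described above, here is a minimal sketch using Python's standard difflib. The function names are hypothetical, and the choice to count each run of deleted OCR lines as one "error" is an assumption, not something the proposal specifies. Lines that the proofreader changed rather than deleted are ignored here.

```python
import difflib
from collections import Counter
from typing import Iterable, List, Tuple


def removed_blocks(ocr_text: str, proofread_text: str) -> List[str]:
    """Return each run of consecutive OCR lines that the proofreader deleted."""
    ocr_lines = ocr_text.splitlines()
    proofread_lines = proofread_text.splitlines()
    matcher = difflib.SequenceMatcher(a=ocr_lines, b=proofread_lines, autojunk=False)
    blocks = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            blocks.append("\n".join(ocr_lines[i1:i2]))
    return blocks


def tally_removals(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each deleted block occurs across (OCR, proofread) pairs."""
    tally = Counter()
    for ocr_text, proofread_text in pairs:
        tally.update(removed_blocks(ocr_text, proofread_text))
    return tally


# The example page from the comment above.
proofread_text = (
    "Copyright, 1895, by J. B. LIPPINCOTT COMPANY.\n"
    "PRINTED BY J. B. LIPPINCOTT COMPANY, PHILADELPHIA, U.S.A."
)
watermark = (
    "Digitized by\nGoogle\nOriginal from\n"
    "UNIVERSITY OF ILLINOIS AT\nURBANA-CHAMPAIGN"
)
wiki_ocr = proofread_text + "\n" + watermark

tally = tally_removals([(wiki_ocr, proofread_text)])
```

Run over many pages, `tally.most_common()` would list the most frequently removed blocks first, which is one way the "sort the greatest errors" step could be approached.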

We'd also need to figure out how to do this on Index pages comprised of single images, such as https://en.wikisource.org/wiki/Index:Lippincotts_Monthly_Magazine_51 . This should probably be done somewhere on an Index page.

Good point. I think the Index page button that I mention above would work for that. Perhaps we should start with that approach, before the whole-file one.

It occurs to me that storing the OCR text and the Proofread text on Commons with the original image could actually become a very valuable dataset for investigating where OCR fails and help Wikisource at the same time.

That feels like it's out of scope of this task and probably this (CommTech) project in general. A good idea, but a big undertaking and one that'd need careful designing.

There are two major and related questions to answer here. First, when should the OCR tool be run? Second, how should the result be stored? While it’s tempting to run the OCR on the Index page, OCRing an entire book takes a considerable amount of time during which the user cannot edit, wasting valuable user time and potentially resulting in the user leaving. Furthermore, it’s not actually necessary to wait that long because OCR can be performed earlier. When OCR can be performed depends on how the individual scans that make up a book are stored. These are the major options that I can think of:

Approach A) Individual image files on Commons

Pro:

  1. OCR can be performed immediately upon uploading.

Cons

  1. Trying this out has brought howls of protest from Commons and Wikisource admins and long-term users, with lots of generous offers to help so that I can stop making everybody’s life harder.
  2. Managing a project becomes an absolute nightmare. Anything from updating metadata to renaming files becomes a massive bot job.
  3. Balloons the number of files to review by a factor of several hundred to thousands. Instead of reviewing one page, patrollers have to review anywhere from 10 to over 2,000.
  4. Makes it extremely difficult to verify the license because such information is not present on most pages.

Approach B) Zipping an entire set of files into a zip/cbz/7z/rar/etc. Not viable due to security concerns.

Approach C) Storing the files in a PDF.

Pros

  1. All images are grouped into one file that has good OSS support.
  2. Well understood.

Cons

  1. An extremely complicated format that we’re essentially using as a glorified zip file.
  2. Creates an unnecessarily complex chain: original book -> images -> PDF -> Commons -> decomposing to individual images for OCR -> compiling to PDF -> extracting an individual image for use in a ProofreadPage Page.
  3. Lots of current bugs in Phabricator.
  4. A failed upload forces the user to start from scratch, making this more difficult for users with slow or unreliable connections.
  5. Adds unnecessary bloat.

Approach D) Design a format for multi-image page on Commons

Pros

  1. Allows scans to be grouped into one logical unit.
  2. Avoids PDF entirely, saving users a considerable amount of time. This would make the chain Book -> Images -> Commons upload -> OCR performed as soon as each individual image is imported -> Importation into Wikisource.
  3. Easy to patrol, modify, or verify license status.
  4. Faster load times on Wikisource because images are directly accessible.
  5. Easier integration into Wikimedia.
  6. Can be useful for other contexts.
  7. Wouldn’t create unnecessary work for administrators, patrollers, and other volunteers.
  8. More sustainable for the future.
  9. Less of a security risk than zip files.
  10. Would simplify importation from sites such as HathiTrust or IA.
  11. More robust in case of failure because files that were already uploaded would not need to be reuploaded.

Cons

  1. Requires more engineering. The format will probably have to be something like this:

{
Author =
Title =
Blah blah

All the standard good stuff that we include already on a media page

}

Filelist =
{
1: { // Sequence number that dictates how the files are ordered

 Filename = Name of the individual image
 OCR = output of OCR
 Language = optional parameter to set the OCR language
 OCRVersion = Records the release information of the OCR program
}

....
N:
}

  2. Common Uploader and Pattypan would need to be updated to handle this.
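
To make the shape of such a container a little more concrete, here is a minimal sketch of the structure above as Python dataclasses serialized to JSON. Every class and field name is hypothetical and simply mirrors the outline (Author, Title, Filelist, Filename, OCR, Language, OCRVersion); no such format exists on Commons, and JSON is only an assumed serialization.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScanEntry:
    """One scanned image in the container (hypothetical structure)."""
    filename: str
    ocr: str = ""                       # raw OCR output for this image
    language: Optional[str] = None      # optional OCR language code
    ocr_version: Optional[str] = None   # release of the OCR program used


@dataclass
class ScanContainer:
    """Book-level metadata plus an ordered list of scans."""
    author: str
    title: str
    # Keyed by sequence number, which dictates the page order.
    filelist: Dict[int, ScanEntry] = field(default_factory=dict)

    def ordered_files(self) -> List[str]:
        """Return filenames sorted by sequence number."""
        return [self.filelist[seq].filename for seq in sorted(self.filelist)]


# Placeholder values; entries may arrive out of order during upload.
book = ScanContainer(author="Example Author", title="Example Title")
book.filelist[2] = ScanEntry(filename="Example 002.jpg", language="en")
book.filelist[1] = ScanEntry(filename="Example 001.jpg", language="en")

serialized = json.dumps(asdict(book), indent=2)
```

Keying the file list by sequence number means a failed upload only needs the missing entries added later, which matches the robustness argument made for this approach.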

As to your response to my second comment, I don’t necessarily think that we should analyze the data, but I think that it would make an extraordinarily valuable dataset to collect for the designers of the OCR software. So, it might make sense to think a bit about how much work it would be to create this dataset. It seems to me that it would only require storing the raw OCR, the corrected text, and the OCR software version. Then the designers of the OCR software can take over and perform the analysis and write the tools for that.

There are two major and related questions to answer here. First, when should the OCR tool be run? Second, how should the result be stored? While it’s tempting to run the OCR on the Index page, OCRing an entire book takes a considerable amount of time during which the user cannot edit, wasting valuable user time and potentially resulting in the user leaving. Furthermore, it’s not actually necessary to wait that long because OCR can be performed earlier. When OCR can be performed depends on how the individual scans that make up a book are stored.

The first goal of our work here is to unify the existing toolbar buttons, and provide a way to insert a page's OCR text into the wikitext edit box. So, that much I think is clear — the OCR should be run on demand, and the text stored in the page. The second part to tackle is the bulk OCR. In which case it seems sensible to have a system that builds on the API we build for the first use case. How about the Index page operation works something like this:

  1. user clicks an OCR button on the Index page;
  2. they choose any configuration options (same as the per-page button: what engine, languages, PSM, etc.);
  3. a job is lodged with the tool;
  4. the user can then navigate away at any point, including to edit the pages that are about to be OCR'd;
  5. when the job runs, it loops through to the first non-existent page, and OCRs it;
  6. the OCR text is inserted into the page (only if it's a new page).

This would mean that multiple people could start the bulk OCR process without any problem, and it wouldn't clobber any work that's already been started. It'd also work with individual-image Index pages as well as PDF- or DjVu-based ones.
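
The job in steps 5 and 6 could look roughly like the sketch below. It is only an illustration under stated assumptions: a plain dict stands in for the wiki, `ocr_page` is a hypothetical callable standing in for the OCR tool's API, and nothing here is a real MediaWiki call.

```python
from typing import Callable, Dict, Iterable, List


def run_bulk_ocr(
    page_titles: Iterable[str],
    wiki: Dict[str, str],
    ocr_page: Callable[[str], str],
) -> List[str]:
    """OCR every page that does not exist yet; never touch existing pages."""
    created = []
    for title in page_titles:
        if title in wiki:
            continue  # someone already started this page, so leave it alone
        text = ocr_page(title)
        # Check again: the page may have been created while OCR was running.
        if title not in wiki:
            wiki[title] = text
            created.append(title)
    return created


# One page has already been typed by hand; the other two are still empty.
wiki = {"Page:Book.pdf/2": "hand-typed text"}
titles = ["Page:Book.pdf/1", "Page:Book.pdf/2", "Page:Book.pdf/3"]

first_run = run_bulk_ocr(titles, wiki, lambda title: "OCR of " + title)
second_run = run_bulk_ocr(titles, wiki, lambda title: "OCR of " + title)
```

Because existing pages are skipped, a second run (for example, started by another user) creates nothing and overwrites nothing, which is the property described above.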

The main downsides are, I think:

  • it becomes easy to populate Page pages with text and then never clean them up (making search very annoying, among other things);
  • it's not possible to add the text layer to the PDF/DjVu/etc. (I don't think this matters very much, because the text that we care about and that this whole project is built around is the proofread text on Wikisource).

I agree with your points about the different means of storage on Commons, but the reality is that we're not able to move away from a mix of all of these, and this project really isn't about that anyway.

Design a format for multi-image page on Commons

Does multi-page TIFF suffice for this?

@Samwilson I think the current OCR tool will read ahead in the current file and OCR the other pages in the background and cache the results, on the assumption that if you want one, you or others will want more. But I'm not sure how far ahead it goes.

I'm not sure we want to bulk-dump the OCR into the page. Caching it server side and allowing the user to insert it with the OCR button (or even automatically according to a config, maybe on a per-index basis) would avoid the problem of spamming thousands of red pages comprising raw OCR (which is already somewhat out of hand at enWS).

I don't think writing the OCR back into the files is really useful because

  • that'll be lots of updates to the file histories and can't really be viewed as a diff
  • the generated OCR might not be better than the original (IA OCR is generally not bad, for a start)

Perhaps, if we actually wanted to make this accessible, various OCR results (e.g. by lang/model) could be inserted into a slot of the file page? I have no idea how slots work, but they sound good!

There's also T59807, which is one step further and is the (more useful?) idea of merging the proofread/validated text back into the DjVu or PDF. Which is a fairly technically hairy prospect.

  • it's not possible to add the text layer to the PDF/DjVu/etc.

This tool explicitly shouldn't do that, because there may be reasons to preserve the original and editing the Commons files has too many implications. It may make sense to do so, but in different contexts / parts of the workflow. For example, I think the part of the workflow where ia-upload currently sits could beneficially have functionality for adding / creating / replacing a text layer (possibly coming from individual scan images, but possibly also from a container format like PDF/DjVu) that will end up in the file on Commons. There is also a possible future state where the output of the ws-export stage of the workflow—possibly created automatically on some "work is finished" trigger—emits the corrected text layer back to the source file. But that latter one will take a lot of thinking to do right, and will probably only really make sense in combination with other pie-in-the-sky features (like feeding back corrected OCR for training purposes).

Does multi-page TIFF suffice for this?

There's a reason multi-page TIFF is very little used after the bottom dropped out of the market for fax machines: it provides very few benefits and is hard to work with unless you already have a TIFF-based workflow. You really don't want to see what a thousand page TIFF file looks like…

I think the "container" need for an otherwise image-based solution here looks a lot like a Wikisource Index: page or a BookMaker Book: page, except with a more sensible technical implementation (something SDC-ish maybe). Conceptually a metadata-ish construct that collects and orders (and possibly adds other metadata to) otherwise ordinary File: images. Combined with either an MCR-based content slot for a text layer (of which there could be multiples: one for each lang Wikisource, plus things like Translation:), or something TimedText (or whatever that video-related work was called) equivalent where the text layer gets its own wikipage associated with but separate from the file.

In any case, it's entirely doable if sufficiently highly prioritized, but way out of scope for this project (and I would guess, also too wide-reaching for CommTech to tackle alone in any event). With all the supporting UI and functionality that would be needed it's a pretty tall order.

I think the current OCR tool will read ahead in the current file and OCR the other pages in the background and cache the results, on the assumption that if you want one, you or others will want more. But I'm not sure how far ahead it goes.

Yup. Phe's OCR tool, on receiving a request for a given page of a brand new work, launches a Gridengine job that pulls down the DjVu; runs OCR on it page by page; compresses it and stores it in a rudimentary (directory structure indexed by file hash) cache; and then returns the OCR text for the page requested. On the next request it pulls the OCR directly from the cache.
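
For readers unfamiliar with that kind of cache, here is a minimal sketch of a compressed, file-hash-indexed directory cache. It is not Phe's code: the class name, the SHA-1 hash, the gzip compression, and the one-file-per-page layout are all assumptions made for illustration.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def hash_file_bytes(data: bytes) -> str:
    """Identify a source file by the hash of its contents (SHA-1 assumed)."""
    return hashlib.sha1(data).hexdigest()


class OcrCache:
    """Stores compressed per-page OCR text under a directory named by file hash."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, file_hash: str, page: int) -> Path:
        # Two-character prefix directory keeps any single directory small.
        return self.root / file_hash[:2] / file_hash / f"{page}.txt.gz"

    def put(self, file_hash: str, page: int, text: str) -> None:
        path = self._path(file_hash, page)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(gzip.compress(text.encode("utf-8")))

    def get(self, file_hash: str, page: int) -> Optional[str]:
        path = self._path(file_hash, page)
        if not path.exists():
            return None  # cache miss: the caller would run OCR and then put()
        return gzip.decompress(path.read_bytes()).decode("utf-8")


cache = OcrCache(Path(tempfile.mkdtemp()))
djvu_hash = hash_file_bytes(b"placeholder bytes standing in for a DjVu file")
cache.put(djvu_hash, 1, "OCR text of page 1")
```

A later request for the same page reads straight from disk via `cache.get(djvu_hash, 1)`, while an unprocessed page returns `None`.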

In fact, that's why it was dead for a year: some transient error caused it to store invalid data in the cache for a period of time, and when it later tried to read that data it choked. See T228594 for all the messy details.

It was designed a long time ago and with a primary focus on interactive performance, so it has an "optimization" where it prefers to return the original text layer in the DjVu (the reason it is perceived as giving better OCR results is mostly that it applies a couple of regex fixups to the text layer).

But for all that the approach is a sound one. Triggering bulk OCR on the first request for a page in a work is early enough (vs. doing it pre-emptively). If the first page requested is returned as soon as that page has been processed, and that page is processed first, the interactive performance will be "good enough" (comparable to the current on-demand OCR tools). Future sequential pages will most likely be fast because the OCR progresses faster than the proofreading. Random access to pages will be returned at on-demand speed until the whole work is processed, but that will be a trivial gap given how long it takes to proofread a whole work. On-demand OCR for a single page isn't actually all that slow either (a couple of seconds), so it's only a problem in aggregate.
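
One way to read "that page is processed first" is as an ordering over the pages of the work. The sketch below is an assumption about how that could be done (requested page, then the pages after it, then the earlier ones); the function name is hypothetical and the thread does not prescribe any particular order beyond the first page.

```python
from typing import List


def processing_order(page_count: int, requested: int) -> List[int]:
    """Order pages so the requested one is OCR'd first, then those that follow."""
    if not 1 <= requested <= page_count:
        raise ValueError("requested page is outside the work")
    following = list(range(requested, page_count + 1))
    earlier = list(range(1, requested))
    return following + earlier


order = processing_order(page_count=5, requested=3)
```

With this order a proofreader who starts at page 3 and continues sequentially would normally find pages 4 and 5 already processed, and pages 1 and 2 are filled in last.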

Phe's tool caches these indefinitely, and that's probably just as well. It is common for works to be started, abandoned, and taken back up. Already having the OCR cached means new requests will be fast. The text also compresses well so having an infinite cache isn't going to be a big resource drain and will make the implementation simpler. Especially since you will in any case need some way to trigger regenerating the OCR for a work due to having multiple OCR engines and page segmentation modes.

the problem of spamming thousands of red pages comprising raw OCR (which is already somewhat out of hand at enWS)

Indeed. And that's why there shouldn't be a button on Index: pages to generate OCR for the work, and especially not one that pre-creates all the Page: pages. It is 100% certain that someone will go through every single Index: and press that button preemptively, thinking they are "helping out". Caching it server-side and only triggering at the first normal request for a page avoids that, hides the implementation from the end-user, and provides plenty good enough performance and functionality.

The only caveat I would add is that it would be nice to have an API and user-accessible toggles so that I could have a user script that auto-runs the request for OCR on page load whenever I open a new (redlink) Page: page. For some works where you are working through the pages sequentially, eliminating manual clicks has a dramatic effect on both perceived and real productivity. Iff you need to run OCR in a particular work you'll often need it on every page, and not having to wait for page load + move mouse and click + wait for OCR load + get to work on proofreading, is going to be a good optimization multiplied by the number of pages in the book. My own header pre-fill script works like that and it's a marvellous improvement. I wouldn't think that would pass a cost—benefit assessment for being in the CommTech developed tool, but it would be very nice if there were the hooks available for community tweaks like that.

The only caveat I would add is that it would be nice to have an API and user-accessible toggles

Just a mw.hook("ocrtool.buttonadded").fire() would be enough, and the user JS can click it when the button appears. Or if it comes in on the initial page load, you can do it with $( function() { $( '#ocr-button' ).click(); } );.

I think it’s important to point out that PDF does not always work, especially for larger files. A single page can range anywhere from 500 KB to over 35 MB. When you factor in the number of pages in a work, you can easily get a PDF that is over 1 GB in size. Currently, only Chunked Uploader can potentially handle that. However, I tried uploading a 1.20 GB PDF over 10 times and failed even with async unchecked. Even going to the stash failed. Now, I’m not calling for removing support for PDF and I know that this is not entirely in scope for this project. However, if we’re discussing how to bulk store OCR, we also need to make sure that users can upload files to OCR. Even Fæ cannot upload some PDFs from IA. So if we are to support bulk OCR, we’d either need to compress PDFs to death like IA does, improve Commons upload to robustly support files of several GB even when dealing with failed or unreliable connections, or develop a container as Xover commented on. A container is a larger project, but one that can have widespread benefits. For example, it would be possible to upload the front and back of a coin as a single entry.

I’m also not sure that storing OCR separate from the actual image is a good idea. It killed the Phe bot and we’d have to be careful to avoid a similar fate. It would create millions of txt files that we’d have to manage. If we did this on a PDF, Commons would still need some system to manage the hundreds of files that result from the OCR of the hundreds of images in a single PDF. If we don’t want to remake the original PDF, then some kind of container is needed to group the PDF with its OCR.

I second that auto-filling and publishing an OCR is a terrible idea. While it’s great to have OCR presented when first editing a text, published OCR creates a headache.