
Add a "Bulk OCR" feature to Index Pages on Wikisource
Open, Needs Triage · Public · Feature

Description

Feature summary (what you would like to be able to do and where):
Currently, there is no bulk OCR tool with which we can easily OCR all the pages of a given text. OCR4wikisource has not worked for a long time.

I think it would be good to have an option to OCR all the pages in a given Index from the Index page itself.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
While the current OCR feature works in many cases, OCRing every single page before proofreading it still adds extra work.

Benefits (why should this be implemented?):
This feature would be especially useful during campaigns or activities in which newcomers are starting to edit Wikisource. They would not need to worry about choosing the correct engine or language and could focus only on the key task: proofreading. This should speed up proofreading by removing the need to run OCR on every page.

Event Timeline

Restricted Application added a subscriber: Aklapper.

The tool should only OCR non-existing pages and not overwrite existing pages.

JWheeler-WMF subscribed.

Love this idea; however, it's a feature optimization for OCR, which is currently in passive maintenance.

One way to implement this would be to add the 'Extract text' button to the Index page (e.g. in the indicators area, which is where the WS Export button is on other pages). When it's clicked, the non-existing pages in the pagelist could be processed one by one (or perhaps three at a time, or some other small batch), and progress would be shown to the user as the redlinked pages change to blue links with pink 'needs proofreading' backgrounds.

This would mean that a particular user would be associated with every OCR'd page that's saved; that explicit action would have to be taken for the bulk OCR to run; and that the user would have to keep the page open while it was running. That last point might be annoying after a while, but it also means we wouldn't have to build e.g. a queue management system with some way of cancelling jobs etc., which I think could be more complicated.
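Purely as a sketch of how that front-end flow could look: the list of missing Page: titles and their scan image URLs is assumed to come from some ProofreadPage lookup (not shown), and the OCR endpoint and its parameters mirror the Wikimedia OCR tool at ocr.wmcloud.org but should be treated as assumptions, not a confirmed API.

```typescript
// Hypothetical sketch: bulk-OCR the non-existing pages of an Index, a few at a time.
// The OCR endpoint/params are assumed; getMissingPages()/image URLs are placeholders
// for whatever ProofreadPage actually exposes.

const API = '/w/api.php';
const OCR = 'https://ocr.wmcloud.org/api.php'; // assumed endpoint
const BATCH = 3;                               // small concurrency, as suggested above

async function csrfToken(): Promise<string> {
  const r = await fetch(`${API}?action=query&meta=tokens&format=json`);
  return (await r.json()).query.tokens.csrftoken;
}

async function ocrAndSave(title: string, imageUrl: string, token: string): Promise<void> {
  // Ask the OCR service for raw text of this page's scan image.
  const ocr = await fetch(`${OCR}?engine=tesseract&image=${encodeURIComponent(imageUrl)}`);
  const text = (await ocr.json()).text;

  // Save it as a new Page:; createonly ensures existing pages are never overwritten.
  const body = new URLSearchParams({
    action: 'edit', format: 'json', title,
    text, createonly: '1', token,
    summary: 'Bulk OCR (not proofread)',
  });
  await fetch(API, { method: 'POST', body });
}

async function bulkOcr(missing: { title: string; imageUrl: string }[]): Promise<void> {
  const token = await csrfToken();
  for (let i = 0; i < missing.length; i += BATCH) {
    // Process a small batch in parallel, then move on; the pagelist UI could be
    // refreshed here so red links turn blue as results come in.
    await Promise.all(
      missing.slice(i, i + BATCH).map(p => ocrAndSave(p.title, p.imageUrl, token))
    );
  }
}
```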

I think it is generally a bad idea to bulk-write OCR to Page: namespace pages. It tends to create huge backlogs of very poor quality text that discourages many contributors from working on a text (enWS has a million-page backlog there, and the number is not decreasing). The saved text also quickly becomes out of date when newer OCR engines or models become available.

I would therefore suggest exploring more what the underlying motivation is for wanting this solution.

For example, on enWS one major motivation reported by those requesting similar solutions turned out to be that OCR takes too long to run (and that was before the orders-of-magnitude slower Transkribus became available) and that clicking the button for every page is tedious. This could instead be addressed by speculative execution and caching of OCR results (as Phe-tools OCR did), so that the OCR process becomes functionally instantaneous for the user. A toggle to automatically run OCR (fetching from the cache) on page load could also be offered. This would also help a lot for communities whose major sources of scans do not include a text layer, where the user would otherwise be met with an entirely empty #wpTextBox1.

From a user experience point of view, all the "overhead" on each Page: namespace page where the user is waiting for something feels like wasted time. That's why pre-loading the next page's page image, and EditInSequence (EIS), which switches book pages without a full web page reload and without switching between read and edit mode, are such huge improvements to the user experience. They eliminate the "unproductive" time for humans, letting them concentrate on proofreading the text itself. Having to click and wait up to a minute for OCR is infuriating by the third page; for a multi-hundred-page text it's a wonder most people don't give up and go home.

Another reported motivation was to make large reference texts (dictionaries, encyclopedias, etc.) searchable, even if just as raw text, under a "better than nothing" theory. I personally think that is a very bad idea: you can get raw OCR from the Internet Archive and a million other sites, and serving it on Wikisource a) brings little advantage to readers and b) makes people view Wikisource as a place for poor-quality texts. But if a community wanted to do that, a better way to achieve it would be to make the existing text layer in the scan searchable. It might also be possible to transclude non-existing Page: pages and get their respective text layers instead, as a sort of "virtual text" (preferably in a way that the community could automatically mark as a "placeholder" or "temporary unprocessed" text).

Regarding pre-caching of OCR: we actually attempt to do this already. When the OCR button is clicked while editing, the following page is also requested and cached for an hour — the idea being that when you then go to the next page and click OCR it'll be nice and quick. However, it appears to not be working!
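For reference, a minimal sketch of how that speculative caching is meant to behave, assuming a one-hour TTL as described above; fetchOcr() is a placeholder for whatever actually calls the OCR backend, not the tool's real implementation.

```typescript
// Minimal sketch of speculative OCR caching: when a page is OCR'd, also request
// the following page and keep the result for an hour, so the next click is
// near-instant. fetchOcr() is a placeholder for the real OCR call.

const TTL_MS = 60 * 60 * 1000; // one hour, matching the behaviour described above
const cache = new Map<string, { text: string; expires: number }>();

declare function fetchOcr(imageUrl: string): Promise<string>;

async function ocrWithPrefetch(currentImage: string, nextImage?: string): Promise<string> {
  // Fire-and-forget prefetch of the following page's OCR.
  if (nextImage && !cache.has(nextImage)) {
    fetchOcr(nextImage)
      .then(text => cache.set(nextImage, { text, expires: Date.now() + TTL_MS }))
      .catch(() => { /* prefetch failures are non-fatal */ });
  }

  // Serve the current page from cache if the cached entry is still fresh.
  const hit = cache.get(currentImage);
  if (hit && hit.expires > Date.now()) return hit.text;

  const text = await fetchOcr(currentImage);
  cache.set(currentImage, { text, expires: Date.now() + TTL_MS });
  return text;
}
```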

The point about dumping in raw OCR text for searching is more tricky, I think. I can really see why it's wanted. I know the text is available on other sites (sometimes… although, hopefully, more and more it's not, because we host things that are nowhere else), but there's definitely something great about being able to search here, find something that's of use, and then be able to improve the quality of the transcription. The IA doesn't let you do that, nor do many other sites, and lots of people want to do it. For example, the Trove service for searching Australian newspapers lets you search uncorrected OCR and volunteer-transcribed text at the same time without distinction, and it has many people who want to help transcribe. For those sorts of works (general reference material, rather than whole books for reading), the corrections most needed are often to names of people and places, dates, and other things that people will be searching for. Confused punctuation, or "tho" instead of "the", isn't really important for that use case.

So I think I'm coming around to thinking that a) we need to make the proofreading flow better as far as speed of OCR goes; but also b) that having the raw OCR dumped into pages is not always a bad thing, and that some works would benefit from it.

I'd like to add a point to the "overhead" comments -- while EditInSequence is intended to reduce much of that overhead, it's currently hampered by several bugs when using it to create pages (notably T340986, where the text layer doesn't appear, forcing the editor to OCR manually, and to a lesser extent T360282 where the index header/footer isn't loaded). I suspect caching OCR would be a nice option to add to EditInSequence, but these bugs ought to be fixed first IMO.