(In T269818, experimental support for selecting a region of an image for OCR was implemented on a test service. In that ticket I asked about supporting multiple regions and was asked to open a new ticket.)
In some works, such as the examples in the use cases below, the generated or externally supplied text layer contains text in which portions of the original layout have become "interleaved", because the OCR tool is not necessarily aware that the original is columnar, tabular or 'block' based in nature. It is time-consuming to de-scramble such text manually.
Feature summary (what you would like to be able to do):
Produce higher quality non-interleaved OCR for works such as those given in the use cases below.
Whilst improved automatic detection/segmentation for images with text layouts such as those given in the use cases (an upstream issue?) would be appreciated, it's my view that some kind of user-assisted region selection/transcription is needed in the Advanced options.
Multiple region selection/recognition/transcription could then be done either as:
- Single (sequential) region selection/processing as at present, but with the option to insert the transcription for that region at the start, the end or a defined insertion point (such as the user's cursor) within the existing or previously transcribed text; or
- Multiple region selections (by a user), which are then processed in an order defined by a list presented in the UI.
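The two approaches above could be sketched roughly as follows. This is only an illustration of the requested behaviour, not how the OCR tool is implemented: `Region`, `insert_text`, `transcribe_regions` and the `ocr_region` callback are all hypothetical names, and the callback stands in for whatever the tool would do to OCR a single cropped region.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Region:
    """A user-drawn rectangle on the page image, in pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def insert_text(existing: str, new: str, position: str = "end",
                cursor: Optional[int] = None) -> str:
    """First approach: place one region's transcription at the start,
    the end, or a cursor offset within the existing text."""
    if position == "start":
        return new + existing
    if position == "end":
        return existing + new
    if position == "cursor":
        if cursor is None or not 0 <= cursor <= len(existing):
            raise ValueError("cursor must be an offset within the existing text")
        return existing[:cursor] + new + existing[cursor:]
    raise ValueError(f"unknown position: {position!r}")


def transcribe_regions(regions: Iterable[Region],
                       ocr_region: Callable[[Region], str],
                       separator: str = "\n\n") -> str:
    """Second approach: OCR each region in the order the user listed them
    and join the results, so columns are never read across each other."""
    return separator.join(ocr_region(region).strip() for region in regions)
```

With two regions drawn over the left and right columns, `transcribe_regions([left, right], ocr_region)` would return the whole left column followed by the whole right column, and reordering the list in the UI would simply change the order of the arguments.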
Steps to reproduce (a list of clear steps to create the situation that made you report this, including full links if applicable):
- https://en.wikisource.org/w/index.php?title=Page:Cary's_New_Itinerary_(1819).djvu/682&action=edit (and other pages in the same work)
- Note that the externally provided OCR text layer has, depending on the options, interleaved the columnar/tabular text.
- Select Transcribe Text, Advanced Options.
- Transcribe.
The resulting transcription does not always respect the multi-column/tabular nature of the content presented, and can sometimes interleave the content of the two columns.
Use case(s) (describe the actual underlying problem which you want to solve, and not only a solution):
Generate improved, non-"interleaved" OCR for works with differing layouts at Wikisource, such as:
- https://en.wikisource.org/wiki/Index:Cary%27s_New_Itinerary_(1819).djvu (multi-column tabular layouts)
- https://en.wikisource.org/wiki/Index:Ruffhead_-_The_Statutes_at_Large,_1763.djvu (side-by-side content for original/"translated" material, with sidenotes)
- https://en.wikisource.org/wiki/Page:Wisconsin_Rapids_directory_(1921).djvu/40 (two columns with block advertisements in differing orientations)