
Tool to Replace OCR or Image
Open, Needs Triage, Public

Description

Rationale: One of the assumptions of Wikisource is that the page images and the OCR text will come together in a single PDF or DjVu file. This is not the case in several scenarios:

  1. The image files and the text are provided as separate files.
  2. The original image file may be damaged or of inferior quality.
  3. A proofread text with page numbers may exist at an external site, such as Project Gutenberg.

Solution: Develop a tool that allows the OCR text or the images to be loaded from a separate file.

Requirements:

  1. Allow users to select one file for the images and another for the OCR text. The tool should match the individual images to the text files in a visual layout similar to Book2Scroll.

a. In the simplest case, the text files would be a series of sequentially numbered files corresponding exactly to the image files, e.g. 1.txt … n.txt and 1.png … n.png.
b. In a more complex case, users will need to set custom ranges to match the images to the OCR, similar to the Pages tool on the Index page for a book: <pagelist 1to2=skip 3="1" 4to8=skip 9="2" 415to420="skip" 416="400" />. In this example, images 1.png and 2.png have no corresponding text files; image 3.png corresponds to 1.txt; images 4.png through 8.png have no corresponding text files; image 9.png corresponds to 2.txt and begins a sequence that runs until the next change; and the text for image 416.png is 400.txt. (See the sketch after this list.)
c. The most complex case would be an HTML file with page numbers. The parser would need to split the HTML into separate txt files, convert the HTML to wikicode, and then run step b. See http://www.gutenberg.org/files/64649/64649-h/64649-h.htm as an example. There are about 41,000 files on Project Gutenberg with the page numbers marked with class="pagenum" id="Page_19".

  2. The tool should also use the same code to allow users to either replace the image files or add a second set of image files (while keeping the existing text). This can help when a higher-quality set of images becomes available, or when multiple versions are needed due to damage or illegible text. A special case would be an option to import the original files from IA when they are needed to extract images or illustrations. This is part of this proposal because it will reuse much of the same code.
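
As a rough illustration of how the range matching in requirement 1b could work, here is a minimal Python sketch. It is not a proposed implementation: the <pagelist>-style syntax is taken from the example above, while the function names and the "a later directive overrides an earlier one" rule for overlaps (e.g. 416 inside 415to420=skip) are assumptions for discussion.

```python
import re
from typing import Dict, List, Optional, Tuple

# Pagelist-style mapping string copied from requirement 1b above.
PAGELIST = '<pagelist 1to2=skip 3="1" 4to8=skip 9="2" 415to420="skip" 416="400" />'


def parse_directives(pagelist: str) -> List[Tuple[int, int, str]]:
    """Extract (start, end, value) directives in order of appearance.
    `value` is either "skip" or the text-file number assigned to `start`."""
    directives = []
    for m in re.finditer(r'(\d+)(?:to(\d+))?\s*=\s*"?([^"\s/>]+)"?', pagelist):
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        directives.append((start, end, m.group(3)))
    return directives


def build_mapping(pagelist: str, n_images: int) -> Dict[int, Optional[int]]:
    """Map every image number 1..n_images to a text-file number, or None for "skip".

    A numeric directive starts a counting sequence that runs until the next
    directive's start; directives later in the tag override earlier ones, so
    416="400" wins over 415to420=skip for image 416.  How the rest of an
    overridden range (417-420 here) should behave is left to the real tool.
    """
    directives = parse_directives(pagelist)
    starts = sorted(s for s, _, _ in directives)
    mapping: Dict[int, Optional[int]] = {}
    for start, end, value in directives:
        if value == "skip":
            for img in range(start, end + 1):
                mapping[img] = None
        else:
            # The sequence runs from this anchor until another directive begins.
            stop = next((s for s in starts if s > start), n_images + 1)
            for img in range(start, min(stop, n_images + 1)):
                mapping[img] = int(value) + (img - start)
    return mapping


if __name__ == "__main__":
    mapping = build_mapping(PAGELIST, n_images=450)
    print(mapping[3], mapping[5], mapping[10], mapping[416])  # 1 None 3 400
```

Run against the example above, this maps image 3 to 1.txt, image 10 to 3.txt (the sequence started at 9="2"), and image 416 to 400.txt, while skipped images resolve to None.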

Event Timeline

Rationale: One of the assumptions of Wikicommons is that image and OCR will come as a single file in either a PDF or DJVU file.

Hi @Languageseeker, I'm afraid I cannot follow... Is "Wikicommons" Wikimedia Commons, or is that some existing tool somewhere? How can an OCR come in a file? Isn't OCR a process to extract text from an image file? Could you elaborate on the problem to solve here, and provide more specific hints, please? Thanks! :)

Sorry, I meant to say Wikisource and not Wikicommons.

I'm using OCR as a broad term for the "conversion of images of typed, handwritten or printed text into machine-encoded text". My basic premise is that there may be other sources for the text besides the layer generated by a program in a PDF or DjVu file.

For instance,

  1. A museum or library can donate a manuscript along with its transcription in separate text files. On Wikisource, we would want to retain the original images and match them with the transcribed text. This can be done manually, but quickly becomes very burdensome.
  2. I'm importing a file that Distributed Proofreaders is working on. Since the texts are in the public domain, I want to replace the unproofread OCR generated in the PDF with the proofread text from PGDP. Once again, manually replacing several hundred pages of machine-generated OCR is quite burdensome.
  3. There is a text on Project Gutenberg from a specific edition with page numbers. I want to match the text with the original images on Wikisource; the tool would allow me to do so (see the splitting sketch at the end of this comment).
  4. I'm proofreading a text, but the bottom corner of page 5 was eaten by a goat. There is another copy of the book in which page 5 is intact; I wish to replace the goat-eaten version with the complete version without having to create a new project.
  5. There is a text with a lot of fading. A second copy exists that is also faded; I wish to be able to flip between the two versions to read the illegible text.
  6. I'm proofreading a text, but the PDF version is quite bad: the book was printed in a small font and the PDF makes the text extremely hard to read. I wish to import the original files from IA without losing the OCR text from the PDF.
  7. I'm working on adding illustrations to a book and want to use the original scan from IA instead of the lower-quality PDF copy, so I would use the tool to import the complete original file.

These are a few scenarios that come to mind. Let me know if I can make things clearer.
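
To make the Gutenberg case (requirement 1c in the description, scenario 3 above) concrete, here is a rough Python sketch that splits such an HTML file at the class="pagenum" markers and writes one plain-text file per printed page. BeautifulSoup, the file names, and the attribute order assumed in the marker regex are illustrative assumptions; converting each chunk to wikicode and aligning the printed page numbers with the scan would be separate steps.

```python
import re
from typing import Dict

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Matches markers like <span class="pagenum" id="Page_19">[Pg 19]</span>,
# assuming the attributes appear in this order (as in the task description).
PAGENUM_RE = re.compile(
    r'<[^>]*class="pagenum"[^>]*id="Page_([^"]+)"[^>]*>.*?</[^>]+>',
    re.IGNORECASE | re.DOTALL,
)


def split_by_pagenum(html: str) -> Dict[str, str]:
    """Return {printed page number: plain text of that page}.

    Front matter before the first marker is discarded, and tags are simply
    stripped; converting each chunk to wikicode would be a separate step.
    """
    pieces = PAGENUM_RE.split(html)
    # pieces = [front_matter, page_no_1, chunk_1, page_no_2, chunk_2, ...]
    pages: Dict[str, str] = {}
    for page_no, chunk in zip(pieces[1::2], pieces[2::2]):
        text = BeautifulSoup(chunk, "html.parser").get_text()
        pages[page_no] = re.sub(r"\n{3,}", "\n\n", text).strip()
    return pages


if __name__ == "__main__":
    with open("64649-h.htm", encoding="utf-8") as fh:
        pages = split_by_pagenum(fh.read())
    for page_no, text in pages.items():
        with open(f"{page_no}.txt", "w", encoding="utf-8") as out:
            out.write(text)
```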

One of the most immediate benefits of this tool would be to allow us to import the Internet Archive's identifier_jp2.zip, which contains the full-quality page images, instead of relying on the lower-quality PDF files. This will make proofreading easier, and using higher-quality images may also produce better OCR results.
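
For the import itself, a minimal sketch of fetching and unpacking an item's _jp2.zip might look like the following. The identifier, the requests dependency, and the standard archive.org/download URL pattern are assumptions here, not a specification of how the tool would do it.

```python
import zipfile
from pathlib import Path
from typing import List

import requests  # pip install requests

# Hypothetical IA identifier; book items normally expose <identifier>_jp2.zip
# next to the derived PDF/DjVu under archive.org/download/<identifier>/.
IDENTIFIER = "examplebook00item"
URL = f"https://archive.org/download/{IDENTIFIER}/{IDENTIFIER}_jp2.zip"


def fetch_jp2_archive(url: str, dest: Path) -> List[Path]:
    """Download an item's _jp2.zip and unpack the full-quality page images."""
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "pages_jp2.zip"
    # Stream to disk, since these archives can run to hundreds of megabytes.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(zip_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                fh.write(chunk)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(dest.rglob("*.jp2"))


if __name__ == "__main__":
    pages = fetch_jp2_archive(URL, Path(IDENTIFIER))
    print(f"Fetched {len(pages)} page images for {IDENTIFIER}")
```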