Rationale: Wikisource assumes that the page images and the OCR text of a work arrive together in a single PDF or DjVu file. This is not always the case:
- The image files and the text are provided as separate files.
- The original image file may be damaged or of inferior quality.
- A proofread text with page numbers may exist at an external site, such as Project Gutenberg.
Solution: Develop a tool that allows the OCR text or the images to be loaded from a separate file.
Requirements:
- Allow users to select one file for images and another for OCR. The tool should match the individual images to the text files in a visual layout similar to Book2Scroll.
a. In the simplest case, the text files would be a series of sequentially numbered files corresponding exactly to the image files, e.g. 1.txt … n.txt and 1.png … n.png.
b. In a more complex case, users will need to set custom ranges to match the images to the OCR, similar to the Pages tool on the Index page for a book: <pagelist 1to2=skip 3="1" 4to8=skip 9="2" 415to420="skip" 416="400" /> In this case, images 1.png and 2.png have no corresponding text files; image 3.png corresponds to 1.txt; images 4.png to 8.png have no corresponding text files; image 9.png corresponds to 2.txt and begins a sequence that runs until the next change; and the text for image 416.png is 400.txt.
c. The most complex case would be an HTML file with page numbers. The parser would need to split the HTML into separate text files, convert the HTML to wikicode, and then run step b. See http://www.gutenberg.org/files/64649/64649-h/64649-h.htm for an example. About 41,000 files on Project Gutenberg have their page numbers marked in the form class="pagenum" id="Page_19".
- The tool should also use the same code to allow users to either replace the image files or add a second set of image files (while keeping the existing text). This can help when a higher-quality set of images becomes available, or when multiple versions are needed because of damage or illegible text. A special case would be an option to import the original files from the Internet Archive (IA) when they are needed to extract images or illustrations. This is part of this proposal because it will reuse much of the same code.
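
As an illustration of the matching in cases a and b, here is a minimal Python sketch. It is not an existing Wikisource or ProofreadPage API: the function name match_images_to_text, the RULE pattern, and the handling of conflicts (an explicit single-image rule wins over a skip range) are assumptions made for this example. It reads rules written like the pagelist attributes above and returns, for each image number, the number of the matching text file, or None when the image has no text. With no rules at all it falls back to case a, where image n matches n.txt.

```python
import re
from typing import Dict, Optional

# One rule looks like 3="1", 3=1, 1to2=skip or 4to8="skip" (hypothetical syntax
# modelled on the pagelist example in this proposal).
RULE = re.compile(r'(\d+)(?:to(\d+))?="?(\w+)"?')


def match_images_to_text(spec: str, image_count: int) -> Dict[int, Optional[int]]:
    """Map each image number to a text-file number, or None if it has no text."""
    skipped = set()   # image numbers that have no text file
    restarts = {}     # image number -> text number where a new sequence begins
    for first, last, value in RULE.findall(spec):
        start = int(first)
        end = int(last) if last else start
        if value == "skip":
            skipped.update(range(start, end + 1))
        else:
            # Assumption: any non-skip value is a plain integer text number.
            restarts[start] = int(value)

    mapping: Dict[int, Optional[int]] = {}
    next_text = 1  # with no rules, image n simply matches n.txt (case a)
    for image in range(1, image_count + 1):
        if image in restarts:
            # An explicit rule begins a sequence that runs until the next change.
            next_text = restarts[image]
        elif image in skipped:
            mapping[image] = None
            continue
        mapping[image] = next_text
        next_text += 1
    return mapping


if __name__ == "__main__":
    rules = '1to2=skip 3="1" 4to8=skip 9="2"'
    for image, text in match_images_to_text(rules, 11).items():
        target = f"{text}.txt" if text is not None else "(no text)"
        print(f"{image}.png -> {target}")
```

Skipped images do not advance the text counter in this sketch, which is why image 9.png can pick up 2.txt directly after 3.png took 1.txt. A real tool would also need to decide how to treat overlapping rules and non-numeric values, which this example does not attempt.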