Implement an Internet Archive-like digitalization service
Open, Needs Triage, Public

Description

Like many other Wikisource users, I greatly appreciate the Internet Archive digitalization service, and I use it as fully as I can. DjVu files are only one of the many uses of the rich file set that can be downloaded; the collection of high-resolution jp2 images and the ABBYY XML are extremely interesting.

I'd like MediaWiki to implement a similar digitalization environment, but with a wiki approach and a Wikisource-oriented philosophy. It would share the best possible applications for pre-OCR processing of book page images (splitting, rotating, cropping, dewarping; in brief, "scantailoring" the images) and save excellent lossless images from that pre-OCR work. Then the best possible OCR should be run, with the ABBYY OCR engine or similar software if any exists, saving both the text and the full-detail OCR XML. The excellent images and the best possible OCR text should then be used to produce excellent searchable PDF and DjVu files. Finally, and this step would be really "wiki", the embedded text should be corrected through the usual user revision work done in Wikisource.

This is a bold dream; a less bold idea is to get full access to the best, heavy IA files (jp2.zip and ABBYY XML) and to build tools that use them as thoroughly as possible.

--Alex brollo (talk) 07:08, 11 November 2015 (UTC)
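
As an illustration of the "less bold idea" above, here is a minimal Python sketch that builds download URLs for the two heavy files named in the proposal. It assumes the common Internet Archive layout https://archive.org/download/<identifier>/<filename>, and it assumes the derivative files are named <identifier>_jp2.zip and <identifier>_abbyy.gz; individual items may differ, so a real tool should check the item's file list first. The function name derivative_urls and the identifier used below are hypothetical.

```python
from typing import Dict
from urllib.parse import quote

# Base of the Internet Archive download endpoint.
IA_DOWNLOAD_BASE = "https://archive.org/download"

# ASSUMPTION: conventional suffixes of the derivative files on book items.
# Not every item has these files, and names can vary per item.
DERIVATIVE_SUFFIXES = {
    "page_images": "_jp2.zip",  # high-resolution JPEG 2000 page images
    "ocr_xml": "_abbyy.gz",     # gzipped full-detail ABBYY OCR XML
}


def derivative_urls(identifier: str) -> Dict[str, str]:
    """Return candidate download URLs for an item's heavy derivative files.

    Only builds strings; it does not contact archive.org or verify
    that the files exist.
    """
    if not identifier or "/" in identifier:
        raise ValueError("identifier must be a non-empty name without '/'")
    safe = quote(identifier, safe="")
    return {
        kind: f"{IA_DOWNLOAD_BASE}/{safe}/{safe}{suffix}"
        for kind, suffix in DERIVATIVE_SUFFIXES.items()
    }


if __name__ == "__main__":
    # "examplebook00auth" is a made-up identifier used only for illustration.
    for kind, url in derivative_urls("examplebook00auth").items():
        print(kind, url)
```

The sketch only constructs URLs. Fetching, unpacking the jp2 archive, and parsing the OCR XML would be separate steps of any such tool.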

This card tracks a proposal from the 2015 Community Wishlist Survey: https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey

This proposal received 19 support votes, and was ranked #44 out of 107 proposals. https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Wikisource#To_implement_a_Internet_Archive-like_digitalization_service

Event Timeline

DannyH created this task. Dec 8 2015, 12:19 AM
DannyH raised the priority of this task from to Needs Triage.
DannyH updated the task description.
DannyH moved this task to Wishlist 51-on on the Community-Wishlist-Survey-2015 board.
DannyH added a subscriber: DannyH.
Restricted Application added a project: Internet-Archive. Dec 8 2015, 12:19 AM
Restricted Application added subscribers: StudiesWorld, Aklapper.
IMPORTANT: If you are a community developer interested in working on this task: The Wikimedia Hackathon 2016 (Jerusalem, March 31 - April 3) focuses on #Community-Wishlist-Survey projects. There is some budget for sponsoring volunteer developers. THE DEADLINE TO REQUEST TRAVEL SPONSORSHIP IS TODAY, JANUARY 21. Exceptions can be made for developers focusing on Community Wishlist projects until the end of Sunday 24, but not beyond. If you or someone you know is interested, please REGISTER NOW.
Restricted Application added a subscriber: JEumerus. Jan 21 2016, 2:51 PM
DannyH renamed this task from Implement an Internet Archive-like digitalization to Implement an Internet Archive-like digitalization service. Feb 6 2016, 12:06 AM
DannyH updated the task description.
DannyH set Security to None.
Yann added a subscriber: Yann. Mar 11 2016, 10:22 AM
Yann added a comment. Mar 11 2016, 11:00 AM

This is needed even more now, as IA has stopped creating DjVu files. See https://archive.org/post/1053214/djvu-files-for-new-uploads