As a Wikisource user, I want the team to investigate the "New OCR Tool" wish, so that they can weigh the options and risks of this top Wikisource request.
Background: In the 2020 Community Wishlist Survey, the #2 wish was "New OCR Tool." We may not be able to create a new OCR tool, but we can probably improve existing tool(s). The purpose of this spike is to investigate potential options.
Relevant Resources:
- 2020 Community Wishlist Survey: New OCR Tool
- OCR Tool Comparison
- Phabricator tickets: T161978, T228594
Acceptance Criteria:
- Review the 2020 wish & relevant Phabricator tasks (see links above)
- Review available OCR tools/options (e.g., OCR Gadget, IndicOCR, Google OCR, OCR4Wikisource, hOCR, Internet Archive)
- Investigate various options outlined in the All Hands brainstorms doc
- Determine whether we can create a new OCR tool (so we know what to tell the community about this option)
- Determine whether we can improve the existing tools and, if so, what the improvements could be
- Provide an analysis of potential risks associated with this project from a technical perspective
- Provide an analysis of potential dependencies associated with this project from a technical perspective
- Provide a rough estimate of the difficulty and effort required by this project