The tool currently has two OCR 'engines': Tesseract and the Google Cloud Vision API. The former runs directly on the same web server as the tool, and the latter is used by sending API requests to the external service. Integrating Transkribus will be similar to integrating the Google service, but with a different API structure. This task is to determine what that API usage will look like. (Note that this task is not concerned with modifying the OCR Tool's own API.)
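As a rough illustration of how a third engine might sit alongside the existing remote one, here is a minimal Python sketch. Every name in it (`OcrEngine`, `RemoteEngine`, `fake_send`) is hypothetical and is not the OCR Tool's actual code; the remote call is a stub, so nothing about the real Transkribus or Google request format is assumed.

```python
from abc import ABC, abstractmethod
from typing import Callable


class OcrEngine(ABC):
    """Hypothetical common interface shared by all engines."""

    @abstractmethod
    def get_text(self, image_url: str, lang: str) -> str:
        ...


class RemoteEngine(OcrEngine):
    """An engine that delegates to an external service via an injected sender.

    `send` stands in for whatever HTTP calls the real service needs; the
    request dict used here is an assumption, not any real API's shape.
    """

    def __init__(self, name: str, send: Callable[[dict], str]) -> None:
        self.name = name
        self._send = send

    def get_text(self, image_url: str, lang: str) -> str:
        return self._send({"image_url": image_url, "lang": lang})


def fake_send(request: dict) -> str:
    # Illustration-only stub; a real sender would perform HTTP requests.
    return f"text for {request['image_url']} ({request['lang']})"


engines = {
    "google": RemoteEngine("google", fake_send),
    "transkribus": RemoteEngine("transkribus", fake_send),
}
result = engines["transkribus"].get_text("https://example.org/page.jpg", "en")
```

Keeping the sender injectable means the Transkribus-specific request sequence can be worked out separately once the questions below are answered.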
Initial ideas and questions:
- Transkribus API docs: https://readcoop.eu/transkribus/docu/rest-api/
- Developers can register their own personal accounts on Transkribus and get an API key to use during development. The OCR Tool will have its own production API key (and all users will operate via that).
- The tool has the following data available:
  - A URL of the image, or the image itself. This is generally a scaled-down version about 1000px across, and may also be a pre-cropped part of a larger page.
  - A language code, which we'll map to an existing Transkribus model (or a small number of models?).
- Should all images be uploaded to the same collection? Will we delete them immediately after text extraction is finished?
- How do we handle layout analysis? Can we submit layout regions when submitting the OCR job? Do we need our own way of storing layout data on the Wikimedia side (e.g. similar to the Image-Annotator)?
- Do we ever expect users to interact with the Transkribus UI? This seems unlikely, as we're not expecting them to have their own accounts.
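
The language-code-to-model mapping mentioned above could look something like the following Python sketch. This is only an illustration: the model IDs are placeholders and not real Transkribus model identifiers, `LANG_TO_MODELS` and `models_for_language` are hypothetical names, and falling back from a regional code such as `de-AT` to `de` is an assumption that still needs to be decided.

```python
from typing import Dict, List

# Placeholder values only: these are NOT real Transkribus model IDs.
LANG_TO_MODELS: Dict[str, List[int]] = {
    "en": [1001],
    "de": [2001, 2002],  # a language may map to a small number of models
}


def models_for_language(lang: str) -> List[int]:
    """Return candidate model IDs for a language code, or raise if unmapped."""
    # Assumption: regional variants ("de-AT") share the base language's models.
    base = lang.lower().split("-")[0]
    try:
        return list(LANG_TO_MODELS[base])
    except KeyError:
        raise ValueError(f"No Transkribus model mapped for language '{lang}'") from None
```

For example, `models_for_language("de-AT")` would return both placeholder German models, while an unmapped code raises a `ValueError` so the tool can report that the language is unsupported by this engine.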