Both the Google Cloud Vision API and Tesseract allow multiple languages to be specified when processing an image's text, which helps make the OCR more accurate.
- Google: https://cloud.google.com/vision/docs/languages
- Tesseract: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html (the language data is installable as individual Debian packages; many, possibly all, of these are already installed on Toolforge)
Only Tesseract provides a way to retrieve its supported languages dynamically. For Google, there is just the list on the page above.
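For Tesseract, one way to get the list dynamically is to parse the output of `tesseract --list-langs`. The following is only a rough sketch: the function names are made up for illustration, and it assumes the output is a header line starting with "List of available languages" followed by one code per line. Some older Tesseract versions print the list to stderr rather than stdout, so the sketch falls back to stderr when stdout is empty.

```python
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes a header line followed by one code per line (hypothetical helper).
    """
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.startswith("List of available languages"):
            continue
        codes.append(line)
    return sorted(codes)


def tesseract_languages(binary: str = "tesseract") -> list[str]:
    """Run the Tesseract binary and return the language codes it reports."""
    result = subprocess.run(
        [binary, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Prefer stdout; fall back to stderr for versions that print the list there.
    return parse_list_langs(result.stdout or result.stderr)
```

Note that the list can include entries that are not languages as such (for example `osd`), so the result would probably need filtering before being shown to users.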
Currently, we just use a Wikisource's content language as the OCR language, but this is not optimal for pages containing multiple languages, nor for Multilingual Wikisource.
The language codes for the two engines differ, so we'll have to map them to some sort of common system.
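As a starting point for discussion, the mapping could be a simple lookup table keyed by whatever common code we settle on. The sketch below is an assumption, not a proposal for the final design: the table name, the function names, and the choice of two-letter keys are all hypothetical, and only three languages are filled in. Each entry would need checking against the two lists linked above. It also assumes Tesseract's convention of joining multiple languages with `+` for its `-l` option.

```python
# Illustrative and deliberately incomplete: common code -> per-engine code.
# Every entry should be verified against each engine's own language list.
LANGUAGE_MAP = {
    "en": {"google": "en", "tesseract": "eng"},
    "fr": {"google": "fr", "tesseract": "fra"},
    "de": {"google": "de", "tesseract": "deu"},
}


def engine_codes(engine: str, langs: list[str]) -> list[str]:
    """Translate common language codes into the codes used by one engine."""
    codes = []
    for lang in langs:
        entry = LANGUAGE_MAP.get(lang)
        if entry is None or engine not in entry:
            # Fail loudly rather than silently dropping a requested language.
            raise ValueError(f"No {engine} code known for language {lang!r}")
        codes.append(entry[engine])
    return codes


def tesseract_lang_arg(langs: list[str]) -> str:
    """Build the value for Tesseract's -l option, e.g. 'eng+fra'."""
    return "+".join(engine_codes("tesseract", langs))
```

For example, `tesseract_lang_arg(["en", "fr"])` gives `eng+fra`, while `engine_codes("google", ["en", "fr"])` gives a list that could be passed to Google as language hints.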