User Details
- User Since
- Jul 25 2022, 11:58 AM (199 w, 10 h)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- Stefan Weil
Tue, May 5
Both models are now installed on https://kraken-ocr.wmcloud.org/. Maybe someone who can read Arabic can test it.
Mon, May 4
After the upgrade to Debian trixie, the next step is upgrading wikimedia-ocr and kraken. This is currently not possible because the 20 GB of disk space is not sufficient, even after removing most log files and lots of Debian packages (including Tesseract OCR). I think that a total capacity of 40 GB would be needed for a full installation with several kraken models.
Recreating the VM with Debian trixie would be fine with me. Maybe this is easier than fixing the current installation.
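A quick way to compare a volume against the 40 GB estimate is sketched below. It is only an illustration: the function name `has_enough_space` and the default path are assumptions, and the threshold is simply the figure mentioned above, interpreted as decimal gigabytes.

```python
import shutil

REQUIRED_GB = 40  # capacity estimate from the comment above (decimal GB assumed)


def has_enough_space(path="/", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least
    `required_gb` gigabytes of total capacity."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.total >= required_gb * 10**9


def summarize(path="/"):
    """Return total, used and free space of the filesystem in decimal GB."""
    usage = shutil.disk_usage(path)
    return {
        "total_gb": usage.total / 10**9,
        "used_gb": usage.used / 10**9,
        "free_gb": usage.free / 10**9,
    }


if __name__ == "__main__":
    print(summarize("/"))
    print("large enough for a full installation:", has_enough_space("/"))
```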
Sun, May 3
I'm afraid that I caused this issue with an update to Debian trixie. Maybe puppet was uninstalled accidentally. I cannot fix it, because the VM no longer accepts my SSH keys, and I have no other access, such as a VNC console.
Jun 26 2024
Meanwhile german_print is the best model from UB Mannheim.
As far as I know only Tesseract (and Kraken as soon as it is available) can produce hOCR output. Transkribus can produce PAGE XML which could be converted to hOCR.
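The conversion from PAGE XML to hOCR mentioned above could look roughly like the following sketch. It is not an existing converter: the function name `page_lines_to_hocr` and the sample document are made up for illustration, only line-level text is handled, and the result is a fragment of `ocr_line` spans rather than a complete hOCR file.

```python
import xml.etree.ElementTree as ET
from html import escape


def _local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def page_lines_to_hocr(page_xml):
    """Convert the TextLine elements of a PAGE XML string into hOCR
    'ocr_line' spans (one per line, with a bounding box)."""
    root = ET.fromstring(page_xml)
    spans = []
    for elem in root.iter():
        if _local(elem.tag) != "TextLine":
            continue
        coords = next((c for c in elem if _local(c.tag) == "Coords"), None)
        if coords is None:
            continue
        # Polygon points look like "x1,y1 x2,y2 ..."; hOCR wants a bounding box.
        points = [tuple(int(v) for v in p.split(","))
                  for p in coords.get("points", "").split()]
        if not points:
            continue
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        # Use the line-level transcription (direct TextEquiv/Unicode child).
        text = ""
        for child in elem:
            if _local(child.tag) == "TextEquiv":
                for uni in child:
                    if _local(uni.tag) == "Unicode":
                        text = uni.text or ""
        bbox = f"bbox {min(xs)} {min(ys)} {max(xs)} {max(ys)}"
        spans.append(f"<span class='ocr_line' title='{bbox}'>{escape(text)}</span>")
    return "\n".join(spans)


# Hypothetical minimal PAGE XML document, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,200 900,200 900,260 100,260"/>
      <TextLine id="l1">
        <Coords points="100,200 900,200 900,260 100,260"/>
        <TextEquiv><Unicode>Hello &amp; welcome</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(page_lines_to_hocr(SAMPLE))
```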
For Tesseract there exist language models, script models and models which are neither for a single language nor for a single script.
Nov 1 2023
Then let me rephrase my bug report: Wikipedia uses incorrect language attributes if a non-default user interface is selected.
Oct 25 2023
Oct 2 2023
Right, but the change would only affect users who are logged in. So it is a fix for Wikipedia authors, not for the majority of "normal" users.
The main content of the page uses the language of the selected Wikipedia, French for fr.wikipedia.org, German for de.wikipedia.org and so on. As long as the HTML tag specifies that language, translation programs will translate that content.
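To make the point about language attributes concrete, here is a small sketch that lists the `lang` attributes found in a page, separating the one on the `<html>` tag from those on other elements. The sample markup is invented for illustration and is not real Wikipedia output; the names `LangCollector` and `collect_langs` are likewise hypothetical.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Collect the lang attribute of <html> and of all other elements."""

    def __init__(self):
        super().__init__()
        self.root_lang = None
        self.other_langs = []

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if lang is None:
            return
        if tag == "html":
            self.root_lang = lang
        else:
            self.other_langs.append((tag, lang))


def collect_langs(html_text):
    """Return (lang of the <html> tag, [(tag, lang), ...] for other elements)."""
    parser = LangCollector()
    parser.feed(html_text)
    parser.close()
    return parser.root_lang, parser.other_langs


# Invented sample: French page content with a German user interface element.
SAMPLE = (
    '<html lang="fr"><body>'
    '<div id="content" lang="fr">Bonjour</div>'
    '<nav lang="de">Hauptseite</nav>'
    '</body></html>'
)

if __name__ == "__main__":
    print(collect_langs(SAMPLE))
```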
Sep 28 2023
Meanwhile, Kraken is installed and configured, and the web service is online.
Sep 20 2023
Thank you very much.
Sep 19 2023
@Samwilson, it looks like Wikimedia OCR currently does not handle more than a single OCR process at the same time. Is that correct? Doesn't that cause long waits if the service is used heavily? Did users complain about slow OCR because of that?
Sep 18 2023
The current implementation offers 3 different models for the text recognition.
Is there a need for non-Latin scripts as well? Which ones? Arabic? Hebrew? Others?
A virtual machine for tests with kraken should provide at least 4 VCPUs, 8 GiB RAM, 8 GB storage (minimum values). More VCPUs allow more parallel processing.
Aug 30 2023
Temporarily disabling this check for PHP 8.2 seems like a good idea to me. After all, it will still be run for PHP 8.1 and older versions.
Aug 28 2023
You could use only the models from tessdata. They support the legacy OCR engine and include a fast model (derived from tessdata_best) for the LSTM OCR engine.
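As an illustration of the two engines, the sketch below builds Tesseract command lines that select either the legacy engine (`--oem 0`) or the LSTM engine (`--oem 1`) and request hOCR output. The helper name `build_tesseract_cmd`, the file names and the language code are placeholders; the commands are only constructed here, not executed.

```python
def build_tesseract_cmd(image, output_base, lang, legacy=False):
    """Build a tesseract command line.

    legacy=True selects the legacy OCR engine (--oem 0), which requires a
    model that still contains legacy data, such as those from tessdata.
    legacy=False selects the LSTM engine (--oem 1).
    The trailing 'hocr' config file requests hOCR output.
    """
    oem = "0" if legacy else "1"
    return ["tesseract", image, output_base, "-l", lang, "--oem", oem, "hocr"]


if __name__ == "__main__":
    # Placeholder file names and language code.
    print(" ".join(build_tesseract_cmd("page.png", "page", "deu", legacy=True)))
    print(" ".join(build_tesseract_cmd("page.png", "page", "deu")))
    # To actually run one of them: subprocess.run(cmd, check=True)
```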
Aug 16 2023
Maybe it would be better to skip Bullseye and directly go to the current stable Bookworm?
Meanwhile Debian Bookworm is the current stable version. It comes with PHP 8.2.
Aug 13 2023
A Tesseract model can be trained either with artificial data (which requires Old English texts and fonts) or with real page images and matching transcriptions. Do you have such data? If yes, I could try to train a Tesseract model.
Jul 25 2022
Don't hesitate to ask me (the author of the mentioned models) if any open questions remain.
