|Resolved||Tpt||T208711 Update the "tesseract-ben" package on Toolforge for OCR on Bengali Wikisource|
|Resolved||bd808||T215693 Upgrade Toolsforge Tesseract package to v4 on Stretch from backports|
- Mentioned Here
- T215693: Upgrade Toolsforge Tesseract package to v4 on Stretch from backports
- T118643: Backport newer tesseract-ocr-* packages from sid
- T199271: Upgrade the tools gridengine system
- T117711: OCR scripts need updating at tools labs by updating the "tesseract-ben" package
- T167566: Update the "tesseract-ben" package on Toolforge for OCR on Bengali Wikisource
- Jay to @Tpt
Can I know what the blocker is for T208711? Ubuntu 18.04 already has the package (https://packages.ubuntu.com/bionic/tesseract-ocr-ben). """ sudo apt-get install --only-upgrade tesseract-ocr; sudo apt-get install tesseract-ocr-ben """ Would that not be enough? I want to work on the upgrade. Can you brief me on the workflow for updating?
- @Tpt to Jay
The problem is that Wikimedia Toolforge, the hosting platform we use for Wikisource-related tools, is still running Ubuntu 14.04. So an "apt-get update" is not going to magically solve our problem. One possibility is compiling the newest Tesseract version on Toolforge, or using a VM or another host. What we should maybe do first is talk about it with the WM Cloud people. https://wikitech.wikimedia.org/wiki/Help:Toolforge https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS https://wikitech.wikimedia.org/wiki/Help:Toolforge#Contact
- Jay to @Tpt
So I think we can't do anything. As per https://wikitech.wikimedia.org/wiki/News/Trusty_deprecation, the WM Cloud team is moving to Debian Stretch (version 9). Do you have a migration plan for phetools?
- @Tpt to Jay
Yes, they are planning to switch to Stretch. But for the shared hosting environment (Toolforge), "Concrete plans will be shared in the future, we are working on developing a timeline." So, not anytime soon. I believe that Stretch also does not provide the latest version of Tesseract. Building Tesseract ourselves is probably the easiest way to proceed if we want to stay on Wikimedia labs. If we host the tool somewhere else, we could just install and use Ubuntu 18.04.
Toolforge will not be getting Ubuntu 18.04 servers at any point. The SRE team at the Foundation decided several years ago to move from Ubuntu to Debian for our servers. The good news, however, is that, as mentioned in T208711#4780257, the Cloud Services team is actively working on a project that will bring Debian Stretch servers to Toolforge (T199271). At the current rate of progress, I would expect the new Stretch grid nodes to be available for use in early January at the latest (and hopefully several weeks before that).
Debian Stretch will be able to provide tesseract-ocr-ben 4.00~git28-f7a4c12-1~bpo9+1 easily (it is already in the upstream backports repository). If a newer version than that is needed, we would need to do something like T118643: Backport newer tesseract-ocr-* packages from sid.
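For anyone following along, the backports route could look roughly like the sketch below. It is only an illustration, not the Cloud Services team's actual procedure: it assumes the `stretch-backports` repository is already enabled in APT's sources, the helper names `is_stretch_backport` and `backports_install_cmd` are hypothetical, and the script only prints the install command instead of running it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any Toolforge tooling.

# Return success if a Debian version string carries the "~bpo9+" suffix,
# the usual marker for a package built for Debian 9 (Stretch) backports.
is_stretch_backport() {
    case "$1" in
        *"~bpo9+"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print (do not execute) the install command. "-t" selects the target
# release; this assumes stretch-backports is already configured in APT.
backports_install_cmd() {
    printf 'apt-get install -t stretch-backports %s\n' "$1"
}

# Version string quoted in the discussion above.
version='4.00~git28-f7a4c12-1~bpo9+1'

if is_stretch_backport "$version"; then
    backports_install_cmd tesseract-ocr-ben
fi
```

On a real Stretch host, the printed command would be run with root privileges; here it is left as a dry run so that nothing gets installed by accident.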