
Update the "tesseract-ben" package on Toolforge for OCR on Bengali Wikisource
Closed, Resolved · Public

Description

A new version of the Bengali Tesseract trained data was released on March 22, 2018. The package needs to be updated for Bengali Wikisource.

Previously this was done by @Tpt (see T167566, T117711) and @Phe.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 5 2018, 10:13 AM
Jayprakash12345 added a subscriber: SrishtiSethi. · Edited · Nov 28 2018, 7:57 AM
  1. Jay to @Tpt
May I know what the blocker for T208711 is? Ubuntu 18.04 already has the package: https://packages.ubuntu.com/bionic/tesseract-ocr-ben.
"""
sudo apt-get install --only-upgrade tesseract-ocr

sudo apt-get install tesseract-ocr-ben
"""
Would that not be enough? I would like to work on the upgrade. Could you brief me on the update workflow?
  1. @Tpt to Jay
The problem is that Wikimedia Toolforge [1], the hosting platform we are using for tools related to Wikisource, is still running Ubuntu 14.04. So an "apt-get upgrade" is not going to magically solve our problem. One possibility is compiling the newest Tesseract version on Toolforge; others are using a VM [2] or another host.
What we should maybe do first is talk about it with the WM Cloud people [3].

[1] https://wikitech.wikimedia.org/wiki/Help:Toolforge
[2] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS
[3] https://wikitech.wikimedia.org/wiki/Help:Toolforge#Contact
  1. Jay to @Tpt
So I think we can't do anything. As per https://wikitech.wikimedia.org/wiki/News/Trusty_deprecation, the WM Cloud team is moving to Debian Stretch (version 9).

Do you have a migration plan for phetools?
  1. @Tpt to Jay
Yes, they are planning to switch to Stretch. But for the shared hosting environment (Toolforge), "Concrete plans will be shared in the future, we are working on developing a timeline." So, not anytime soon.

I believe that Stretch does not provide the latest version of Tesseract either. Building Tesseract ourselves is probably the easiest way to proceed if we want to stay on Wikimedia labs. If we host the tool somewhere else, we could just install and use Ubuntu 18.04.
Jayprakash12345 changed the task status from Open to Stalled. · Nov 28 2018, 7:58 AM
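
The version question raised in this exchange can be checked directly on any given host. The following is a minimal shell sketch, not something taken from phetools: the function name `check_tesseract_ben` is hypothetical, and it only assumes that the standard `tesseract` command-line tool may or may not be on the `PATH`.

```shell
#!/bin/sh
# Hypothetical helper (not part of phetools): report the installed
# Tesseract version and whether the Bengali ("ben") data is available.
check_tesseract_ben() {
    # Bail out early if the tesseract binary is not installed at all.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract: not installed"
        return 1
    fi
    # Older releases print the version on stderr, so merge the streams.
    tesseract --version 2>&1 | head -n 1
    # --list-langs prints one language code per line after a header line.
    if tesseract --list-langs 2>&1 | grep -qx 'ben'; then
        echo "ben: available"
    else
        echo "ben: missing"
        return 2
    fi
}
```

Calling `check_tesseract_ben` prints the version line followed by the status of the Bengali data, and returns a non-zero status when either the binary or the language data is missing.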

We need Ubuntu 18.04. If the Cloud team can help us, it would be appreciated.

SrishtiSethi added a subscriber: bd808. · Edited · Nov 28 2018, 7:21 PM

@bd808 Ping! I believe you might be able to help or advise @Jayprakash12345 on how to resolve the blocker here.

We need Ubuntu 18.04. If the Cloud team can help us, it would be appreciated.

Toolforge will not be getting Ubuntu 18.04 servers at any point. The SRE team at the Foundation decided several years ago to move from Ubuntu to Debian for our servers. The good news, however, is that as mentioned in T208711#4780257 the Cloud Services team is actively working on a project that will bring Debian Stretch servers to Toolforge (T199271). With the current rate of progress I would expect the new Stretch grid nodes to be available for use in early January at the latest (and hopefully several weeks before that).

Debian Stretch will be able to provide tesseract-ocr-ben 4.00~git28-f7a4c12-1~bpo9+1 easily (already in the backports repository upstream). If a newer version than that is needed, we would need to do something like T118643: Backport newer tesseract-ocr-* packages from sid to get newer packages.
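
For illustration, installing a package from a backports suite with APT might look like the sketch below. It is an assumption-laden example rather than the procedure the Cloud Services team used: the helper name `backports_install_cmd` is hypothetical, the suite name `stretch-backports` is assumed to be already configured in the APT sources, and the command is only printed, in simulation mode (`-s`), rather than executed.

```shell
#!/bin/sh
# Hypothetical helper: print the apt-get command that would pull the
# Tesseract packages from a backports suite. Nothing is installed here.
backports_install_cmd() {
    # Target release to install from; defaults to stretch-backports (assumed).
    suite="${1:-stretch-backports}"
    # -s simulates the installation, -t selects the target release.
    echo "apt-get -s install -t ${suite} tesseract-ocr tesseract-ocr-ben"
}

backports_install_cmd
```

On a host where you have root access, dropping `-s` and running the printed command with `sudo` would perform the actual installation.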

Tpt added a comment. · Nov 29 2018, 9:07 PM

@bd808 Thank you very much for having a look at it! I believe that waiting for the Stretch deployment is indeed the easiest way to go.

The Stretch grid is now available. I have just opened a task to request an upgrade of the Tesseract packages to the latest version from Stretch backports: T215693

Once that is done, it should be easy to update phetools to use it.

Tpt closed this task as Resolved. · Mar 12 2019, 8:21 PM
Tpt claimed this task.

Done a few weeks ago