Javanese OCR installation steps for Wikisource
Open, Needs Triage, Public

Description

Dear all,

WMID and Universitas Kristen Duta Wacana in Yogyakarta, Indonesia, are partnering to develop an OCR system for Javanese script. It will make it easier for people to transcribe texts written in Javanese script on Wikisource.

They have already developed a closed-source system that reads the characters automatically and produces output that can be copied and pasted. The software is still separate from MediaWiki. That means people have to upload the pages to the software, wait for them to be processed, and copy the OCR result into Wikisource. As far as I know, that is the current workflow.

They want to release their code to the Wikimedia projects under the GNU GPL v2; however, we do not know how to port the code to the Wikisource platform.

Event Timeline

Amire80 subscribed.

Hi @Rachmat04,

Thanks for creating the task.

I'm adding some relevant tags. Hopefully that points the task in the right direction.

In theory, if the software is GPL, we can probably run our own instance of it and create a service that reads it, but I'm really not an expert in how OCR and services work.

> The software is still separate from MediaWiki. That means people have to upload the pages to the software, wait for them to be processed, and copy the OCR result into Wikisource. As far as I know, that is the current workflow.

Community-Tech has previously worked on an OCR gadget for Indic-language Wikisources that uses Google's Vision API (as far as I understand it).

Please see T142770#2639235 and T120788 for more information. I'd hope that an implementation could be shared.

Niharika subscribed.

Unfortunately Google does not support Javanese for its OCR service (also see T140037#2521503). We ran into this issue with an Indic language as well.

If and when Google supports it, this task should be a simple matter of importing the gadget we made (See example) on the wiki.

> Unfortunately Google does not support Javanese for its OCR service (also see T140037#2521503). We ran into this issue with an Indic language as well.

Thank you; however, as far as I know, we would not use the Google API or anything similar to provide OCR for Javanese.

The software developed by the university uses its own system. I have not tested it yet, and I do not have its documentation at the moment, but I would be happy to request it if anyone is interested.

Oh, that's interesting. Do you know if the service has an API that can be used to fetch the OCR text?
The code does not necessarily have to live on a Wikimedia site. We can use the service's API (if it is running somewhere outside) and use a gadget to automatically paste the OCR text into the editor. This is similar to how the gadget I mentioned above works.

@Rachmat04: Do you know if the service has an API that can be used to fetch the OCR text?

@Niharika @Aklapper Hi! I am sorry for the lack of updates regarding this project. Annotating the scripts was very slow, so for the last three months we have been crowdsourcing the process to increase the amount of labelled data. Now that the training data and the prototype are ready, we are back on track with developing the software, and I'd like to return to the question of integrating it so that it can be used with Wikisource.

Since we are still making improvements to the software here and there, we have decided not to put it directly on WMF Labs or a Wikimedia site for now; we will host it on our own server. For your information, the program is built with Python 3 and PHP, and the database of annotated labels is in SQL (we plan to convert it to XML in a later phase so that people outside this project can use it easily).

The program takes images as input and returns plain text as output. Could you please explain how I can fetch the image of each page that a user opens in Wikisource (so that our software can use it)? Is there any documentation about this that I can read?

As for an API to deliver the OCR text to Wikisource, we haven't built it yet, so for now the software just displays the OCR result and the user has to copy and paste it into the Wikisource interface themselves. Do you have any ideas about best practices for implementing this? Do we need to add an authorization layer using wiki accounts so that the OCR text can be fetched directly into the Wikisource interface?

Thank you very much for your help!

> Could you please explain how I can fetch the image of each page that a user opens in Wikisource (so that our software can use it)? Is there any documentation about this that I can read?

Sounds to me like a gadget (list of gadgets on English Wikisource) could be written that adds a link to the sidebar (or similar) on Wikisource to let the user send the image to your tool? As I understand it, running this on your own server instead of using Toolforge or CloudVPS could lead to CORS issues and potential privacy policy issues. Developers, please correct my half-knowledge if I am wrong here.

> Do you have any ideas about best practices for implementing this?

Does the service have an API that can be used to fetch the OCR text?

> Do we need to add an authorization layer using wiki accounts so that the OCR text can be fetched directly into the Wikisource interface?

Not sure if I understand correctly... OAuth might be related here?

> Does the service have an API that can be used to fetch the OCR text?

We are developing the API for it, but we have no idea how to connect to the wiki server.
Is there any guidance on sending data to and requesting data from the wiki server? Do we need a specific account?
I would be very grateful for any information on connecting to the wiki server.

> Is there any guidance on sending data to and requesting data from the wiki server? Do we need a specific account?

@dwika: I'm sorry for my belated answer. Please see https://www.mediawiki.org/wiki/API:Main_page for how to use the MediaWiki API.
If you shared more technical details it would be easier to be more specific and supportive...
Please see https://discourse-mediawiki.wmflabs.org/ for general development questions.
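
For the question of how to fetch the image of a page, one possible starting point is the `imageinfo` property of the MediaWiki Action API. The following is a minimal sketch, not a confirmed recipe for this project: the helper name `buildImageInfoUrl`, the 1024px width, and the use of `iiurlparam` to select one page of a PDF are illustrative assumptions, and the file title is the example file linked elsewhere in this task. `origin=*` is the API parameter for anonymous cross-origin requests.

```javascript
// Sketch: build an Action API URL that asks for a rendered image of one page
// of a multi-page file (PDF/DjVu). Helper name and defaults are hypothetical.
function buildImageInfoUrl(apiBase, fileTitle, pageNumber, width) {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    origin: '*',                 // allow anonymous cross-origin (CORS) requests
    titles: fileTitle,
    prop: 'imageinfo',
    iiprop: 'url',
    iiurlwidth: String(width),
    // Handler-specific thumbnail parameter, e.g. "page7-1024px" for page 7.
    iiurlparam: 'page' + pageNumber + '-' + width + 'px'
  });
  return apiBase + '?' + params.toString();
}

const exampleUrl = buildImageInfoUrl(
  'https://wikisource.org/w/api.php',
  'File:Serat_Babad_Surakarta_Volume_4.pdf',
  7,
  1024
);
console.log(exampleUrl);
```

If the request succeeds, the rendered image URL should be available as `thumburl` inside the `imageinfo` entry of the JSON reply; whether that image is suitable as OCR input would still need to be checked.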

@Rachmat04 / @dwika: Any specific problems that we can help with / technical details that you could share?

@Rachmat04 / @dwika: Any news to share? Is any of you working on this? If yes, do you need any help?

Our problem is how to save UI code on the Wikisource server. The engine runs well on a rented server, and I guess the basic workflow is the same, only running on a different server. The problem is that we do not know the address of the Wikisource server and have no access to put (write/save) our UI (API) code on that server. It seems trivial, but it turns out to be complicated without access.

Our engine, called Cakra OCR, has been running on our experimental server at https://trawaca.id/ocrjawa/?lang=en, with a working example shown below:

chrono_02.png (1×1 px, 935 KB)

An API has also been set up, although we are still fixing bits of it. We've tested the API from an external server, for example from our friend's https://ferianto.id/twclient/sampleajax.php, and it successfully received JPEG images of Javanese manuscripts and returned the OCR result in Unicode.
We're stuck on how to build an interface like this within Wikisource:
chrono_03.png (1×1 px, 295 KB)

where window A sends the material through the Cakra API and the resulting OCR text is sent to window B, just as on our server.

Thanks for the update. https://wikitech.wikimedia.org/wiki/Portal:Toolforge explains how you could host this on Wikimedia Toolforge, if the source code of the Cakra OCR engine is open source. Could you provide a link to the source code of the Cakra OCR engine?

> We're stuck on how to build an interface like this within Wikisource:
>
> chrono_03.png (1×1 px, 295 KB)
>
> where window A sends the material through the Cakra API and the resulting OCR text is sent to window B, just as on our server.

@Mahastama There are typically two components to this kind of OCR system: 1) the OCR engine itself, hosted on a server somewhere and providing an API that can be called over the web, and 2) a JavaScript script, triggered by a button in the Wikisource editor window, that sends the URL of the page image to the OCR server and gets the extracted text back.

For the server component, if you don't already have a suitable server or don't want to host such a server, you can ask for web hosting on Toolforge (see the link Aklapper provided above). Toolforge is a semi-official shared hosting facility for various tools used by volunteers on Wikimedia projects.

The interface that's visible to end users will consist of a JavaScript script triggered by a button in the text editing window. Such scripts can be stored in several locations on MediaWiki wikis, for different purposes and with different ramifications for necessary permissions, ease of use, policy, and so forth. To get started with testing, you can just create the script as a subpage of your user page on the wiki, for example at https://id.wikisource.org/wiki/Pengguna:Mahastama/OCR.js (or another sensible name).

MediaWiki automatically runs JavaScript placed in various pages, some globally for the whole wiki and some just for your user account (and some only when you use a specific user interface skin). Once you have the script, you can tell MediaWiki to load it for your user account by putting importScript('Pengguna:Mahastama/OCR.js'); into a subpage named common.js in your user space (i.e. at https://id.wikisource.org/wiki/Pengguna:Mahastama/common.js). MediaWiki will load your common.js, which will in turn load your OCR.js script.
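
As a rough illustration of what such a common.js page might contain, here is a sketch of a comparable alternative to the `importScript(...)` call: it builds the raw-JavaScript URL of the user script and hands it to `mw.loader.load`. The helper name `rawScriptUrl` is hypothetical, and the loading step is guarded so that it only runs inside MediaWiki, where the global `mw` object exists.

```javascript
// Sketch: build the raw-JavaScript URL of a wiki page so that it can be
// loaded with mw.loader.load(). The helper name is hypothetical.
function rawScriptUrl(wikiOrigin, pageTitle) {
  const params = new URLSearchParams({
    title: pageTitle,
    action: 'raw',
    ctype: 'text/javascript'
  });
  return wikiOrigin + '/w/index.php?' + params.toString();
}

const ocrScriptUrl = rawScriptUrl(
  'https://id.wikisource.org',
  'Pengguna:Mahastama/OCR.js'
);

// Only runs inside MediaWiki, where the global `mw` object exists.
if (typeof mw !== 'undefined') {
  mw.loader.load(ocrScriptUrl);
}
```

The URL form is mainly useful when the script lives on a different wiki than the one where it is loaded; on the same wiki, the one-line `importScript` call shown above is simpler.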

The "OCR.js" script will contain code to add a button to the wiki editor toolbar; attach to that button an event handler that calls a function in your script; extract the URL of the page image from the current page; send that URL to your OCR server; receive the returned text; and replace the contents of the editing box with that text. You can see an example of a script that does this at https://wikisource.org/wiki/MediaWiki:GoogleOCR.js.
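
The steps just listed can be sketched roughly as follows. This is not the GoogleOCR.js gadget and not the Cakra OCR client: the endpoint `https://ocr.example.org/api/ocr`, its `image` query parameter, and the `.prp-page-image img` selector are assumptions made for illustration, and a sidebar link (via `mw.util.addPortletLink`) stands in for a real toolbar button to keep the example short.

```javascript
// Hypothetical endpoint; the real OCR server URL and request format may differ.
const OCR_ENDPOINT = 'https://ocr.example.org/api/ocr';

// Turn the <img> src of the scan (possibly protocol-relative) into an absolute URL.
function resolveImageUrl(src, pageHref) {
  return new URL(src, pageHref).href;
}

// Send the image URL to the OCR server and return the recognized text.
// `fetchFn` is passed in so that the function can be exercised without a browser.
async function requestOcr(imageUrl, endpoint, fetchFn) {
  const reply = await fetchFn(endpoint + '?image=' + encodeURIComponent(imageUrl));
  if (!reply.ok) {
    throw new Error('OCR request failed with status ' + reply.status);
  }
  return reply.text();
}

// Browser-only wiring: runs only where MediaWiki's `mw` and jQuery exist.
if (typeof mw !== 'undefined' && typeof jQuery !== 'undefined') {
  mw.loader.using('mediawiki.util').then(function () {
    jQuery(function () {
      const scan = document.querySelector('.prp-page-image img'); // assumed selector
      const textbox = document.getElementById('wpTextbox1');      // wikitext edit box
      if (!scan || !textbox) {
        return;
      }
      // A sidebar link stands in for a toolbar button in this sketch.
      const link = mw.util.addPortletLink('p-tb', '#', 'OCR (sketch)', 't-ocr-sketch');
      if (!link) {
        return;
      }
      link.addEventListener('click', function (event) {
        event.preventDefault();
        const imageUrl = resolveImageUrl(scan.getAttribute('src'), location.href);
        requestOcr(imageUrl, OCR_ENDPOINT, fetch.bind(window))
          .then(function (text) { textbox.value = text; })
          .catch(function (error) { mw.notify(error.message, { type: 'error' }); });
      });
    });
  });
}
```

Keeping the network call in a small function that receives `fetchFn` makes it possible to try the logic outside the wiki before wiring it to the page.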

Once you have something that works, there are various ways to make it available to other users. One is to have all other users manually load the script the same way you did (other users can load the script from your user page but can't modify it). The easiest method for end users is to turn it into a "Gadget", because gadgets can show up in their preferences as a simple checkbox to enable. This has some policy implications for the wiki on which it is going to be used, and some for our global privacy policy, so you need to be prepared for some discussion etc. at that point.

PS. Community-Tech will be working on a related system this year (not specifically for Javanese, but an OCR system of the same general shape), so it is possible they will want to discuss opportunities for collaboration with you. I believe their idea is to make an OCR utility that can use multiple backends, and it is possible your system could be one of those backends. If that's the case you may not need to worry about making any separate front end / user interface for this.

@Mahastama I have been bold and created the beginning of a script here: https://id.wikisource.org/wiki/Pengguna:Mahastama/OCR.js
I have put it in one of your user subpages so that you can edit it. I hope that is fine with you.
It adds an "OCR" button to pages in the Page: namespace and calls the Trawaca API just as you presented.
However, the Trawaca API currently fails with an "authorization" error.
Could you or one of the OCR developers have a look at it?
To reproduce it, you have to load the script (just as explained by @Xover) and try to run the OCR on any Page: page.

Thanks for the enlightenment, @Aklapper, @Xover and @Tpt :) My team and I have been trying for months to find a way, because we are unfamiliar with the Wikisource site structure (none of us has worked with it before). Many thanks to @Tpt for making a working example; that's exactly what we need. Sorry for the "authorization" error; one of my developers happened to change the API key. I'll take a look at it.
Well, if the tech community is considering making a common interface for calling external OCR APIs, we'll be glad to collaborate.

Thanks guys, it's now up and running for the initial test. Apparently the module we use to accept cross-domain API requests gives each generated API key a very short validity period, which caused the error mentioned before. We have temporarily disabled the validity period check.

FireShot Screen Capture #076 - 'Creating Halaman_PDIKM 700-09 Majalah Aboean Goeroe-Goeroe September 1931_pdf_3 - Wikisource' - id_wikisource_org.png (654×1 px, 222 KB)

The textbox is already showing the response from the Cakra OCR API. However, some things are still open questions:

  1. We haven't tested it yet with a Wikisource document in Javanese script. I am asking the ID crew to point us to an example file.
  2. The data sent from the page is apparently a BLOB, while the API is designed to receive a JPEG file. We may have to look at the process and the data sent to see whether this causes an issue on our side.
  3. The OCR result in Unicode (jv) is contained inside a span tag. Do we have to provide the response in plain text, or is it possible to work around this so that only the content of the span tag is displayed in the textbox? The blank result itself is still under investigation: it may be caused by the absence of Javanese script in the document or by the sent data being a BLOB.
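
On the third point, one option is to leave the API response as it is and strip the wrapper on the client side. Below is a minimal sketch under the assumption that the reply is a small HTML fragment such as `<span lang="jv">...</span>`; the exact markup returned by the Cakra OCR API is not documented in this task, and the function name `extractSpanText` is hypothetical.

```javascript
// Sketch: pull the text out of the first <span> in an HTML fragment.
function extractSpanText(html) {
  if (typeof DOMParser !== 'undefined') {
    // In the browser, let the HTML parser do the work.
    const doc = new DOMParser().parseFromString(html, 'text/html');
    const span = doc.querySelector('span');
    return (span ? span.textContent : doc.body.textContent).trim();
  }
  // Fallback outside the browser: a simple pattern match that is adequate
  // only for flat markup such as <span lang="jv">...</span>.
  const match = /<span\b[^>]*>([\s\S]*?)<\/span>/i.exec(html);
  const inner = match ? match[1] : html;
  return inner.replace(/<[^>]*>/g, '').trim();
}

// Two Javanese letters wrapped in the assumed markup.
const sample = '<span lang="jv">\uA9B2\uA9A4</span>';
console.log(extractSpanText(sample));
```

The gadget could then assign the returned string to the edit box instead of the raw response. Returning plain text from the API would avoid this step altogether.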

https://id.wikisource.org/wiki/Pengguna:Mahastama/common.js seems to always load https://id.wikisource.org/wiki/Pengguna:Mahastama/OCR.js which makes a request to 3rd party host trawaca.id (which might be a Privacy concern here)?

Can someone provide a clear list of steps (step by step, click by click, as a list) how others can test this functionality?

I went to https://id.wikisource.org/wiki/Istimewa:Preferensi#mw-prefsection-gadgets and enabled OCR: Aktifkan tombol OCR Button dalam ruangnama Halaman.. But when I edit a page and press the "OCR" button in the toolbar of the editor, it tries to contact phetools.toolforge.org instead.

Can someone provide a clear list of steps (step by step, click by click, as a list) how others can test this functionality? See my previous comment. Thanks.

Dear @Aklapper, I am sorry for the late response.

Trawaca is the name of the team behind the Javanese OCR development, which consists of @dwika and @Mahastama. https://trawaca.id/ is a site that serves as the hub of the project (for annotating training data, a Javanese learning application, the Javanese OCR that is being implemented in Wikisource, etc.).

At first, we thought that the OCR script should be embedded in Indonesian Wikisource (https://id.wikisource.org/), but after discussing it, we think it is better to put the script on Wikisource.org (https://wikisource.org/): even though most Javanese speakers are Indonesian, Javanese is a different language, and it currently doesn't have its own Wikisource.

I copied the code from https://id.wikisource.org/wiki/Pengguna:Mahastama/OCR.js into https://wikisource.org/wiki/User:Fexpr/OCR.js. I only changed the icon so that it uses the Trawaca icon. The reason is to make sure that users won't confuse it with other OCR icons (the current version of the code on Mahastama's page still uses the Google OCR icon).

So, if other users want to test its functionality:

  1. Create a User:<USERNAME>/common.js page, e.g. https://wikisource.org/wiki/User:Fexpr/common.js
  2. Add the following line to that common.js page:
importScript('User:Fexpr/OCR.js');
  3. Open a file that contains Javanese script (for example, this one: https://wikisource.org/wiki/Index:Serat_Babad_Surakarta_Volume_4.pdf), and then click a page that you want to proofread (for example, this page: https://wikisource.org/w/index.php?title=Page:Serat_Babad_Surakarta_Volume_4.pdf/7&action=edit&redlink=1).
  4. Note the brown button in the toolbar, which shows an image like this: https://commons.wikimedia.org/wiki/File:Trawaca_logo.png
  5. The button sends the image blob to the Trawaca server at https://trawaca.id/, and the Trawaca server sends back the recognized characters.
  6. Currently, the returned characters are still in JSON format, and the team is trying another workaround so that the characters sent to Wikisource are formatted in Unicode.
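
Regarding the JSON-formatted result in the last step: if the reply is valid JSON, the gadget can decode it on the client, since `JSON.parse` already converts `\uXXXX` escapes into the actual characters. The sketch below assumes a reply shaped like `{"result": "..."}`; the field name `result` and the function name are hypothetical, because the real response format of the Trawaca API is not described here.

```javascript
// Sketch: decode a JSON reply and return the OCR text as an ordinary string.
// The field name "result" is hypothetical.
function ocrTextFromJson(body) {
  const data = JSON.parse(body);
  if (!data || typeof data.result !== 'string') {
    throw new Error('Unexpected OCR reply: no "result" string');
  }
  return data.result;
}

// JSON.parse turns \uXXXX escapes into the actual characters,
// so the returned value is ready to be placed in the edit box.
const reply = '{"result": "\\ua9b2\\ua9a4"}';
console.log(ocrTextFromJson(reply));
```

With this approach, the server could keep returning JSON and only the gadget would need to change.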

If you want to test the OCR on the Trawaca page, you can go to https://trawaca.id/ocrjawa/ and upload a .jpg image of Javanese script (preferably in high resolution). Click the "Pilih..." button to choose the image and then click Submit. The Javanese script in Unicode will then be shown on the right side.

For the backend part, @Mahastama is in charge of the Trawaca server, the APIs, and how they communicate with Wikisource.
@Mahastama, would you add more details regarding this issue?

Thanks for the explanation - much appreciated! :)
https://wikisource.org/wiki/User:Fexpr/OCR.js calls var toolUrl = "https://trawaca.id/api/default/ocrjawa";.
I'm wondering if calling a third-party, external website without a warning might be seen as a violation of users' privacy. :-/

Adding Privacy Engineering per my last comment to get some guidance (please remove this tag once provided) - thanks in advance!

Hi @Aklapper! I'm sorry for the late response. @Mahastama has already added a warning box that is shown to the user before the image is sent to trawaca.id. Is that okay for now, or is there anything else we can do to resolve this privacy issue?

image.png (1×1 px, 945 KB)
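
For reference, a consent step of the kind described in the last comment could look roughly like the sketch below. It is not the code that was actually added to OCR.js: the function names and the wording of the message are hypothetical, and whether a confirmation dialog is enough to satisfy the privacy policy is exactly the question still open in this task.

```javascript
// Sketch: ask for consent before an image is sent to an external OCR host.
// `confirmFn` defaults to the browser's window.confirm and can be replaced in tests.
function confirmExternalOcr(host, confirmFn) {
  const ask = confirmFn ||
    (typeof window !== 'undefined' ? window.confirm.bind(window) : null);
  if (!ask) {
    return false; // no way to ask, so do not send anything
  }
  return ask(
    'The page image will be sent to the external site ' + host +
    ' for OCR. Continue?'
  ) === true;
}

// Example wiring: only call the OCR service when the user agrees.
function runOcrIfAllowed(host, sendFn, confirmFn) {
  if (!confirmExternalOcr(host, confirmFn)) {
    return false;
  }
  sendFn();
  return true;
}
```

In the gadget, `sendFn` would be the function that posts the page image to the OCR server, so nothing leaves the wiki unless the user has agreed.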