Document how the OCR tool will use the Transkribus API
Closed, ResolvedPublic

Description

The tool has two current OCR 'engines': Tesseract and Google Cloud Vision API. The former is used directly on the same web server that the tool runs on, and the latter by sending API requests to the external service. Integrating Transkribus will be similar to the Google service, but of course with a different API structure. This task is to determine what that API usage will look like. (Note that this task is not concerned with modifying the OCR Tool's own API.)

Initial ideas and questions:

  • Transkribus API docs: https://readcoop.eu/transkribus/docu/rest-api/
  • Developers can register their own personal accounts on Transkribus and get an API key to use during development. The OCR Tool will have its own production API key (and all users will operate via that).
  • The tool has the following data available:
    • A URL of the image, or the image itself. This is a scaled-down version generally about 1000px across. This may also be a pre-cropped part of a larger page.
    • A language code, which we'll map to an existing Transkribus model (or small number of models?).
  • Should all images be uploaded to the same collection? Will we delete them immediately after text-extraction is finished?
  • How do we handle layout analysis? Can we submit layout regions at the time of submitting the OCR job? Do we need our own way of storing layout data on the Wikimedia side (e.g. similar to the Image-Annotator)?
  • Do we expect the user to interact with the Transkribus UI ever? This seems unlikely as we're not expecting them to have their own accounts.
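
The language-to-model mapping mentioned in the list above could start out as a plain lookup table. This is only a minimal sketch: the function name, the constant name, and the model IDs are all hypothetical placeholders, not real Transkribus models.

```php
<?php
// Hypothetical mapping from language codes to Transkribus text-recognition
// model IDs. The IDs are placeholders for illustration only.
const TRANSKRIBUS_MODELS = [
    'de' => [ 1001 ],
    'en' => [ 1002, 1003 ],
];

/**
 * Return the candidate model IDs for a language code, or an empty array
 * if no model has been mapped to that language yet.
 *
 * @param string $langCode For example "de" or "de-AT"
 * @return int[]
 */
function getModelsForLanguage( string $langCode ): array {
    // Fall back from a regional variant ("de-AT") to its base language ("de").
    $base = strtolower( explode( '-', $langCode )[0] );
    return TRANSKRIBUS_MODELS[$base] ?? [];
}
```

Returning a list rather than a single ID leaves room for the "small number of models" case raised above.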

Event Timeline

With regard to layout storage, maybe once we implement T294903 we will be able to let users send layout data directly?

Thanks @KLawal-WMF that's great.

Would you mind copying it to here on Phabricator (as a comment is fine), so it's more searchable and to avoid linkrot in the future? (The Remarkup Reference might be useful.)

Wikisource:Transkribus OCR

The Transkribus OCR tool adds a Page-namespace toolbar button that will derive text from the current page's image, via Transkribus OCR service.

Authentication request flow

Authentication follows the OAuth 2.0/OpenID Connect specification. The integration uses the password grant (grant_type set to password), and processing-api-client is the recommended client_id.

You’ll need the following to get started:

  • A Transkribus account
  • An HTTP library capable of making OAuth 2 requests.

If you need a Transkribus account you can sign up here to get started. The examples in this document require PHP 7.4+ and use curl as the HTTP library.
N.B. Access to the Metagrapho API is granted only on request.

Authorization Example:

// Request an access token using the password grant.
$postData = [
    "username" => "example@wikimedia.org",
    "password" => "password",
    "grant_type" => "password",
    "client_id" => "processing-api-client",
    "scope" => "offline_access",
];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://account.readcoop.eu/auth/realms/readcoop/protocol/openid-connect/token");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/x-www-form-urlencoded'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// Decode the JSON response into an associative array.
$data = json_decode($response, true);

The response includes the following.

{
   "access_token": "eyJhbGciO…",
   "expires_in": 299,
   "refresh_expires_in": 0,
   "refresh_token": "eyJhbGc..",
   "token_type": "bearer",
   "not-before-policy": 0,
   "session_state": "c413c3f8-80b4-4f97-a81b-655c3e4ccec7",
   "scope": "profile offline_access email"
}

The refresh token is valid for 30 days when not used. The response should be stored for later use.
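
Because the access token is short-lived (expires_in is 299 seconds in the response above), it may help to store an absolute expiry time alongside the response. The following is a minimal sketch of that idea. The function names, the expires_at key, and the 30-second safety margin are all assumptions made for illustration and are not part of the Transkribus API.

```php
<?php
/**
 * Add an absolute expiry timestamp to a decoded token response so that it
 * can be stored and checked later. "expires_at" is our own key, not a
 * field returned by Transkribus.
 */
function withExpiry( array $tokenResponse, int $now ): array {
    $tokenResponse['expires_at'] = $now + (int)$tokenResponse['expires_in'];
    return $tokenResponse;
}

/**
 * Whether the stored access token should be refreshed before the next
 * request. The default 30-second margin is an arbitrary allowance for
 * clock drift and network latency.
 */
function needsRefresh( array $storedToken, int $now, int $margin = 30 ): bool {
    return $now >= $storedToken['expires_at'] - $margin;
}

// Usage with a placeholder token and a fixed "current time" of 1000.
$stored = withExpiry( [ 'access_token' => 'example-access-token', 'expires_in' => 299 ], 1000 );
```

In real use the second argument would be time(); passing it in explicitly keeps the functions easy to test.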

Refresh the Access Token:

The access token is short-lived (expires_in is 299 seconds in the response above). Once it has expired, all requests to the Transkribus API return an unauthorized error. To correct the error, refresh the access token like so.

// Take the refresh token from the stored response of the earlier token request.
$previousTranskribusResponse = $data = json_decode($response, true);
$refreshToken = $previousTranskribusResponse['refresh_token'];

$postData = [
    "username" => "example@wikimedia.org",
    "grant_type" => "refresh_token",
    "client_id" => "processing-api-client",
    "refresh_token" => $refreshToken,
];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://account.readcoop.eu/auth/realms/readcoop/protocol/openid-connect/token");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/x-www-form-urlencoded'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

// Decode the JSON response into an associative array.
$data = json_decode($response, true);

The response includes the following.

{
   "access_token": "eyJhbGciO…",
   "expires_in": 299,
   "refresh_expires_in": 0,
   "refresh_token": "eyJhbGc..",
   "token_type": "bearer",
   "not-before-policy": 0,
   "session_state": "c413c3f8-80b4-4f97-a81b-655c3e4ccec7",
   "scope": "profile offline_access email"
}

Data processing

POST https://transkribus.eu/processing/v1/processes

A POST request to this endpoint creates a new process on the server. The request must set the config and image fields.
The config field specifies how the image is to be processed. Two types of model can be set: a model for layout (lineDetection) and a model for text (textRecognition). An example config is shown below:

"config": {
   "textRecognition": {
     "htrId": 38230
   },
   "lineDetection": {
     "modelId": 38230
   }
}

The values for modelId and htrId can be found here.
If no config is set, the default is line recognition.

The image field contains the image to process: either the base64-encoded binary data or a reference to a publicly available image (URL). An example image field, showing both keys, is below:

"image": {
     "imageUrl": "https://upload.wikimedia.org/wikipedia/commons/6/68/Fragmented_OCR_segments.png?20111217214954",
     "base64": "/9j/4AAQSkZJRgABAQ…"
}
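
The two fields described above can be assembled into a request body by a small helper. This is a sketch under stated assumptions: the function name is hypothetical, only a text-recognition model is set, and the URL and base64 forms are treated as mutually exclusive, which the description suggests but the example above does not confirm.

```php
<?php
/**
 * Build the JSON body for POST /processing/v1/processes.
 *
 * @param int $htrId Text-recognition model ID
 * @param string|null $imageUrl Publicly reachable image URL
 * @param string|null $imageData Raw (not yet encoded) image bytes
 * @return string JSON request body
 */
function buildProcessRequest( int $htrId, ?string $imageUrl, ?string $imageData = null ): string {
    // Assumption: exactly one image source is supplied.
    if ( ( $imageUrl === null ) === ( $imageData === null ) ) {
        throw new InvalidArgumentException( 'Provide either an image URL or image data, but not both.' );
    }
    $image = $imageUrl !== null
        ? [ 'imageUrl' => $imageUrl ]
        : [ 'base64' => base64_encode( $imageData ) ];

    return json_encode( [
        'config' => [ 'textRecognition' => [ 'htrId' => $htrId ] ],
        'image' => $image,
    ], JSON_UNESCAPED_SLASHES );
}
```

Since the OCR tool may hold either a URL or the image itself, one function that accepts both keeps the calling code the same in either case.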

Submit Data for processing:

The following example shows how to begin a process:

// Take the access token from the stored response of the token request.
$previousTranskribusResponse = $data = json_decode($response, true);
$accessToken = $previousTranskribusResponse["access_token"];

$postData = [
    // Text-recognition model, as in the config example above.
    "config" => [
        "textRecognition" => [ "htrId" => 38230 ],
    ],
    "image" => [
        "imageUrl" => "https://upload.wikimedia.org/wikipedia/commons/6/68/Fragmented_OCR_segments.png?20111217214954",
    ],
];

$authorization = "Authorization: Bearer " . $accessToken;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://transkribus.eu/processing/v1/processes");
curl_setopt($ch, CURLOPT_POST, 1);
// The body is sent as JSON so that it matches the Content-Type header.
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($postData));
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json', $authorization));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);

The response includes the following.

{
 "processId": 47725,
 "status": "CREATED"
}

The response should be stored for later use. The processId is needed to retrieve the status and result.

Retrieve processing status and result

Webhooks currently can't be set up to listen for changes in processing status, so we'll have to check periodically for changes using the processId. Processing time depends on the size of the image and can be anywhere from 5 to 20+ seconds.
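
The periodic check described above could be wrapped in a small polling loop. This is a sketch only: the function name, the attempt limit, and the delay are assumptions, and the HTTP call is passed in as a callable so that the loop stays independent of curl. Only the CREATED and FINISHED statuses appear in this document, so any other status (for example a failure state) is treated here as "not finished yet".

```php
<?php
/**
 * Poll a process until its status is FINISHED or the attempts run out.
 *
 * @param callable $fetchStatus function( int $processId ): array, returning the decoded JSON response
 * @param int $processId ID returned when the process was created
 * @param int $maxAttempts Hypothetical default; tune to real processing times
 * @param int $delaySeconds Pause between attempts
 * @return array|null The final response, or null if it never finished in time
 */
function waitForProcess( callable $fetchStatus, int $processId, int $maxAttempts = 15, int $delaySeconds = 2 ): ?array {
    for ( $attempt = 1; $attempt <= $maxAttempts; $attempt++ ) {
        $result = $fetchStatus( $processId );
        if ( ( $result['status'] ?? null ) === 'FINISHED' ) {
            return $result;
        }
        // Do not sleep after the final attempt.
        if ( $attempt < $maxAttempts ) {
            sleep( $delaySeconds );
        }
    }
    return null;
}
```

With the defaults this gives up after roughly 30 seconds, which sits a little above the 5 to 20+ second range quoted above.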

Below is an example of how to retrieve a processing status and result:

$authorization = "Authorization: Bearer " . $accessToken;
$processingID = 47725;
$URL = "https://transkribus.eu/processing/v1/processes/" . $processingID;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json', $authorization));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);

The response includes the following.

{
 "processId": 3866314,
 "status": "FINISHED",
 "content": {
   "text": "Eisenschrott",
   "regions": [
     {
       "id": "region_1",
       "coords": {
         "points": "0,0 282,0 280,109 0,108"
       },
       "lines": [
         {
           "id": "line_1",
           "coords": {
             "points": "0,109 55,89 79,91 96,106 131,100 152,109 196,75 243,73 254,82 281,70 281,13 242,11 228,-1 173,15 146,3 98,13 65,-2 1,8"
           },
           "baseline": {
             "points": "6,64 276,65"
           },
           "text": "Eisenschrott",
           "words": [
             {
               "id": "line_1_w1",
               "coords": {
                 "points": "11,24 11,84 278,84 278,24"
               },
               "text": "Eisenschrott"
             }
           ]
         }
       ]
     }
   ]
 }
}
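
To use a result like the one above, the tool would mainly need the line texts, and possibly the coordinates. The helpers below are a minimal sketch of reading that structure. The function names are hypothetical, and the response layout is assumed to match the example exactly.

```php
<?php
/**
 * Collect the text of every line in a FINISHED process response, in order.
 *
 * @param array $response Decoded JSON response
 * @return string[]
 */
function extractLines( array $response ): array {
    $lines = [];
    foreach ( $response['content']['regions'] ?? [] as $region ) {
        foreach ( $region['lines'] ?? [] as $line ) {
            $lines[] = $line['text'] ?? '';
        }
    }
    return $lines;
}

/**
 * Convert a "points" string such as "0,0 282,0" into [ [ x, y ], ... ] pairs.
 *
 * @param string $points Space-separated "x,y" pairs
 * @return int[][]
 */
function parsePoints( string $points ): array {
    $pairs = [];
    foreach ( preg_split( '/\s+/', trim( $points ), -1, PREG_SPLIT_NO_EMPTY ) as $pair ) {
        [ $x, $y ] = explode( ',', $pair );
        $pairs[] = [ (int)$x, (int)$y ];
    }
    return $pairs;
}
```

Joining the output of extractLines() with newlines would give the page text to insert into the editor, while parsePoints() covers the region, line, baseline, and word coordinates alike.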

Thanks @KLawal-WMF this is really good.