
Review Caighdean MT service
Closed, ResolvedPublic

Description

Project upstream: https://github.com/kscanne/caighdean

  • API is reviewed
  • Infrastructure needs to activate this service are determined

Event Timeline

Arrbee triaged this task as Medium priority. Feb 8 2017, 6:47 AM

I can get it to translate via the linked service:

  • I did not evaluate the possibility of setting up the software ourselves, as using an external service seems preferable in my opinion.
  • I did not evaluate whether the service can sustain a load, as doing that without asking permission would be unkind. I think they should comment on this from their end. In any case, given the language pairs they support, I am not expecting high load.

The total payload limit is 16 KiB, but I cannot get the service to produce error 413 as documented:

  • This does not matter much, as we should check the limit on our side before sending.
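A client-side check along these lines might be enough. This is only a sketch: it assumes the 16 KiB limit applies to the percent-encoded text, which I have not confirmed, and `within_limit` is a made-up helper name.

```python
from urllib.parse import quote_plus

MAX_PAYLOAD_BYTES = 16 * 1024  # 16 KiB limit mentioned above


def within_limit(text: str, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Return True if the percent-encoded text fits within the limit.

    Assumption: the limit is measured on the encoded form, where each
    non-ASCII UTF-8 byte grows to three ASCII characters.
    """
    encoded = quote_plus(text)  # pure ASCII, so len() equals the byte count
    return len(encoded) <= limit
```

Longer input would have to be split into chunks (for example per paragraph) before sending.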

Spaces get lost due to the way the results are formatted:

  • Can be worked around by replacing spaces with <space> or such, but this is not ideal.
  • I think it would be easy for them to include spaces as tokens in the result.

The service can translate HTML formatted input, which is good.

  • We could use the alignment information given by their engine, but it seems easier to just use our standard HTML based alignment.

During testing, I wrote a simple client for the Translate extension (not cxserver).

The point about spaces turns out to be more problematic: it is not easy for them to include the spaces in the output because of multi-word expressions. For the same reason, the workaround suggested above is not a good idea. The options seem to be to write an algorithm for adding spaces back, and/or to mark the places where no space is expected.

Can you give a sample input and output here from the API? (preferably HTML input)

The API documentation has this example:
Input:

Agus thubhairt e,
"Iongantach!" an dèidh sin.

Output:

[["Agus","Agus"],["thubhairt","dúirt"],["e","sé"],[",",","],["\\n","\\n"],["\"","\""],["Iongantach","Iontach"],["!","!"],["\"","\""],["an dèidh sin","ina dhiaidh sin"],[".","."]]
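To illustrate the "adding spaces back" option on this example, here is a rough sketch. The spacing rules (no space before closing punctuation, straight quotes treated as alternating open/close) are my own heuristic and not something the service provides; `join_targets` is a hypothetical name.

```python
import json

# Response body copied from the documentation example above.
SAMPLE = r'''[["Agus","Agus"],["thubhairt","dúirt"],["e","sé"],[",",","],["\\n","\\n"],["\"","\""],["Iongantach","Iontach"],["!","!"],["\"","\""],["an dèidh sin","ina dhiaidh sin"],[".","."]]'''

NO_SPACE_BEFORE = {",", ".", "!", "?", ";", ":", ")"}
NO_SPACE_AFTER = {"("}


def join_targets(pairs):
    """Join the target side of [source, target] pairs, guessing the spacing."""
    pieces = []
    quote_open = False
    suppress_space = True  # no leading space at the start of a line
    for _source, target in pairs:
        if target in ("\n", "\\n"):  # newline token, literal or escaped
            pieces.append("\n")
            suppress_space = True
            continue
        if target == '"':
            if not quote_open and not suppress_space:
                pieces.append(" ")  # space before an opening quote
            pieces.append('"')
            suppress_space = not quote_open  # no space right after an opening quote
            quote_open = not quote_open
            continue
        if target not in NO_SPACE_BEFORE and not suppress_space:
            pieces.append(" ")
        pieces.append(target)
        suppress_space = target in NO_SPACE_AFTER
    return "".join(pieces)


print(join_targets(json.loads(SAMPLE)))
```

Multi-word expressions such as "ina dhiaidh sin" arrive as a single token, so their internal spaces are preserved without extra work.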

I wonder how you interpret the word order of the translation (output), since the items are arranged in the token order of the source sentence.

https://gerrit.wikimedia.org/r/#/c/338983/ is the patch with the client for Translate extension.

I don't know enough about the languages to test a case where word order differs. I can ask about that.

On further testing, I found that the service has some problems: it does not return anything for input that contains strings such as bhrìgh or 日本語.

Input:

<p>Tha sgoilearan eile a' sìreadh tùs a' chànain ann an <a href="/w/index.php?title=S%C3%B2isealtasan&amp;action=edit&amp;redlink=1" class="new" title="Sòisealtasan (chan eil duilleag ann fhathast)">sòisealtasan</a> coltach ri Seapan a thaobh <a href="/w/index.php?title=Cultar&amp;action=edit&amp;redlink=1" class="new" title="Cultar (chan eil duilleag ann fhathast)">cultar</a> agus <a href="/wiki/Eige%C3%B2las" class="mw-redirect" title="Eigeòlas">eigeòlas</a> eadar ceann a Deas na <a href="/wiki/An_t-S%C3%ACn" class="mw-redirect mw-disambig" title="An t-Sìn">Sìne</a> agus <a href="/w/index.php?title=An_Himalaya&amp;action=edit&amp;redlink=1" class="new" title="An Himalaya (chan eil duilleag ann fhathast)">An Himalaya</a>. An dèidh sin, tha feadhainn eile a' cumail a-mach gur ionnan tùs a' chànain agus na <a href="/w/index.php?title=C%C3%A0nanan_Draibhideach&amp;action=edit&amp;redlink=1" class="new" title="Cànanan Draibhideach (chan eil duilleag ann fhathast)">cànanan Draibhideach</a> agus <a href="/w/index.php?title=C%C3%A0nanan_nan_ioma-eileanach&amp;action=edit&amp;redlink=1" class="new" title="Cànanan nan ioma-eileanach (chan eil duilleag ann fhathast)">cànanan nan ioma-eileanach</a>.(Polynesia) 'S ann tric a bhithear a' moladh gur e <a href="/w/index.php?title=Cr%C3%ACtheol&amp;action=edit&amp;redlink=1" class="new" title="Crìtheol (chan eil duilleag ann fhathast)">crìtheol</a> a th' anns a' chànan seo, a' gabhail a-steach tuilleadh air aon de na teangannan sin.</p>

Output (with spaces added by oracle algorithm):

<p>Tá daltaí eile ag lorg foinse na teanga i <a href="/w/index.php?title=S%C3%B2isealtasan&amp;action=edit&amp;redlink=1" class="new" title="Sòisealtasan (chan eil duilleag ann fhathast)">sochaithe</a> cosúil leis an tSeapáin i dtaobh <a href="/w/index.php?title=Cultar&amp;action=edit&amp;redlink=1" class="new" title="Cultar (chan eil duilleag ann fhathast)">cultúr</a> agus <a href="/wiki/Eige%C3%B2las" class="mw-redirect" title="Eigeòlas">éiceolaíocht</a> idir ceann Theas na <a href="/wiki/An_t-S%C3%ACn" class="mw-redirect mw-disambig" title="An t-Sìn">Síne</a> agus <a href="/w/index.php?title=An_Himalaya&amp;action=edit&amp;redlink=1" class="new" title="An Himalaya (chan eil duilleag ann fhathast)">An Himalaya</a>. Ina dhiaidh sin, tá dream eile ag coinneáil amach gur ionann tús an teanga agus na <a href="/w/index.php?title=C%C3%A0nanan_Draibhideach&amp;action=edit&amp;redlink=1" class="new" title="Cànanan Draibhideach (chan eil duilleag ann fhathast)">teangacha Dráivideacha</a> agus <a href="/w/index.php?title=C%C3%A0nanan_nan_ioma-eileanach&amp;action=edit&amp;redlink=1" class="new" title="Cànanan nan ioma-eileanach (chan eil duilleag ann fhathast)">teangacha na Polainéise</a>.(Polynesia) Is minic a bhítear ag moladh gurb é <a href="/w/index.php?title=Cr%C3%ACtheol&amp;action=edit&amp;redlink=1" class="new" title="Crìtheol (chan eil duilleag ann fhathast)">gcriól</a> atá sa theanga seo, ag gabháil isteach tuilleadh ar cheann de na teangacha sin.</p>

The variables should be posted using application/x-www-form-urlencoded in the body.
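A minimal sketch of building such a request with the Python standard library. The field names `foinse` (source language) and `teacs` (text) and the placeholder URL are assumptions here and should be checked against the upstream API documentation; `build_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://example.org/api"  # placeholder, not the real endpoint


def build_request(text: str, source_lang: str, url: str = API_URL) -> Request:
    """Build a POST request with a form-encoded body (nothing is sent here)."""
    # Assumed field names; urlencode percent-encodes non-ASCII text as UTF-8.
    body = urlencode({"foinse": source_lang, "teacs": text}).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_request("Tha sgoilearan", "gd")
print(req.data)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call.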

Word ordering is not a problem, since reordering in the engine only happens within multi-word expressions (MWEs), and those are transparent when processing the tokens in order. I was given an example:

Input: Ghow eh toshiaght daa laa er dy henney
Output:

Ghow eh toshiaght => Thosaigh sé
daa => dhá
laa => lá
er dy henney => ó shin
\n => \n
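Because each pair covers one contiguous span on both sides, alignment offsets can be derived by walking the pairs in order. A small sketch using the example above, assuming tokens are separated by exactly one space and leaving out the trailing newline pair; `align` is a hypothetical helper.

```python
PAIRS = [
    ("Ghow eh toshiaght", "Thosaigh sé"),  # MWE translated as one unit
    ("daa", "dhá"),
    ("laa", "lá"),
    ("er dy henney", "ó shin"),  # MWE translated as one unit
]


def align(pairs):
    """Return ((src_start, src_end), (tgt_start, tgt_end)) for each pair."""
    spans = []
    src_pos = tgt_pos = 0
    for src, tgt in pairs:
        spans.append(
            ((src_pos, src_pos + len(src)), (tgt_pos, tgt_pos + len(tgt)))
        )
        # Skip over the single space assumed between tokens.
        src_pos += len(src) + 1
        tgt_pos += len(tgt) + 1
    return spans


source = " ".join(src for src, _ in PAIRS)
target = " ".join(tgt for _, tgt in PAIRS)
for (s0, s1), (t0, t1) in align(PAIRS):
    print(f"{source[s0:s1]!r} -> {target[t0:t1]!r}")
```

Any reordering stays inside a single target token, so the spans never cross.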
Arrbee moved this task from In Review to Done on the Language-2017 Sprint 3 board.