Project upstream: https://github.com/kscanne/caighdean
- API is reviewed
- Infrastructure needs for activating this service are determined
I can get it to translate via the linked service:
The total payload limit is 16 KiB, but I cannot get it to produce error 413 as documented:
Spaces get lost due to the way the results are formatted:
The service can translate HTML formatted input, which is good.
During testing, I wrote a simple client for the Translate extension (not cxserver).
The point about spaces turns out to be more problematic, as it is not easy for them to include the spaces in the output because of multi-word expressions. For the same reason the specified workaround is not a good idea. The options seem to be to write an algorithm for adding spaces back, and/or mark the places where there is no space expected.
Can you give a sample input and output here from the API? (preferably HTML input)
The API documentation has this example
Input:
Agus thubhairt e, "Iongantach!" an dèidh sin.
Output:
[["Agus","Agus"],["thubhairt","dúirt"],["e","sé"],[",",","],["\\n","\\n"],["\"","\""],["Iongantach","Iontach"],["!","!"],["\"","\""],["an dèidh sin","ina dhiaidh sin"],[".","."]]
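To make the space problem concrete, here is a minimal sketch of an "add the spaces back" pass over the documented example output. The function name `join_targets` and the spacing rules (no space before closing punctuation, none just inside a pair of straight quotes, and a literal `\n` token treated as a line break) are my own assumptions for illustration, not something the service or its documentation specifies.

```python
import json

# Response copied verbatim from the API documentation example.
RESPONSE = r'''[["Agus","Agus"],["thubhairt","dúirt"],["e","sé"],[",",","],["\\n","\\n"],["\"","\""],["Iongantach","Iontach"],["!","!"],["\"","\""],["an dèidh sin","ina dhiaidh sin"],[".","."]]'''

# Assumption: these tokens attach to the preceding token without a space.
NO_SPACE_BEFORE = set(",.!?;:")


def join_targets(pairs):
    """Join the target side of [source, target] pairs with heuristic spacing."""
    out = []
    quote_open = False
    glue_next = True  # True when the next token must not be preceded by a space
    for _source, target in pairs:
        if target == "\\n":  # two-character token: backslash + n
            out.append("\n")
            glue_next = True
            continue
        if target == '"':
            if quote_open:
                out.append('"')  # closing quote hugs the previous token
                glue_next = False
            else:
                if not glue_next:
                    out.append(" ")
                out.append('"')  # opening quote hugs the next token
                glue_next = True
            quote_open = not quote_open
            continue
        if not glue_next and target not in NO_SPACE_BEFORE:
            out.append(" ")
        out.append(target)
        glue_next = False
    return "".join(out)


print(join_targets(json.loads(RESPONSE)))
```

Multi-word expressions such as "ina dhiaidh sin" arrive as a single token, so their internal spaces survive untouched. The heuristic will still guess wrong for cases like apostrophes or nested quotes, which is why marking the no-space positions in the output may be the more robust option.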
I wonder how you interpret the word order of the translation (output), since the items are arranged in the token order of the source sentence.
https://gerrit.wikimedia.org/r/#/c/338983/ is the patch with the client for Translate extension.
I don't know enough about the languages to test a case where word order differs. I can ask about that.
On further testing I found that the service has some problems: it does not return anything for input that contains strings such as bhrìgh or 日本語.
Input:
<p>Tha sgoilearan eile a' sìreadh tùs a' chànain ann an <a href="/w/index.php?title=S%C3%B2isealtasan&action=edit&redlink=1" class="new" title="Sòisealtasan (chan eil duilleag ann fhathast)">sòisealtasan</a> coltach ri Seapan a thaobh <a href="/w/index.php?title=Cultar&action=edit&redlink=1" class="new" title="Cultar (chan eil duilleag ann fhathast)">cultar</a> agus <a href="/wiki/Eige%C3%B2las" class="mw-redirect" title="Eigeòlas">eigeòlas</a> eadar ceann a Deas na <a href="/wiki/An_t-S%C3%ACn" class="mw-redirect mw-disambig" title="An t-Sìn">Sìne</a> agus <a href="/w/index.php?title=An_Himalaya&action=edit&redlink=1" class="new" title="An Himalaya (chan eil duilleag ann fhathast)">An Himalaya</a>. An dèidh sin, tha feadhainn eile a' cumail a-mach gur ionnan tùs a' chànain agus na <a href="/w/index.php?title=C%C3%A0nanan_Draibhideach&action=edit&redlink=1" class="new" title="Cànanan Draibhideach (chan eil duilleag ann fhathast)">cànanan Draibhideach</a> agus <a href="/w/index.php?title=C%C3%A0nanan_nan_ioma-eileanach&action=edit&redlink=1" class="new" title="Cànanan nan ioma-eileanach (chan eil duilleag ann fhathast)">cànanan nan ioma-eileanach</a>.(Polynesia) 'S ann tric a bhithear a' moladh gur e <a href="/w/index.php?title=Cr%C3%ACtheol&action=edit&redlink=1" class="new" title="Crìtheol (chan eil duilleag ann fhathast)">crìtheol</a> a th' anns a' chànan seo, a' gabhail a-steach tuilleadh air aon de na teangannan sin.</p>
Output (with spaces added by oracle algorithm):
<p>Tá daltaí eile ag lorg foinse na teanga i <a href="/w/index.php?title=S%C3%B2isealtasan&action=edit&redlink=1" class="new" title="Sòisealtasan (chan eil duilleag ann fhathast)">sochaithe</a> cosúil leis an tSeapáin i dtaobh <a href="/w/index.php?title=Cultar&action=edit&redlink=1" class="new" title="Cultar (chan eil duilleag ann fhathast)">cultúr</a> agus <a href="/wiki/Eige%C3%B2las" class="mw-redirect" title="Eigeòlas">éiceolaíocht</a> idir ceann Theas na <a href="/wiki/An_t-S%C3%ACn" class="mw-redirect mw-disambig" title="An t-Sìn">Síne</a> agus <a href="/w/index.php?title=An_Himalaya&action=edit&redlink=1" class="new" title="An Himalaya (chan eil duilleag ann fhathast)">An Himalaya</a>. Ina dhiaidh sin, tá dream eile ag coinneáil amach gur ionann tús an teanga agus na <a href="/w/index.php?title=C%C3%A0nanan_Draibhideach&action=edit&redlink=1" class="new" title="Cànanan Draibhideach (chan eil duilleag ann fhathast)">teangacha Dráivideacha</a> agus <a href="/w/index.php?title=C%C3%A0nanan_nan_ioma-eileanach&action=edit&redlink=1" class="new" title="Cànanan nan ioma-eileanach (chan eil duilleag ann fhathast)">teangacha na Polainéise</a>.(Polynesia) Is minic a bhítear ag moladh gurb é <a href="/w/index.php?title=Cr%C3%ACtheol&action=edit&redlink=1" class="new" title="Crìtheol (chan eil duilleag ann fhathast)">gcriól</a> atá sa theanga seo, ag gabháil isteach tuilleadh ar cheann de na teangacha sin.</p>
The variables should be posted using application/x-www-form-urlencoded in the body.
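As a rough sketch of what such a request could look like, the following builds (but does not send) a form-encoded POST and checks the 16 KiB limit mentioned earlier before sending. The URL is a placeholder, and the field names `foinse` (source language) and `teacs` (text) are assumptions on my part that should be checked against the upstream API documentation. Whether the service counts the limit against the encoded body is also an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder; substitute the real service endpoint.
API_URL = "https://example.org/api/translate"

# Total payload limit mentioned in this task (16 KiB).
MAX_PAYLOAD_BYTES = 16 * 1024


def build_request(text, source_lang):
    """Build a form-encoded POST request without sending it."""
    # Assumed field names: "foinse" = source language code, "teacs" = text.
    # urlencode percent-encodes non-ASCII text as UTF-8, so the result is ASCII.
    body = urlencode({"foinse": source_lang, "teacs": text}).encode("ascii")
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"payload is {len(body)} bytes, limit is {MAX_PAYLOAD_BYTES}"
        )
    return Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_request("Agus thubhairt e", "gd")
print(req.get_method(), req.data)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call. Checking the size client-side avoids depending on the 413 response, which I could not reproduce.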
Word ordering is not a problem, since reordering in the engine only happens within multi-word expressions (MWEs), and those are transparent when processing the tokens in order. I was given an example:
Input: Ghow eh toshiaght daa laa er dy henney
Output:
Ghow eh toshiaght => Thosaigh sé
daa => dhá
laa => lá
er dy henney => ó shin
\n => \n