
Provide a service to detect which language the user is writing in
Closed, ResolvedPublic

Description

Reliably guessing which language the user is writing in can be useful to seamlessly integrate language tools in different contexts such as T295862: View translated messages in Talk pages. In this way the interactions for getting an automatic translation are simplified, not requiring the user to figure out and indicate which is the source language.

An initial iteration providing a basic version of the service is provided in T334465: MinT: Detect language of source content automatically.
Further development, as part of other sub-tasks, will be needed to cover more languages in a reliable way.


Related: T340507: Create a language detection service in LiftWing

Event Timeline

Pginer-WMF raised the priority of this task from to Medium.
Pginer-WMF updated the task description. (Show Details)

I'm not sure what this is about. Is it supposed to be in CX?

Implementation wise it should be separate API module or service and not tied to CX.

A language detection service can be relevant in the context of the upcoming hosting of machine translation (T331505). Integration with the services would open the possibility to have the source language as an optional parameter. In such case, the source language would be detected automatically using the service proposed in the current ticket.

In T334465: MinT: Detect language of source content automatically we used pycld2, a simple, older library for language detection, since the use case was a silent attempt to detect the language while users make their choice on top of it. This library can detect only ~83 languages.

For a general-purpose language detection service covering a large set of languages, the best-known approach is the fastText language classification model, which can detect 176 languages.
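As a sketch of how the fastText model is typically consumed: it emits labels like `__label__de` with a confidence score. The model-loading calls are shown only as comments, since they assume the `fasttext` package and a downloaded lid.176 model binary; the prefix-stripping helper below is a hypothetical illustration.

```python
# The real calls (commented out) assume the `fasttext` package and the
# published lid.176 model from the fastText language-identification page:
#
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   labels, scores = model.predict("Die Sonne ist der Stern ...")
#   # labels would be a tuple like ('__label__de',)

def label_to_code(label: str) -> str:
    """fastText returns labels like '__label__de'; strip the prefix
    to get a plain language code."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label

print(label_to_code("__label__de"))  # -> de
```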

Change 908223 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Use fasttext based language detection

https://gerrit.wikimedia.org/r/908223

To support more languages, we can follow the documentation at https://fasttext.cc/blog/2017/10/02/blog-post.html
However, this requires a large training dataset with samples for all the languages we want to support. For example, tum is not supported by fastText, so we need a dataset of tum-language sentences (about 100K, for example).
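For reference, fastText's supervised training format puts the label and the sentence on a single line. A small sketch of emitting that format (the helper name and the sample sentences are hypothetical; the CLI invocation in the comment is the documented fastText supervised training command):

```python
def to_training_line(lang_code: str, sentence: str) -> str:
    """Format one example in fastText's supervised input format:
    '__label__<code> <sentence>' on a single line."""
    return f"__label__{lang_code} {sentence.strip()}"

# Placeholder samples; a real dataset would need ~100K sentences per language.
samples = [
    ("tum", "Example sentence in Tumbuka."),
    ("de", "Die Sonne ist ein Stern."),
]
for code, sent in samples:
    print(to_training_line(code, sent))

# Training would then run roughly as:
#   fasttext supervised -input train.txt -output langid_model
```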

So this becomes a task to prepare a dataset with a large number of sentence samples for every language we want to support. Such a dataset would be useful in many other contexts too. It is not difficult to create: maybe a crawler with a sentence splitter is a good starting point.
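As a starting point for the crawler idea, a deliberately naive regex-based sentence splitter is sketched below. This is an assumption-laden illustration only: a real corpus pipeline needs per-language rules (abbreviations, scripts with other sentence terminators such as the Devanagari danda).

```python
import re

def split_sentences(text: str) -> list[str]:
    """Very naive splitter: break on '.', '!' or '?' followed by
    whitespace. Real pipelines need language-aware splitting."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("One sentence. Another one! A third?"))
# -> ['One sentence.', 'Another one!', 'A third?']
```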

From my experience, fastText is quite fast and can do this training on sufficiently capable CPUs.

Change 932828 had a related patch set uploaded (by Santhosh; author: Santhosh):

[machinelearning/liftwing/inference-services@main] Add language identification service

https://gerrit.wikimedia.org/r/932828

I came across this paper recently: https://arxiv.org/pdf/2305.13820.pdf. It explains the process of creating larger language identification models, and the authors also provide a model that covers 201 languages.

As we have LiftWing, which makes it easy to host these models, I submitted a patch to add the service.

@santhosh Hi! Very happy to see this request :) Do you mind opening a separate task as indicated in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_a_model ?

Our way of doing things is to assign an ML Engineer to every request, so that they can discuss the use case with the requester and follow up accordingly (for example, to set up CI, upload the model binary, etc.)

Change 933092 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[integration/config@master] inference-services: add langid CI pipelines

https://gerrit.wikimedia.org/r/933092

Change 933092 merged by jenkins-bot:

[integration/config@master] inference-services: add langid CI pipelines

https://gerrit.wikimedia.org/r/933092

Change 908223 abandoned by Santhosh:

[mediawiki/services/machinetranslation@master] language detection: Use fasttext

Reason:

This will be a new service hosted in lift wing

https://gerrit.wikimedia.org/r/908223

Awesome!
Tried the API with a sentence from the German Wikipedia, and it was correctly identified as German.

For the example I used this request:

curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Die Sonne ist der Stern, der der Erde am nächsten ist und das Zentrum des Sonnensystems bildet."}' -H "Content-type: application/json"

and got the correct answer:

{"language":"deu_Latn","wikicode":"de","languagename":"German","score":1.0000075101852417}
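The response above can be consumed directly as JSON; a small sketch parsing the fields shown (the payload string is copied from the reply above, not fetched from the API, and the field interpretations follow what the response itself exposes):

```python
import json

# Response payload copied verbatim from the curl request above.
payload = ('{"language":"deu_Latn","wikicode":"de",'
           '"languagename":"German","score":1.0000075101852417}')

result = json.loads(payload)
# "language" is the model's label (ISO 639-3 code plus script),
# "wikicode" is the corresponding wiki language code.
print(result["wikicode"], result["languagename"], round(result["score"], 2))
# -> de German 1.0
```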