
Provide a service to detect which language the user is writing in
Closed, ResolvedPublic

Description

Reliably guessing which language the user is writing in can be useful to seamlessly integrate language tools in different contexts such as T295862: View translated messages in Talk pages. In this way the interactions for getting an automatic translation are simplified, not requiring the user to figure out and indicate which is the source language.

An initial iteration providing a basic version of the service is provided in T334465: MinT: Detect language of source content automatically.
Further development, as part of other sub-tasks, will be needed to cover more languages in a reliable way.


Related: T340507: Create a language detection service in LiftWing

Event Timeline

Pginer-WMF raised the priority of this task from to Medium.
Pginer-WMF updated the task description. (Show Details)

I'm not sure what this is about. Is it supposed to be in CX?

Implementation wise it should be separate API module or service and not tied to CX.

A language detection service can be relevant in the context of the upcoming hosting of machine translation (T331505). Integration with the services would open the possibility to have the source language as an optional parameter. In such case, the source language would be detected automatically using the service proposed in the current ticket.

In T334465: MinT: Detect language of source content automatically we used pycld2, a simple, older library for language detection, since the use case was a silent attempt to detect the language while users make their choice on top of it. This library can detect only ~83 languages.

For a general-purpose language detection service covering a large set of languages, the best-known approach is the fastText language classification model, which can detect 176 languages.
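As a sketch of how the fastText model is typically consumed: it emits labels like `__label__de` with a confidence score. The model-loading calls are shown only as comments, since they assume the `fasttext` package and a downloaded lid.176 model binary; the prefix-stripping helper below is a hypothetical illustration.

```python
# The real calls (commented out) assume the `fasttext` package and the
# published lid.176 model from the fastText language-identification page:
#
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   labels, scores = model.predict("Die Sonne ist der Stern ...")
#   # labels would be a tuple like ('__label__de',)

def label_to_code(label: str) -> str:
    """fastText returns labels like '__label__de'; strip the prefix
    to get a plain language code."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label

print(label_to_code("__label__de"))  # -> de
```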

Change 908223 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Use fasttext based language detection

https://gerrit.wikimedia.org/r/908223

To support more languages, we can follow the documentation at https://fasttext.cc/blog/2017/10/02/blog-post.html
However, this requires a large training dataset with samples for all the languages we want to support. For example, tum is not supported by fastText, so we need a dataset of tum-language sentences (about 100K, for example).
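For reference, fastText's supervised training format puts the label and the sentence on a single line. A small sketch of emitting that format (the helper name and the sample sentences are hypothetical; the CLI invocation in the comment is the documented fastText supervised training command):

```python
def to_training_line(lang_code: str, sentence: str) -> str:
    """Format one example in fastText's supervised input format:
    '__label__<code> <sentence>' on a single line."""
    return f"__label__{lang_code} {sentence.strip()}"

# Placeholder samples; a real dataset would need ~100K sentences per language.
samples = [
    ("tum", "Example sentence in Tumbuka."),
    ("de", "Die Sonne ist ein Stern."),
]
for code, sent in samples:
    print(to_training_line(code, sent))

# Training would then run roughly as:
#   fasttext supervised -input train.txt -output langid_model
```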

So this becomes a task to prepare a dataset with a large number of sentence samples for every language we want to support. Such a dataset would be useful in many other contexts too. It is not difficult to create: maybe a crawler with a sentence splitter is a good starting point.
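As a starting point for the crawler idea, a deliberately naive regex-based sentence splitter is sketched below. This is an assumption-laden illustration only: a real corpus pipeline needs per-language rules (abbreviations, scripts with other sentence terminators such as the Devanagari danda).

```python
import re

def split_sentences(text: str) -> list[str]:
    """Very naive splitter: break on '.', '!' or '?' followed by
    whitespace. Real pipelines need language-aware splitting."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("One sentence. Another one! A third?"))
# -> ['One sentence.', 'Another one!', 'A third?']
```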

From my experience, fastText is quite fast and can do this training on sufficiently capable CPUs.

Change 932828 had a related patch set uploaded (by Santhosh; author: Santhosh):

[machinelearning/liftwing/inference-services@main] Add language identification service

https://gerrit.wikimedia.org/r/932828

I came across this paper recently: https://arxiv.org/pdf/2305.13820.pdf. It explains the process of creating larger language identification models, and the authors also provide a model that covers 201 languages.

As we have LiftWing, which makes it easy to host these models, I submitted a patch to add the service.

@santhosh Hi! Very happy to see this request :) Do you mind opening a separate task as indicated in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_a_model ?

Our way of doing things is to assign an ML Engineer to every request, so that they can discuss the use case with the requester and follow up accordingly (for example, to set up CI, upload the model binary, etc.)

Change 933092 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[integration/config@master] inference-services: add langid CI pipelines

https://gerrit.wikimedia.org/r/933092

Change 933092 merged by jenkins-bot:

[integration/config@master] inference-services: add langid CI pipelines

https://gerrit.wikimedia.org/r/933092

Change 908223 abandoned by Santhosh:

[mediawiki/services/machinetranslation@master] language detection: Use fasttext

Reason:

This will be a new service hosted in lift wing

https://gerrit.wikimedia.org/r/908223

Awesome!
Tried the API with a sentence from the German Wikipedia, and it was correctly identified as German.

For the example I used this request:

curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Die Sonne ist der Stern, der der Erde am nächsten ist und das Zentrum des Sonnensystems bildet."}' -H "Content-type: application/json"

and got the correct answer:

{"language":"deu_Latn","wikicode":"de","languagename":"German","score":1.0000075101852417}
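The response above can be consumed directly as JSON; a small sketch parsing the fields shown (the payload string is copied from the reply above, not fetched from the API, and the field interpretations follow what the response itself exposes):

```python
import json

# Response payload copied verbatim from the curl request above.
payload = ('{"language":"deu_Latn","wikicode":"de",'
           '"languagename":"German","score":1.0000075101852417}')

result = json.loads(payload)
# "language" is the model's label (ISO 639-3 code plus script),
# "wikicode" is the corresponding wiki language code.
print(result["wikicode"], result["languagename"], round(result["score"], 2))
# -> de German 1.0
```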