[langid] fasttext only processes one line at a time
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
The following request returns an error:

curl https://api.wikimedia.org/service/lw/inference/v1/models/langid:predict -X POST -d '{"text": "Some sample text in any language that we want to identify\n\n\n"}'

https://github.com/facebookresearch/fastText/issues/1079

More logs with similar requests are available on Logstash.

What happens?:
The service returns a 500 response.

What should have happened instead?:
We should strip special characters from the text.
Separate lines should be concatenated.

I also recommend truncating the string and keeping only the first 50-100 characters, which should be sufficient for language identification.
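The concatenation and truncation proposed above could look roughly like the sketch below. The function name `join_and_truncate` and the default of 100 characters are assumptions made for illustration; nothing in this task confirms that truncation was implemented.

```python
def join_and_truncate(text: str, max_chars: int = 100) -> str:
    """Concatenate separate lines, then keep only the first max_chars characters.

    Hypothetical helper: the name and the default limit are illustrative only.
    """
    # Drop empty lines and join the remaining ones with a single space.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    single_line = " ".join(lines)
    # Slicing never raises, even when the text is shorter than max_chars.
    return single_line[:max_chars]


sample = "Some sample text in any language that we want to identify\n\n\n"
print(join_and_truncate(sample))
```

A plain slice can cut a word in half; cutting at the last space before the limit would be one alternative if that turns out to matter for the model.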

Event Timeline

isarantopoulos triaged this task as Unbreak Now! priority. Oct 22 2024, 2:36 PM
isarantopoulos moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.

Change #1082438 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] langid: normalize text input

https://gerrit.wikimedia.org/r/1082438

Text normalization has been added to the langid model-server and it fixed this issue as shown below:

Before Normalization
root@a49ef1a9119b:/home# curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "Some sample text in any language that we want to identify\n\n\n"}'
HTTP/1.1 500 Internal Server Error
date: Wed, 23 Oct 2024 05:53:45 GMT
server: uvicorn
content-length: 76
content-type: application/json

{"error":"ValueError : predict processes one line at a time (remove '\\n')"}

After Normalization
root@a49ef1a9119b:/home# curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "Some sample text in any language that we want to identify\n\n\n"}'
HTTP/1.1 200 OK
date: Wed, 23 Oct 2024 08:39:00 GMT
server: uvicorn
content-length: 91
content-type: application/json

{"language":"eng_Latn","wikicode":"en","languagename":"English","score":0.4073379337787628}
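The before/after behaviour can be reproduced without the model. In the sketch below, `predict_stub` is a hypothetical stand-in that only mimics the logged error message; it is not the fastText API and not the model-server code.

```python
def predict_stub(text: str) -> str:
    """Hypothetical stand-in for the model call; rejects input containing a newline."""
    if "\n" in text:
        # Same message as the 500 response logged above.
        raise ValueError("predict processes one line at a time (remove '\\n')")
    return "ok"  # placeholder; a real model returns a label and a score


payload = "Some sample text in any language that we want to identify\n\n\n"

# Before normalization: the trailing newlines trigger the error.
try:
    raw_outcome = predict_stub(payload)
except ValueError as exc:
    raw_outcome = f"ValueError : {exc}"

# After normalization: with the newlines replaced by spaces, the call succeeds.
normalized_outcome = predict_stub(payload.replace("\n", " "))

print(raw_outcome)
print(normalized_outcome)
```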

We noticed that keeping only alphanumeric characters removes spaces and punctuation marks, which changes the prediction results as shown below:

1. Language prediction changed from English to Ilocano when spaces were removed

root@a49ef1a9119b:/home# curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "Some random text in any language"}'
HTTP/1.1 200 OK
date: Wed, 23 Oct 2024 13:26:51 GMT
server: uvicorn
content-length: 90
content-type: application/json

{"language":"eng_Latn","wikicode":"en","languagename":"English","score":0.223587766289711}

root@a49ef1a9119b:/home# curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "Somerandomtextinanylanguage"}'
HTTP/1.1 200 OK
date: Wed, 23 Oct 2024 13:27:37 GMT
server: uvicorn
content-length: 91
content-type: application/json

{"language":"ilo_Latn","wikicode":"ilo","languagename":"Ilocano","score":0.357835054397583}

2. Confidence score dropped from 1.0000 to 0.9994 when punctuation marks were removed

root@a49ef1a9119b:/home# curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "¡Hola! ¿Cómo estás?"}'
HTTP/1.1 200 OK
date: Wed, 23 Oct 2024 13:43:09 GMT
server: uvicorn
content-length: 91
content-type: application/json

{"language":"spa_Latn","wikicode":"es","languagename":"Spanish","score":1.0000100135803223}

root@a49ef1a9119b:/home# curl localhost:8080/v1/models/langid:predict -i -X POST -d '{"text": "Hola Cómo estás"}'
HTTP/1.1 200 OK
date: Wed, 23 Oct 2024 13:42:57 GMT
server: uvicorn
content-length: 91
content-type: application/json

{"language":"spa_Latn","wikicode":"es","languagename":"Spanish","score":0.9994144439697266}
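The two altered inputs above can be produced with filters like the ones sketched below. Both function names are hypothetical, and the second filter is only a guess at how the punctuation-free Spanish input was made; the actual code that was tried is not shown in this task.

```python
def keep_alphanumeric(text: str) -> str:
    """Keep only letters and digits; spaces and punctuation are dropped."""
    return "".join(ch for ch in text if ch.isalnum())


def drop_punctuation(text: str) -> str:
    """Keep letters, digits, and whitespace; only punctuation is dropped."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


# str.isalnum() is Unicode-aware, so accented letters such as "ó" are kept.
print(keep_alphanumeric("Some random text in any language"))
print(drop_punctuation("¡Hola! ¿Cómo estás?"))
```

The first filter yields the `Somerandomtextinanylanguage` input and the second yields `Hola Cómo estás`, the two strings sent in the requests above.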

To avoid hurting prediction accuracy, we agreed that normalization should only replace newlines, tabs, and runs of consecutive spaces with a single space. This preserves the overall structure of the text while resolving the prediction errors in the fastText model.
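A minimal sketch of the agreed whitespace-only normalization follows. The function name `normalize_text` and the use of `\s+` are assumptions; `\s+` also matches other whitespace such as carriage returns, which is slightly broader than the rule stated above, and the merged patch may differ.

```python
import re

# One or more whitespace characters: newlines, tabs, spaces, and similar.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Replace every run of whitespace with a single space.

    Hypothetical helper. Letters, digits, and punctuation are left untouched.
    """
    return _WHITESPACE_RUN.sub(" ", text)


print(repr(normalize_text("Some sample text\n\n\nin any\tlanguage")))
print(repr(normalize_text("¡Hola! ¿Cómo estás?")))
```

Trailing newlines collapse to a single trailing space rather than disappearing; a caller that wants them removed could additionally call `.strip()` on the result.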

Change #1082438 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] langid: normalize text input

https://gerrit.wikimedia.org/r/1082438

Change #1082704 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: normalize text input in langid

https://gerrit.wikimedia.org/r/1082704

Change #1082704 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: normalize text input in langid

https://gerrit.wikimedia.org/r/1082704

The new langid image with a model-server that normalizes text input has been deployed:

# pod running in eqiad
kevinbazira@deploy2002:~$ kube_env llm ml-serve-eqiad
kevinbazira@deploy2002:~$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
langid-predictor-default-00010-deployment-6bdfc66fc5-vc2fl   3/3     Running   0          70s

# pod running in codfw
kevinbazira@deploy2002:~$ kube_env llm ml-serve-codfw
kevinbazira@deploy2002:~$ kubectl get pods
NAME                                                         READY   STATUS    RESTARTS   AGE
langid-predictor-default-00008-deployment-865876dd44-bwh7k   3/3     Running   0          30s

# isvc runs successfully
kevinbazira@deploy2002:~$ curl "https://inference.svc.codfw.wmnet:30443/v1/models/langid:predict" -X POST -d '{"text": "Some sample text in any language that we want to identify\n\n\n"}' -H  "Host: langid.llm.wikimedia.org"
{"language":"eng_Latn","wikicode":"en","languagename":"English","score":0.4073379337787628}