Provide better long-term storage for translation models
Closed, Resolved | Public | 8 Estimated Story Points

Description

As part of the exploration of a self-hosted translation service (T331505), the question arose of where to store the models. Currently, the machine learning models for translation are stored at https://people.wikimedia.org/~santhosh/nllb/ and https://people.wikimedia.org/~santhosh/opusmt

However, the ML cluster already uses Swift, which may be a more resilient long-term home for the models.

Details

Related Changes in Gerrit:
Repo                                   Branch      Lines +/-
operations/deployment-charts           master      +4 -4
operations/deployment-charts           master      +3 -9
operations/deployment-charts           master      +8 -1
operations/deployment-charts           master      +1 -1
mediawiki/services/machinetranslation  master      +13 -8
operations/deployment-charts           master      +1 -1
operations/deployment-charts           master      +8 -0
mediawiki/services/machinetranslation  master      +146 -126
operations/puppet                      production  +6 -5
labs/private                           master      +1 -1
operations/puppet                      production  +6 -5
labs/private                           master      +1 -1
operations/puppet                      production  +9 -2
operations/deployment-charts           master      +8 -4
operations/deployment-charts           master      +16 -1
mediawiki/services/machinetranslation  master      +128 -115
operations/deployment-charts           master      +2 -1
operations/puppet                      production  +5 -0
labs/private                           master      +1 -0
operations/deployment-charts           master      +4 -0

Event Timeline

Assigning this to myself, I'll need help here @elukey :)

@KartikMistry definitely yes, I added most of my thoughts earlier on but feel free to ping me anytime if you need help!

KartikMistry changed the task status from Open to In Progress. Jan 22 2025, 11:22 AM
Nikerabbit set the point value for this task to 8. Jan 22 2025, 2:52 PM
Nikerabbit raised the priority of this task from Medium to High. Feb 17 2025, 8:55 AM
Nikerabbit changed the task status from In Progress to Stalled. Mar 3 2025, 9:17 AM
Nikerabbit changed the task status from Stalled to In Progress. May 5 2025, 8:35 AM
Nikerabbit changed the status of subtask T386889: MinT: Deployment timeouts for eqiad from In Progress to Stalled.

With T391958, we've migrated from people.wikimedia.org to https://analytics.wikimedia.org/published/wmf-ml-models/mint/ for external storage. Internally, models will use s3 storage.
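
A minimal sketch of how a startup script could tell the two storage locations apart, assuming the model location is passed as a URL whose scheme is either `s3://` (internal) or `https://` (external). `downloader_for` is a hypothetical helper for illustration, not code from the machinetranslation repository, and the thread does not say the entrypoint is structured this way.

```shell
#!/bin/sh
# Hypothetical helper: print which kind of downloader a model URL needs.
# Returns non-zero for schemes this sketch does not know about.
downloader_for() {
    case $1 in
        s3://*)              echo "s3cmd" ;;
        https://*|http://*)  echo "http" ;;
        *)                   echo "unsupported URL: $1" >&2; return 1 ;;
    esac
}
```

A caller would branch on the printed value, for example running s3cmd for `s3cmd` and an HTTP client for `http`.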

Change #1147812 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] entrypoint: Update models download URL

https://gerrit.wikimedia.org/r/1147812

Change #1147812 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] entrypoint: Update model storage download

https://gerrit.wikimedia.org/r/1147812

Change #1149391 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] charts: Add secrets template for Machine Translation

https://gerrit.wikimedia.org/r/1149391

Change #1149391 merged by jenkins-bot:

[operations/deployment-charts@master] charts: Add secrets template for Machine Translation

https://gerrit.wikimedia.org/r/1149391

With the above patch (and the private repo stuff) merged, we can diff on the deployment server (I elided some unrelated changes):

# cd /srv/deployment-charts/helmfile.d/services/machinetranslation/
# helmfile -e eqiad -i diff --context=3
skipping missing values file matching "values-eqiad.yaml"
skipping missing values file matching "values-production.yaml"
Comparing release=production, chart=wmf-stable/machinetranslation, namespace=machinetranslation
machinetranslation, machinetranslation-production-secret-config, Secret (v1) has been added:
+ # Source: machinetranslation/templates/secret.yaml
+ apiVersion: v1
+ kind: Secret
+ metadata:
+   labels:
+     app: machinetranslation
+     chart: machinetranslation-0.0.23
+     heritage: Helm
+     release: production
+   name: machinetranslation-production-secret-config
+ data:
+   AWS_ACCESS_KEY_ID: '++++++++ # (18 bytes)'
+   AWS_SECRET_ACCESS_KEY: '++++++++ # (16 bytes)'
+ type: Opaque

Thanks!

Change #1159696 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] WIP: machinetranslation: Use s3 storage for production

https://gerrit.wikimedia.org/r/1159696

@klausman is there any reason why we can't see the following in the diff in staging?

+ data:
+   AWS_ACCESS_KEY_ID: '++++++++ # (18 bytes)'
+   AWS_SECRET_ACCESS_KEY: '++++++++ # (16 bytes)'

I am not sure. As far as I can see, the secrets are wired up independently of clusters (e.g. both eqiad and codfw show a correct diff, and the diff for staging has everything except the actual secrets), so I suspect it's something in how the chart works. I will investigate.

Found it: the secrets were not wired up for staging because I had a brain fart when setting that up. It's been fixed in the private repo with commit 7bc13c5d2 (https://gerrit.wikimedia.org/r/c/labs/private/+/1160032 on the pseudo-private one), and staging now shows the correct diff:

machinetranslation, machinetranslation-staging-secret-config, Secret (v1) has been added:
+ # Source: machinetranslation/templates/secret.yaml
+ apiVersion: v1
+ kind: Secret
+ metadata:
+   labels:
+     app: machinetranslation
+     chart: machinetranslation-0.0.23
+     heritage: Helm
+     release: staging
+   name: machinetranslation-staging-secret-config
+ data:
+   AWS_ACCESS_KEY_ID: '++++++++ # (18 bytes)'
+   AWS_SECRET_ACCESS_KEY: '++++++++ # (16 bytes)'
+ type: Opaque

Works fine. Thanks!

Change #1159696 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Use S3 storage for production

https://gerrit.wikimedia.org/r/1159696

Change #1162952 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] peopleweb: add KUBEPOD ranges to firewall

https://gerrit.wikimedia.org/r/1162952

Change #1162952 merged by Jelto:

[operations/puppet@production] peopleweb: add KUBEPOD ranges to firewall

https://gerrit.wikimedia.org/r/1162952

Change #1163291 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] entrypoint: Update model storage download

https://gerrit.wikimedia.org/r/1163291

Change #1163739 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] hiera/k8s: Add missing :prod suffix to machinetranslation S3 credentials

https://gerrit.wikimedia.org/r/1163739

Change #1163739 merged by Klausman:

[labs/private@master] hiera/k8s: Add missing :prod suffix to machinetranslation S3 credentials

https://gerrit.wikimedia.org/r/1163739

Change #1164235 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera/thanos-swift: Fix MinT user

https://gerrit.wikimedia.org/r/1164235

Change #1165901 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::thanos::swift: rework machinetranslation account

https://gerrit.wikimedia.org/r/1165901

Change #1165901 abandoned by Elukey:

[operations/puppet@production] profile::thanos::swift: rework machinetranslation account

https://gerrit.wikimedia.org/r/1165901

Change #1166543 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] WIP: machinetranslation: Use s3 for model download in staging

https://gerrit.wikimedia.org/r/1166543

Change #1166754 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] hiera/deployment-server: change name of MT AWS user

https://gerrit.wikimedia.org/r/1166754

Change #1166754 merged by Klausman:

[labs/private@master] hiera/deployment-server: change name of MT AWS user

https://gerrit.wikimedia.org/r/1166754

Change #1164235 merged by Klausman:

[operations/puppet@production] hiera/thanos-swift: Fix MinT user

https://gerrit.wikimedia.org/r/1164235

Mentioned in SAL (#wikimedia-operations) [2025-07-07T10:24:08Z] <Emperor> remove swift-account-stats_machinetranslation:prod time & service from thanos-fe1004 T335491

Change #1163291 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] entrypoint: Update model storage download

https://gerrit.wikimedia.org/r/1163291

Change #1166543 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Use s3 for model download in staging

https://gerrit.wikimedia.org/r/1166543

Change #1167328 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] machinetranslation: Remove extra / from s3 URL

https://gerrit.wikimedia.org/r/1167328

Change #1167328 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: Remove extra / from s3 URL

https://gerrit.wikimedia.org/r/1167328
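
The change above removes a stray `/` from the configured S3 URL. As a hedged sketch of one way to make that class of mistake harmless, the function below joins a base URL and a path so that exactly one `/` separates them. `join_url` is a hypothetical helper, and the bucket name in the usage line is a placeholder, not the real one.

```shell
#!/bin/sh
# Hypothetical helper: join a base URL and a relative path with exactly one "/".
join_url() {
    base=$1
    path=$2
    # Strip every trailing "/" from the base.
    while [ "${base%/}" != "$base" ]; do base=${base%/}; done
    # Strip every leading "/" from the path.
    while [ "${path#/}" != "$path" ]; do path=${path#/}; done
    printf '%s/%s\n' "$base" "$path"
}
```

For example, `join_url "s3://example-bucket/" "/models/nllb"` yields the same result whether or not either argument carries the extra slash.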

Change #1167448 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] entrypoint.sh: Configure config path properly for s3cmd

https://gerrit.wikimedia.org/r/1167448

Change #1167448 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] entrypoint.sh: Configure config path properly for s3cmd

https://gerrit.wikimedia.org/r/1167448

Change #1167608 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] machinetranslation: staging: Update MinT to 2025-07-09-124154-production

https://gerrit.wikimedia.org/r/1167608

Change #1167608 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslation: staging: Update MinT to 2025-07-09-124154-production

https://gerrit.wikimedia.org/r/1167608

Status update:

We're testing entrypoint.sh in staging (using values-staging.yaml). Currently, s3cmd still fails to find or parse the config file.

[2025-07-10 05:39:27] s3cfg written to models/.s3cfg
ERROR: Configuration file not available.
ERROR: models/.s3cfg: None

Running with the --debug flag locally shows that the file is parsed, but the error still appears.

$ s3cmd --debug --config .s3cfg ls
DEBUG: s3cmd version 2.4.0
DEBUG: ConfigParser: Reading file '.s3cfg'
DEBUG: ConfigParser: access_key->${...17_chars...}
DEBUG: ConfigParser: secret_key->${...21_chars...}
DEBUG: ConfigParser: host_base->https://thanos-swift.discovery.wmnet
DEBUG: ConfigParser: host_bucket->https://thanos-swift.discovery.wmnet
DEBUG: ConfigParser: use_https->True
DEBUG: ConfigParser: signature_v2->False
ERROR: .s3cfg: None
ERROR: Configuration file not available.
ERROR: Consider using --configure parameter to create one.
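
One possible reading of the debug output above, not confirmed anywhere in this thread: the masked access_key and secret_key values begin with `${` and end with `}`, which would be consistent with literal `${AWS_ACCESS_KEY_ID}` and `${AWS_SECRET_ACCESS_KEY}` placeholders having been written into the file without being expanded. The sketch below assumes that reading. `write_s3cfg` is a hypothetical helper, not the actual entrypoint.sh code; the config keys are the ones visible in the debug output, and the host value is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: render an s3cmd config file from environment variables.
# Usage: write_s3cfg CONFIG_PATH HOST
write_s3cfg() {
    cfg_path=$1
    host=$2
    # Fail early if either credential is missing from the environment.
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY not set" >&2
        return 1
    fi
    # The heredoc delimiter is unquoted, so the shell expands the variables
    # and the file receives real values rather than placeholders.
    cat > "$cfg_path" <<EOF
[default]
access_key = ${AWS_ACCESS_KEY_ID}
secret_key = ${AWS_SECRET_ACCESS_KEY}
host_base = ${host}
host_bucket = ${host}
use_https = True
signature_v2 = False
EOF
    # The file holds credentials, so restrict it to the owner.
    chmod 600 "$cfg_path"
    # Guard: refuse a file that still contains a literal "${".
    if grep -qF '${' "$cfg_path"; then
        echo "unexpanded placeholder in $cfg_path" >&2
        return 1
    fi
}
```

Checking the environment before writing the file turns a confusing s3cmd error into an explicit message about which precondition failed.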

Change #1167742 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] machinetranslation: add snippet to fetch private env variables

https://gerrit.wikimedia.org/r/1167742

Change #1167742 merged by Elukey:

[operations/deployment-charts@master] machinetranslation: add snippet to fetch private env variables

https://gerrit.wikimedia.org/r/1167742

Change #1167854 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] machinetranslationt: Use s3 model storage for production

https://gerrit.wikimedia.org/r/1167854

Update: The staging server is now running with S3 model storage, and we are observing the logs and startup time. The patch for production is ready to deploy.

Change #1167854 merged by jenkins-bot:

[operations/deployment-charts@master] machinetranslationt: Use s3 model storage for production

https://gerrit.wikimedia.org/r/1167854

Mentioned in SAL (#wikimedia-operations) [2025-07-14T12:34:34Z] <kart_> machinetranslationt: Use s3 model storage for production (T335491)

Since the logs look fine, there is nothing specific to QA for this task. Thanks to all for the help!

Hello,

I see this ticket is resolved now. I have been watching it to find out whether, once this is in use, you still need the people hosts to store large files.

Is that the case now, and would it be OK to delete those large files from the home directories (of Kartik and Santhosh)?

Thanks,

Daniel

@Dzahn Yes. We can remove the MinT models from our home directories on people.wikimedia.org.

Yay, thank you @KartikMistry ! :) (this means I should not get warning emails anymore and we can move forward with a low prio but stalled task, appreciate it).