
Create a new S3 bucket for MinT
Closed, Resolved | Public | 3 Estimated Story Points

Description

Hi folks!

@KartikMistry reached out to me for T335491, I suggested the following, lemme know your thoughts:

  1. Rather than using Thanos Swift, we should ask Data Persistence if they can create a new S3 bucket on APUS (the same data storage that Lift Wing will have to migrate to, sooner or later).
  2. Once the bucket is created, they will need credentials deployed on statXXXX to be able to push their model when a new one is available (without SRE intervention).
  3. They also need to be able to publish it somewhere for community consumption, so something like https://analytics.wikimedia.org/published/wmf-ml-models/ is probably the best compromise.

Ideally we could re-use the automation created for the ML use case, but they may not need everything so I'll let you decide :)

Event Timeline

Alternative discussed on IRC:

<kart_> isaranto: anything for https://phabricator.wikimedia.org/T391958 ? :)
<isaranto> o/ kart_ sorry for never responding 
<isaranto> kart_: do you have a specific timeline? or are you looking for a more long-term solution as the connected task mentions? The reason why I'm asking is that we haven't migrated to APUS yet so it would be a first for us
<kart_> The main task is: https://phabricator.wikimedia.org/T335491
<kart_> But, feel free to update T391958 as well, as it is specific.
<elukey> before the move to apus we could try to figure out if the mint model would go alongside the "ml-team" ones or not
<elukey> for example, do we need a different bucket with different credentials etc..?
<elukey> or do we use the same?
<elukey> and the follow up question is about how to publish it externally - we already have a solution, etc..
<kart_> +1
<elukey> kart_: atm the workflow is the following, for any model - 1) the model "owner" shares the binary with a member of the ml-team via a "trusted" source (drive, stat boxes, etc..) sharing also the SHA512 of the binary separately 2) the ml-team retrieves the binary, checks the SHA and uploads it to S3 and to the DE UI for external consumption
<elukey> the idea is to control as much as possible what gets published, especially because we share it to the community
<elukey> and also to avoid messing up with prod :D
<elukey> if the workflow works with you, it should be easy to integrate
<elukey> and it could work right now without apus
<elukey> but of course you'd need the above steps when releasing a new version
<elukey> I'd argue that a single process is beneficial for everybody, and more checks are good :)
<elukey> so we don't start to have multiple ways to upload binaries etc..
<elukey> isaranto: --^
<kart_> Most of our models won't change frequently or at all.
<kart_> So, it is probably one time setup as of now.
<elukey> then I think it should work
<kart_> Yes
<elukey> let's see what the ML team thinks about!
<kart_> Sure!
<isaranto> so MinT service would require read only access to that bucket (not sure how it would happen). I agree then that we can just use swift as is then 
<isaranto> Since MinT is going to have read only access  and they don't change often I'd suggest that we put them under wmf-ml-models for now.
<kart_> wmf-ml-models should be fine.
<isaranto> klausman: does the above sound ok? What would MinT need to be able to fetch the models?
<klausman> Yeah, I think read-only access is probably relatively easy to set up, though I am not super familiar with how that is set up on the Swift server side
<elukey> klausman: o/ we have the mlserve users in hieradata/common/profile/thanos/swift.yaml
<elukey> and then the password is wired to the kserve pods via puppet private
<elukey> I have realized though that we may not be using the read only credentials for kserve though
<elukey> I am checking hieradata/role/common/deployment_server/kubernetes.yaml
<elukey> no sorry we are
<elukey> AWS_ACCESS_KEY_ID: mlserve:ro
<klausman> ah, that _ -> : conversion tripped me up
<elukey> so it should be easy, data persistence manages the thanos swift cluster, I think there is a procedure to create the new account if needed
<elukey> this is the prev task https://phabricator.wikimedia.org/T311628
<elukey> I don't recall though if there is a specific procedure to generate the password, or if it is in puppet private
<elukey> yes ok it is in hieradata/common/profile/thanos/swift.yaml (private)
<elukey> so, to recap, IIUC/IIRC: 1) contact data persistence to get the green light 2) update puppet private 3) update public puppet 4) update the rest in k8s
<kart_> elukey: possible to log these in the task as well? I keep forgetting IRC ;)

Change #1140118 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] thanos/swift: add user for Mint, with r/o access

https://gerrit.wikimedia.org/r/1140118

All MinT models are available at stat1008:/home/kartik/models. Note that we generally preserve the models subdirectory to separate them.

df07c9f2322105f797bd9f6e64e29c7bb2df9ffdc45e681ddf61c5a710cc3fcf34ec5a51c98492f2b4487bf20b0afdcdefe3ad01c0ed24e78a6e876110f38c40  ./nllb/nllb200-600M.tgz
37be5a8ff880d370d140216dfa58a369b669f440d049698e829cbee3d494c9ec42c5f77d1fccf15e5cd57f5195e3e2ad8a7b92602d2e0d1baecab4a50e82beab  ./nllb/nllb-wikipedia.tgz
07f58f9d22aa8328439188ca8bf224bcb449c3ee58b453c1a25694f6f5b20313213e823c052c49bd5ce79f209f4aa976d4a727f6530aaa4d1861972a434f0296  ./madlad400/madlad400-3b-ct2.tgz
5c1eece1e03452ef3aac637804d01f7342d256f1bdb4680ae8f53c79d25435b379aedf08664f2a4fb8494544288ae0ee1da0557724b6668844430ed261798283  ./softcatala/softcatala-de-ca.zip
7e28f85b1718b5bc400161df72ca8e2234714a7d2b0eb1046113ab5efa15a39d7b45bb7299fc6f7ac054dc52a9f880f840687e9432edad2791b955e6682f5bf2  ./softcatala/softcatala-gl-ca.zip
e011fd1c923523394dca6f8818dcd3828ed0f32b6b257eca417c2b590982fe6d9034743a97c6486e7bc3c8c8e6c60d49f021f7415a37cdf35651d0caf8c317c0  ./softcatala/softcatala-ca-es.zip
670d304b8ff1834f66e497fd18e69795769f96d59edd554fab57a79db6182b2ed1f7d2f1c8ae57c6af00c7e8634c9ccd073c53eaa4cfee5d0ce922c0ad2bcbd5  ./softcatala/softcatala-ca-it.zip
4a9ef8d8b6d89b509c343553affc7fea6fe9912ec5372c5e225085417a459274ccaf5ea3af95d962c8cd6eb69bb2cd3920ffc1b2d0b92fa6b30518252ec9ed9c  ./softcatala/softcatala-pt-ca.zip
4d14fec95fdabbe65ba7e4bda49fe62b7040398d54db3c753d531bf83a6dcfc889ec9c9b57d402ed79e6310736f94fbfa209d17d00875fb2dd6f729ad717634e  ./softcatala/softcatala-ca-pt.zip
0275c360c5b181c4b64e190c282114794e0580c05196bfb9f0127be5fb706192551f44d703f1546ea00dfdd1192283e55449a435098d784622748a5eccef7101  ./softcatala/softcatala-nl-ca.zip
b9b693961619ddf7b2398242ed16a6fd1d1141d8f43dcaef22fb1df1cdc9d0cadcbd7d672ef62c92282410729cd49b8d25c1915c6649a62b3bd3a502e6211a67  ./softcatala/softcatala-it-ca.zip
26820189b28c06abb1b00f982d923bea8690fa7ec7196648d095a01afadc5539680e983bc48498c9c3ea453312b1cc7c279ec9caae6d49d817774d4bed1fb321  ./softcatala/softcatala-ca-de.zip
c23edbf3dd9f44bbe21d835aa262831128603ddceeed1b915ef1f2a6ee5a8e1d9d4dc06416fe7b764f17d5ad2b5ad5c8bad13047f952432c00b4a44781b13f23  ./softcatala/softcatala-ca-gl.zip
92378ffb4f2ee89f6fe6536e7df990b039dc417c8e50235cbb6ec57517bb0bb53969095ef2d7eb088b366fbd78ebef270b0b0d619d56df6832e2a373dc4de74b  ./softcatala/softcatala-ca-fr.zip
aa294ab71174fb351e6ae561c29e9757a8d37c19bce1e9dcb8fa070c6f2bc2a7aafd761e1734fab00bf82b927e61d9979d271d4f8319b34b20dce58055a0da5f  ./softcatala/softcatala-es-ca.zip
0b0466a60d217685e22360510ee37e5f9a6ff0a302b61f5668ea87f1ac6ea973835d0b49732d9a8580cd5e2cbed466e0edf92a0d12208a78366c24bda46c05e7  ./softcatala/softcatala-en-ca.zip
4c9652dfc48a221ba79092d453617e919c6affe8a6aa37a770f64b1f49a7e2b0a0fcc6f6c7ca2a7257a257dc5b1f3ed1874ce63c2653329bdbdb1c5c180dad1e  ./softcatala/softcatala-ca-ja.zip
40c833400869fdca7d44d60aae3428af6e839ab74c01569ef4b6b213700f387fec8d2e7a7c42bfe052ca31862ee82a475afb881025f8bda83f999ada1e8fa9fc  ./softcatala/softcatala-fr-ca.zip
06502ea9e261ca2c5d939580b857f38da60729e7411e4f84a2e8eb48c7345c9c466699861dfc335bbe4a2923672ab731061159e922b7c4d33fb36218cee8c065  ./softcatala/softcatala-ca-oc.zip
cd0448da5e718d29db3479f5e9529a5b7558254a5b4f4d483af3c0b62a7255f3b137593bc0e5a49f1ff91cb45a2859fbfe56617b4b00c76b7e217b741392099d  ./softcatala/softcatala-ja-ca.zip
e99c093a25276706e2cce6f46cf15ae2abc64350031cfbd521576787bdc8a66692ef079c79611f426dd4f8088b8feb543354e474158a74a4ce3529f820077ed2  ./softcatala/softcatala-ca-nl.zip
2b1f549e4167b88e0aca63790427d1c119fb75abb6cc5ff728135a16c61bf00ec49ee19b0644b3bb4326518c97db55351888028937029b5fb2e408a0a64d22fe  ./softcatala/softcatala-oc-ca.zip
9d7350975b027397caac0285c46cd0dd9a99106e2ec1cea6e1bd16e593d51a1e4f5763310cb1f142ea6fd21cc2d049239cea8e6ece33f611933e3b647e47c47a  ./softcatala/softcatala-ca-en.zip
c3b812ba9267e3f849fa1a417b0ef14380511c307854abf5164b7b1641f96358a139abc5848a5f30127e85de7126685f2898f344db657ea73c04a1d0811caa9e  ./indictrans2/indictrans2-indic-indic.tgz
19f0450a00f10900f463a85b543a687368e36c31ceb512a4efae5509b92779b9fa0e9411cfc335fa93e5ed281ae509cb3d441209ceaf1146ab47e07b9ff6bf88  ./indictrans2/indictrans-en-indic.tgz
880eea781200a194a5cc33c7ab3f2d4b6666c481d33f0b7e4768fb05443c2fe6de76bdb916b7295249f0b5c381f9722c3c7f44754f82cc3525b9cc5eaea772e9  ./indictrans2/indictrans-indic-en.tgz
cd1e697f1b390624238eae60841d2979f7cec518666075630f9c5e040e86f95cd24fdcbef3d94af89fbd9eb9dcffd7566070a85df78597b0a6a5fba7f9a93d7a  ./opusmt/opusmt-en-to.zip
431ee96e9af09eca451e063474319aa7bd11002a8dc5d71e6f53e3907d46cc65c19dbdffbfdae8582efadcaf53b22b5efefc07e15fd7fcffb09c401f73ba91b1  ./opusmt/opusmt-en-gv.zip
ae0efc47fcab8af79f3127600af0a169161f82547f54ef35b6eb8e91d0925c65a012188afe8af064b1d4a1ed367e49cdb7f26abd57e9113ff1be5b12ae96fce3  ./opusmt/opusmt-fr-ty.zip
94c4279cbf701d975f4a56a300b0d14d7a59eac73088b5a0fb9e3b3828b25224c779ea9020b502bf98303442727716741d7bd2931324f6d41fb6797fc05c1109  ./opusmt/opusmt-en-chr.zip
69607b1ac9ce1f9aa91a57ac77b91a8d3acbe89eeb9aa80964c4fa74dc3454f255bc2eeaa770bae00812607fe5d1e0f4952280060f437e4c884ae2be205d34c2  ./opusmt/opusmt-en-srn.zip
9b35d447565ae587a0e355acff8a7f198e242069a585053a0fbf43aedf8b5f4394fe1c67d5c34df27a34dfbe0cf91ab60165d2825600ef4cb2ac17702aeca1d2  ./opusmt/opusmt-en-bcl.zip
c60d6a4034e3c87fabfc78f35e0e1cdfef7ee61357f7459030333f88befbb52164408e34668eceb7b2a09c023359f9b9bf8a44a81f2d0b0ac282cf87792c14a7  ./opusmt/opusmt-en-bi.zip
ca3691aed908a37b9879eeeffd8fa298ae2b6aacd4ed21685ded1bef5a499b0e1026a9b345896e059581a17ec72668967648baedaf7ac1425f5be0e9df3a6c1b  ./opusmt/opusmt-en-ve.zip
a3c0a4e110c584468f8286b0c6a2933aaf3207cb3c72375da87c00b9af3c30859c8129f515bd68462fc158a3c5f978105a18d3e5a591bbb987677438562b39d4  ./opusmt/opusmt-en-ty.zip
5352d47bd6b29d8ca0ac817d6578aa8521cb30fe36a643cb887e925ea6fade93b4994a9f8affd95cda81ff5591c81923a9edd29de81f57944d4df314ef8b1b39  ./opusmt/opusmt-en-fr-br.zip
4ddc8f37389b4c749afcd0eeb15caff162ebc1be4c4392a1fd373e6605d17469769ffbc430cd76c10be074c183432ec4f6431a79d284aecbd41f1eadf222c396  ./opusmt/opusmt-sv-fi.zip
6e8dc6648eef45d615047f6db05e770204eaf7556d38cada3b0f94ee6e77033b094bc10c8a5e428c8784f198cc988a85df9247f474c81282ee87899f233e620e  ./opusmt/opusmt-en-guw.zip
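
For anyone re-checking the binaries against the list above, here is a minimal Python sketch. It assumes the list is saved as a sha512sum-style manifest (one "digest  relative/path" entry per line). The function names `sha512_of` and `verify_manifest` are hypothetical and not part of any existing ML team tooling.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, base_dir: Path) -> list[str]:
    """Return the manifest paths that are missing or whose digest differs."""
    mismatches = []
    for line in manifest.read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, rel_path = parts
        # sha512sum prefixes the path with "*" in binary mode.
        rel_path = rel_path.strip().lstrip("*")
        target = base_dir / rel_path
        if not target.is_file() or sha512_of(target) != expected:
            mismatches.append(rel_path)
    return mismatches
```

Calling `verify_manifest(Path("sha512sums.txt"), Path("/home/kartik/models"))` would return an empty list when every file matches; the same check can be done with `sha512sum -c sha512sums.txt` from inside the models directory.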

Change #1140118 abandoned by Klausman:

[operations/puppet@production] thanos/swift: add user for Mint, with r/o access

Reason:

We will use an existing user as discussed on this change.

https://gerrit.wikimedia.org/r/1140118

isarantopoulos moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.
isarantopoulos set the point value for this task to 3.
kevinbazira moved this task from Ready To Go to In Progress on the Machine-Learning-Team board.

Hi @KartikMistry, we have uploaded all MinT models to a swift bucket:

$ ls /home/kartik/models
indictrans2  madlad400	nllb  opusmt  sha512sums.txt  softcatala
$
$ s3cmd -c /etc/s3cmd/cfg.d/ml-team.cfg ls -H s3://wmf-ml-models/mint/20250514081434/
    DIR  s3://wmf-ml-models/mint/20250514081434/indictrans2/
    DIR  s3://wmf-ml-models/mint/20250514081434/madlad400/
    DIR  s3://wmf-ml-models/mint/20250514081434/nllb/
    DIR  s3://wmf-ml-models/mint/20250514081434/opusmt/
    DIR  s3://wmf-ml-models/mint/20250514081434/softcatala/
2025-05-14 08:17   969   s3://wmf-ml-models/mint/20250514081434/sha512sums.txt

They can also be accessed via the public wmf model repo: https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434/

@kevinbazira there is something odd with the SHA512 in https://analytics.wikimedia.org/published/wmf-ml-models/mint/20250514081434, I see only sha512sums.txt but no .sha512 file (see for example https://analytics.wikimedia.org/published/wmf-ml-models/articlequality/fawiki/20221107044250/). I don't recall the bash script exactly, but I suspect that it doesn't handle the multiple subdirs under 20250514081434 and may assume that the model binaries sit directly in that directory.

Ideally we should provide the .sha512 files for all the binaries, like it was done in https://phabricator.wikimedia.org/T391958#10794576.

@elukey the sha512sums.txt is based on /home/kartik/models/sha512sums.txt. I did check the SHA of the files and they match with T391958#10794576 e.g:

$ sha512sum /home/kartik/models/nllb/nllb200-600M.tgz
df07c9f2322105f797bd9f6e64e29c7bb2df9ffdc45e681ddf61c5a710cc3fcf34ec5a51c98492f2b4487bf20b0afdcdefe3ad01c0ed24e78a6e876110f38c40  /home/kartik/models/nllb/nllb200-600M.tgz

I will also add the .sha512 files for all the files.

@kevinbazira the sha512 files are automatically created by the script that uploads the model binaries; it is stored in the puppet repo: ./modules/profile/templates/statistics/explorer/ml/model_upload.sh.erb

As far as I can see you used s3cmd put directly, which is not what is documented in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy#How_to_upload_a_model_to_Swift

The script also takes care of publishing the files in the right statXXXX dir etc., following our conventions. It doesn't support a big directory tree of model files, so rather than doing things by hand we should improve the script (we'll surely get another request like this in the future, and the less we do by hand the better).

For this use case it should be fine for you to upload the sha512 files manually, but the script needs to be fixed and used in the future :) I'd also propose switching to Python to have fewer bash constraints.
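
As a rough illustration of what a Python version could do for nested model directories, the sketch below walks a directory tree and writes a sidecar checksum file next to every binary. The function name `write_sha512_sidecars`, the `<file>.sha512` naming, and the sha512sum-style sidecar content are assumptions made for this example; they are not taken from model_upload.sh.erb, and the upload and publishing steps are left out.

```python
import hashlib
from pathlib import Path


def write_sha512_sidecars(model_root: Path) -> list[Path]:
    """Write '<file>.sha512' next to every file under model_root (recursively).

    Existing .sha512 files are skipped so the function can be re-run safely.
    Note that every other regular file gets a sidecar, manifests included.
    """
    created = []
    # sorted() materialises the listing before any sidecar is written.
    for path in sorted(model_root.rglob("*")):
        if not path.is_file() or path.suffix == ".sha512":
            continue
        digest = hashlib.sha512()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        sidecar = path.with_name(path.name + ".sha512")
        # Assumed content: same "digest  filename" layout as sha512sum output.
        sidecar.write_text(f"{digest.hexdigest()}  {path.name}\n")
        created.append(sidecar)
    return created
```

Running it on a copy of the models directory would leave files such as nllb/nllb200-600M.tgz.sha512 beside each archive, ready to be uploaded together with the binaries.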

@elukey thank you for creating the task to create the new script.

I've added .sha512 files for all MinT models in the public repo.

@kevinbazira @elukey Thanks a lot for the help and all the work on this!