During MinT deployment, I noticed that MinT is not able to download any files/models from peopleweb.discovery.wmnet/~santhosh, where we have stored all the models.
Similar timeouts can be noticed in production as well.
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | santhosh | T331505 Self hosted machine translation service |
| Resolved | BUG REPORT | KartikMistry | T386889 MinT: Deployment timeouts for eqiad |
| Resolved | | KartikMistry | T335491 Provide better long-term storage for translation models |
| Resolved | BUG REPORT | Samwilson | T384555 MinT translation fails with 503 error |
| Resolved | BUG REPORT | KartikMistry | T383750 MinT: Fails to download models/files from peopleweb.discovery.wmnet |
This blocks further deployments of MinT and probably means downtime for MinT too, so I'm setting it to Unbreak Now! priority.
Can we treat this as a good opportunity to get back to https://phabricator.wikimedia.org/T335491 and migrate the models to object storage? The apus Ceph cluster with an S3 interface is now available and this looks like a reasonable use case for it (cc @MatthewVernon).
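For illustration only, here is a minimal sketch of what the consumer side could look like against an S3-compatible endpoint; the endpoint URL, bucket, key layout and credentials are placeholders, not the real apus/production values:

```python
# Hypothetical sketch: fetch a model from an S3-compatible object store
# instead of peopleweb. Endpoint, bucket, key and credentials are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.example.wmnet",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",                      # injected from secrets in practice
    aws_secret_access_key="SECRET_KEY",
    config=Config(signature_version="s3v4"),
)

# boto3 splits large downloads into ranged parts internally, so there is no
# single multi-gigabyte HTTP response to be cut off part-way through.
s3.download_file(
    Bucket="mint-models",                         # placeholder bucket
    Key="nllb/nllb200-600M/model.bin",            # placeholder key layout
    Filename="/models/nllb200-600M/model.bin",
)
```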
people.wikimedia.org "hosts some of the user public files" and should not be in the critical path for production services so I don't think this qualifies as Unbreak Now.
Attempting a local download:
the download does indeed fail consistently, and is logged as a 200:
```
2025-01-16T09:22:49 65001309 -/200 2464511302 GET http://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin - application/octet-stream - Wget/2.2.0 - - - - d66cd6e8-46cd-4e23-910f-5d8477edefa0
```
curl gives a bit more information about the download termination, apache still sends http/200:
```
curl: (92) HTTP/2 stream 1 was not closed cleanly: CANCEL (err 8)
HTTP Response Code: 200
```
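As an aside, a minimal client-side sketch (not MinT's actual downloader) can at least detect this kind of truncated-but-200 response by comparing Content-Length with the bytes actually received; depending on the HTTP stack, the early close may instead surface as a connection error, which the caller can treat the same way:

```python
# Minimal sketch: detect a body truncated despite an HTTP 200 by comparing
# Content-Length with the bytes actually received.
import requests

url = "https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin"

with requests.get(url, stream=True, timeout=(10, 600)) as resp:
    resp.raise_for_status()
    expected = int(resp.headers.get("Content-Length", 0))
    received = 0
    with open("model.bin", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)
            received += len(chunk)

if expected and received != expected:
    raise RuntimeError(f"truncated download: got {received} of {expected} bytes")
```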
The thanos swift cluster has had an account ready to go for this since June 2023.
[which is not to say this wouldn't be an apus use case, but if it has become urgent, that might not be the best first-user-of-the-system]
I tried downgrading to HTTP/1.1 and added a bit more verbosity: P72105
with the same result (the download fails with an HTTP 200). It seems to fail at a different offset every time, as the number of missing bytes varies from one attempt to the next.
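One possible client-side workaround, sketched below with assumed names (this is not MinT's code and relies on the server honouring Range), is to resume the transfer with HTTP Range requests so that no single connection has to carry the whole file:

```python
# Illustrative workaround sketch: resume a partial download with Range requests.
import os
import requests

def resumable_download(url: str, dest: str, max_attempts: int = 20) -> None:
    total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    for _ in range(max_attempts):
        have = os.path.getsize(dest) if os.path.exists(dest) else 0
        if have >= total:
            return
        try:
            with requests.get(url, headers={"Range": f"bytes={have}-"},
                              stream=True, timeout=(10, 120)) as resp:
                resp.raise_for_status()
                # if the server ignored the Range header, start from scratch
                mode = "ab" if resp.status_code == 206 else "wb"
                with open(dest, mode) as out:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        out.write(chunk)
        except requests.RequestException:
            pass  # retry from whatever offset we reached
    raise RuntimeError(f"could not complete {url} after {max_attempts} attempts")
```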
Oddly, compressing the file fixes the issue, while a duplicate of the same file reproduces it consistently:
```
$ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin
HTTP response 200  [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin]
/dev/null            79% [=============================>        ] 1.81G  28.22MB/s
[Files: 1  Bytes: 1.81G [27.90MB/s]  Redirects: 0  Todo: 0  Errors: 0]

$ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.tar.gz
HTTP response 200  [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.tar.gz]
/dev/null           100% [======================================] 1.34G  28.78MB/s
[Files: 1  Bytes: 1.34G [27.99MB/s]  Redirects: 0  Todo: 0  Errors: 0]

$ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin.1
HTTP response 200  [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin.1]
/dev/null            79% [=============================>        ] 1.82G  28.18MB/s
[Files: 1  Bytes: 1.82G [27.94MB/s]  Redirects: 0  Todo: 0  Errors: 0]
```
Has anything changed recently that could be causing this issue? The MinT service downloads models on each restart/deployment.
I think this is due to the file size; I generated an empty 2 GB test file and it has the same issue around the same offset (~1.80 GB). The original file is dated Mar 6 2023, so I'm guessing something has changed in the server's configuration since then. I have found nothing yet; I'll keep digging.
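For reference, a test file like that can be produced without actually writing 2 GB of data; the file name below is a placeholder:

```python
# Placeholder file name; truncate() allocates a sparse 2 GB file, which is
# enough to reproduce the failing transfer size.
with open("test-2g.bin", "wb") as f:
    f.truncate(2 * 1024 ** 3)
```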
@KartikMistry do you know if MinT ever successfully downloaded this version of the model?
No. But the nllb200-600M model is a primary model for MinT; any downtime should have been reported much earlier.
Also, I deployed MinT successfully on 07 Nov 24, and it didn't log any failures like this.
Currently, MinT is down.
I've temporarily configured the vhost to be more "download friendly":
```
EnableSendfile Off
EnableMMAP Off
Timeout 600
KeepAliveTimeout 15
LimitRequestBody 0
```
I also tried using only LimitRequestBody 0, which seemed to be the culprit in my previous tests, with no more success.
Neither change let me retrieve the file, and debug logging told me nothing more than the HTTP 200 we previously identified. I'll keep digging.
It seems the issue could come from Varnish terminating the transfer once a timeout is reached; maybe the transfer took less time in that previous deployment? The question has been raised with Traffic on IRC.
Change #1112056 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] peopleweb: request timeout to allow downloading larger files
Change #1112056 merged by Arnaudb:
[operations/puppet@production] peopleweb: request timeout to allow downloading larger files
Merging the CR, running Puppet and retrying the download was not enough to fix the issue. I asked Traffic about Varnish and they told me it was probably not the culprit.
It seems that at least one pod is running fine on eqiad?
```
$ kube_env machinetranslation eqiad
$ kubectl get pods
NAME                                             READY   STATUS    RESTARTS            AGE
machinetranslation-production-687bb55f9d-2mdrs   2/3     Running   10965 (3m36s ago)   45d
machinetranslation-production-687bb55f9d-wt22h   3/3     Running   0                   59d   <-- this
```
@KartikMistry @santhosh As this is proving to be more complex, I suggest looking into the correct fix - moving the files onto the Thanos Swift cluster.
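A rough sketch of what the migration side could look like, assuming the Swift cluster is used through an S3-compatible API; the endpoint, bucket and credentials below are placeholders:

```python
# Hedged sketch of uploading a model to an S3-compatible bucket.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    "s3",
    endpoint_url="https://thanos-swift.example.wmnet",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Multipart upload keeps each request small, so no single long-lived transfer
# has to carry the whole model file.
cfg = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                     multipart_chunksize=64 * 1024 * 1024)

s3.upload_file(
    Filename="nllb200-600M/model.bin",
    Bucket="mint-models",                 # placeholder bucket
    Key="nllb/nllb200-600M/model.bin",
    Config=cfg,
)
```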
Change #1112205 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] peopleweb: disable envoy request timeout, enable log
No, nothing useful so far!
FWIW, nllb200-600M.tgz is available under the same path and should not be subject to the limitation seen previously, as it is below the failing size threshold.
Perhaps it could help work around this issue while storage is being migrated away from this server?
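A minimal sketch of that workaround; the tar.gz URL below is the one shown in the wget tests above, and the exact path of nllb200-600M.tgz may differ:

```python
# Fetch the compressed archive (which stays below the failing size) and unpack it.
import tarfile
import urllib.request

url = "https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.tar.gz"
urllib.request.urlretrieve(url, "model.tar.gz")

with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall(path="models/nllb200-600M")
```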
Change #1113775 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[mediawiki/services/machinetranslation@master] models: Use compressed model to download
@KartikMistry just making sure this is a short-term solution and that it doesn't stop the work on moving the models to the Swift cluster.
Change #1113775 merged by jenkins-bot:
[mediawiki/services/machinetranslation@master] models: Use compressed model to download
Change #1115314 had a related patch set uploaded (by KartikMistry; author: KartikMistry):
[operations/deployment-charts@master] Update MinT to 2025-01-30-080456-production
Yes! The long-term solution is what I'm working on.
Just to note: it seems the model download issue has been resolved since 24 January. I've not seen the earlier failed downloads in the logs either.
Did we change anything?
Change #1115314 merged by jenkins-bot:
[operations/deployment-charts@master] Update MinT to 2025-02-05-115716-production
Mentioned in SAL (#wikimedia-operations) [2025-02-19T07:17:30Z] <kart_> Updated MinT to 2025-02-05-115716-production (T383750, T385552)
Closing this task, as https://phabricator.wikimedia.org/T386889 is the next issue to fix for a longer-term solution, along with the model storage work.
Change #1112205 abandoned by Arnaudb:
[operations/puppet@production] peopleweb: disable envoy request timeout, enable log