MinT: Fails to download models/files from peopleweb.discovery.wmnet
Closed, ResolvedPublic8 Estimated Story PointsBUG REPORT

Description

During MinT deployment, I noticed that MinT is not able to download any files/models from peopleweb.discovery.wmnet/~santhosh, where we have stored all the models.

Log: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2025.01.15?id=SNSHaJQBNT5SlsGgxdKe

Similar timeouts can be noticed in production as well.

Event Timeline

KartikMistry triaged this task as Unbreak Now! priority.Jan 16 2025, 4:15 AM

This blocks further deployments of MinT and probably downtime for MinT too, setting it to Unbreak Now! priority.

LSobanski lowered the priority of this task from Unbreak Now! to High.Jan 16 2025, 9:03 AM
LSobanski added subscribers: MatthewVernon, LSobanski.

Can we treat this as a good opportunity to get back to https://phabricator.wikimedia.org/T335491 and migrate the models to object storage? The apus Ceph cluster with an S3 interface is now available and this looks like a reasonable use case for it (cc @MatthewVernon).

people.wikimedia.org "hosts some of the user public files" and should not be in the critical path for production services, so I don't think this qualifies as Unbreak Now.

The download does fail consistently, and is logged as a 200:
2025-01-16T09:22:49 65001309 -/200 2464511302 GET http://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin - application/octet-stream - Wget/2.2.0 - - - - d66cd6e8-46cd-4e23-910f-5d8477edefa0

curl gives a bit more information about the download termination; Apache still sends an HTTP 200:

curl: (92) HTTP/2 stream 1 was not closed cleanly: CANCEL (err 8)
HTTP Response Code: 200
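
(For reproduction purposes, an invocation along these lines shows the same symptom; the exact flags from that test aren't recorded here, so treat this as a sketch.)

# Fetch the model over HTTP/2, discard the body, and print the final response code;
# -v makes the stream CANCEL visible when the transfer is cut short.
$ curl -v --http2 -o /dev/null \
    -w 'HTTP Response Code: %{http_code}\n' \
    https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin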

> Can we treat this as a good opportunity to get back to https://phabricator.wikimedia.org/T335491 and migrate the models to object storage? The apus Ceph cluster with an S3 interface is now available and this looks like a reasonable use case for it (cc @MatthewVernon).

The Thanos Swift cluster has had an account ready to go for this since June 2023.
[which is not to say this wouldn't be an apus use case, but if this has become urgent, that might not be the best first-user-of-system situation]

I tried downgrading to HTTP/1.1 and added a bit more verbosity: P72105

with the same result (the download fails with an HTTP 200). It seems to fail at a different offset every time, as the number of missing bytes varies from try to try.
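
(One way to confirm the truncation and the varying offset, assuming nothing beyond stock curl and coreutils, is to compare the advertised size with what actually lands on disk:)

# HEAD request for the advertised size
$ curl -sI https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin | grep -i content-length
# Full download over HTTP/1.1, then measure what arrived
$ curl -s --http1.1 -o /tmp/model.bin https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin
$ stat -c %s /tmp/model.bin   # falls short of Content-Length by a different amount each run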

A multipart download retrieves the file properly: P72108
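
(P72108 is not reproduced here; a minimal sketch of the idea - fetching the file in byte ranges and stitching the parts together - looks like this:)

$ URL=https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin
$ SIZE=2464511302                                        # Content-Length from the access log above
$ curl -s -r 0-$((SIZE/2 - 1)) -o part1 "$URL"           # first half
$ curl -s -r $((SIZE/2))-$((SIZE - 1)) -o part2 "$URL"   # second half
$ cat part1 part2 > model.bin
$ stat -c %s model.bin                                   # should now match $SIZE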

Oddly, compressing the file fixes the issue, while duplicating the file reproduces it consistently:

$ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin
HTTP response 200  [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin]
/dev/null             79% [===============================================================================================================================================>                                      ]    1.81G   28.22MB/s
                          [Files: 1  Bytes: 1.81G [27.90MB/s] Redirects: 0  Todo: 0  Errors: 0                                                                                                                   ]

$ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.tar.gz
HTTP response 200  [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.tar.gz]
/dev/null            100% [=====================================================================================================================================================================================>]    1.34G   28.78MB/s
                          [Files: 1  Bytes: 1.34G [27.99MB/s] Redirects: 0  Todo: 0  Errors: 0                                                                                                                   ]

$ wget --verbose --output-document /dev/null https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin.1
HTTP response 200  [https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.bin.1]
/dev/null             79% [===============================================================================================================================================>                                      ]    1.82G   28.18MB/s
                          [Files: 1  Bytes: 1.82G [27.94MB/s] Redirects: 0  Todo: 0  Errors: 0                                                                                                                   ]

> Oddly, compressing the file fixes the issue, while duplicating the file reproduces it consistently:

Has anything changed recently that could be causing this issue? The MinT service downloads the models on every restart/deployment.

I think this is due to the file size; I generated an empty 2 GB test file, which hits the same issue around the same offset (~1.80 GB). The model file itself is dated Mar 6 2023, so I'm guessing that something in the server's configuration has changed since then. I have found nothing yet; I'll keep on digging.
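
(For the record, a test file of that size can be generated with something like the following; the path is hypothetical, any web-served directory on the host would do.)

# Create a sparse ~2 GB file under a public_html tree and try to fetch it over the web
$ truncate -s 2G ~/public_html/test-2g.bin
$ wget --output-document /dev/null https://people.wikimedia.org/~<user>/test-2g.bin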

@KartikMistry do you know if MinT ever successfully downloaded this version of the model?

> @KartikMistry do you know if MinT ever successfully downloaded this version of the model?

No. But the nllb200-600M model is a primary model for MinT - any downtime should have been reported much earlier.

Also, I deployed MinT successfully on 07 Nov 24, and it didn't appear to log any failure like this time.

Currently, MinT is down.

I've temporarily configured the vhost to be more "download friendly":

EnableSendfile Off
EnableMMAP Off
Timeout 600
KeepAliveTimeout 15
LimitRequestBody 0

I also tried using only LimitRequestBody 0, which seemed to be the culprit in my previous tests, with no more success though.
I still had no luck retrieving the file, and debug logging told me nothing more than the HTTP 200 we previously identified. I'll keep digging.
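
(For context, those directives sit in the vhost roughly as below; this is a sketch, not the actual puppetised configuration.)

<VirtualHost *:443>
    ServerName people.wikimedia.org
    # Serve large static files without the sendfile()/mmap() shortcuts
    EnableSendfile Off
    EnableMMAP Off
    # More headroom for slow transfers; keep-alive unchanged; no request-body cap
    Timeout 600
    KeepAliveTimeout 15
    LimitRequestBody 0
    # ... rest of the existing vhost configuration ...
</VirtualHost>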

> Also, I deployed MinT successfully on 07 Nov 24, and it didn't appear to log any failure like this time.

It seems the issue could come from Varnish terminating the transfer after a timeout is reached; maybe the transfer took less time during that previous deployment? The question has been raised with Traffic on IRC.

Change #1112056 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] peopleweb: request timeout to allow downloading larger files

https://gerrit.wikimedia.org/r/1112056

Change #1112056 merged by Arnaudb:

[operations/puppet@production] peopleweb: request timeout to allow downloading larger files

https://gerrit.wikimedia.org/r/1112056

After merging the CR, running Puppet, and trying to download the file again, this was not enough to fix the issue. I've asked Traffic about Varnish and they told me it was probably not our culprit.

It seems that at least one pod is running fine in eqiad?

$ kube_env machinetranslation eqiad
$ kubectl get pods
NAME                                             READY   STATUS    RESTARTS            AGE
machinetranslation-production-687bb55f9d-2mdrs   2/3     Running   10965 (3m36s ago)   45d
machinetranslation-production-687bb55f9d-wt22h   3/3     Running   0                   59d <-- this
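
(To see whether the crash-looping pod is stuck on the model download, something like the following works; the container name here is an assumption, not taken from the chart.)

$ kubectl logs machinetranslation-production-687bb55f9d-2mdrs \
    -c machinetranslation-production --previous | grep -iE 'model|download|error'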

@KartikMistry @santhosh As this is proving to be more complex, I suggest looking into the correct fix - moving the files onto the Thanos Swift cluster.

> @KartikMistry @santhosh As this is proving to be more complex, I suggest looking into the correct fix - moving the files onto the Thanos Swift cluster.

Indeed. I'll start working in that direction now.
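
(As a rough sketch of the target setup - the endpoint and bucket names below are placeholders, not the actual Thanos Swift account:)

# Upload the model once to the S3-compatible endpoint
$ aws s3 cp model.bin s3://mint-models/nllb200-600M/model.bin \
    --endpoint-url https://thanos-swift.example.wmnet
# MinT would then fetch it from object storage on startup instead of from peopleweb
$ aws s3 cp s3://mint-models/nllb200-600M/model.bin model.bin \
    --endpoint-url https://thanos-swift.example.wmnet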

Change #1112205 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] peopleweb: disable envoy request timeout, enable log

https://gerrit.wikimedia.org/r/1112205

@ABran-WMF Did we get any useful logs?

No, nothing useful so far!
FWIW, nllb200-600M.tgz is available under the same path and should not be subject to the limitation you hit previously, as it is under the fatal threshold.
Perhaps it could help work around this issue while storage is being migrated away from this server?
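
(The short-term workaround then amounts to fetching the compressed model, which completes per the wget test above, and unpacking it - roughly:)

# The tarball (1.34G) stays under the ~1.8 GB mark where model.bin fails
$ wget --output-document model.tar.gz \
    https://people.wikimedia.org/~santhosh/nllb/nllb200-600M/model.tar.gz
$ tar -xzf model.tar.gz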

Change #1113775 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] models: Use compressed model to download

https://gerrit.wikimedia.org/r/1113775

@KartikMistry just making sure that this is a short-term solution and it doesn't stop the work on moving the models to the Swift cluster

Change #1113775 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] models: Use compressed model to download

https://gerrit.wikimedia.org/r/1113775

Change #1115314 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2025-01-30-080456-production

https://gerrit.wikimedia.org/r/1115314

> @KartikMistry just making sure that this is a short-term solution and it doesn't stop the work on moving the models to the Swift cluster

Yes! The long-term solution is what I'm working on.

Just to note - it seems the model download issue has been resolved since 24 Jan. I haven't seen the earlier failed-download errors in the log either.

Did we change anything?

Change #1115314 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2025-02-05-115716-production

https://gerrit.wikimedia.org/r/1115314

Nikerabbit lowered the priority of this task from High to Medium.Feb 17 2025, 8:10 AM
Nikerabbit set the point value for this task to 8.

Mentioned in SAL (#wikimedia-operations) [2025-02-19T07:17:30Z] <kart_> Updated MinT to 2025-02-05-115716-production (T383750, T385552)

Closing; https://phabricator.wikimedia.org/T386889 is the next issue to fix for a longer-term solution, along with model storage.

Just to clarify, large files are still downloaded from people hosts?

> Just to clarify, large files are still downloaded from people hosts?

Yes. We will keep it that way until model storage is sorted out in T335491.

Change #1112205 abandoned by Arnaudb:

[operations/puppet@production] peopleweb: disable envoy request timeout, enable log

https://gerrit.wikimedia.org/r/1112205