MinT: Deployment timeouts for eqiad
Closed, Resolved | Public | 4 Estimated Story Points | BUG REPORT

Description

MinT deployments time out in eqiad, most likely because the models have to be downloaded separately for each worker. Deployments in codfw have been fine so far.

This can probably be fixed with a change similar to the one (afdd1cf985842cf13d2eaaf86453fc618a03ab79) we made for recommendation-api-ng.
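
To illustrate the suspected cause, here is a minimal Python sketch of how repeated model downloads could exceed the time a Kubernetes probe allows for startup. It is not taken from the MinT chart or from the recommendation-api-ng fix: `ProbeConfig`, `total_download_seconds`, `fits_budget`, and every number below are hypothetical, and the assumption that workers download one after another is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical subset of Kubernetes probe settings that bound startup time."""

    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def startup_budget_seconds(self) -> int:
        # The probe runs every `period_seconds` after the initial delay and
        # gives up after `failure_threshold` consecutive failures.
        return self.initial_delay_seconds + self.period_seconds * self.failure_threshold


def total_download_seconds(workers: int, seconds_per_download: float, shared: bool) -> float:
    # Assumption: without sharing, every worker fetches its own copy in turn,
    # so the cost grows with the worker count. With sharing it is paid once.
    return seconds_per_download if shared else workers * seconds_per_download


def fits_budget(probe: ProbeConfig, workers: int, seconds_per_download: float, shared: bool) -> bool:
    # True when all downloads finish before the probe would fail the pod.
    return total_download_seconds(workers, seconds_per_download, shared) <= probe.startup_budget_seconds()


if __name__ == "__main__":
    # Illustrative values only, not the real MinT settings.
    probe = ProbeConfig(initial_delay_seconds=10, period_seconds=10, failure_threshold=30)
    print(probe.startup_budget_seconds())                                      # 310
    print(fits_budget(probe, workers=4, seconds_per_download=120, shared=False))  # False
    print(fits_budget(probe, workers=4, seconds_per_download=120, shared=True))   # True
```

Under these assumptions there are two levers: raise the probe budget, or make the download cost independent of the number of workers.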

Event Timeline

Nikerabbit set the point value for this task to 4.
Nikerabbit moved this task from Backlog to Infrastructure on the MinT board.

Change #1125093 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] MinT: Increase readiness probe

https://gerrit.wikimedia.org/r/1125093

Change #1128067 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] MinT: staging: Increase readiness probe

https://gerrit.wikimedia.org/r/1128067

Change #1128067 merged by jenkins-bot:

[operations/deployment-charts@master] MinT: staging: Increase liveness probe

https://gerrit.wikimedia.org/r/1128067

I'm checking with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1136148 whether bumping the chart makes the config change we made (https://gerrit.wikimedia.org/r/1128067) take effect. Currently, the diff doesn't list the config change in staging.

Change #1125093 abandoned by KartikMistry:

[operations/deployment-charts@master] MinT: Increase liveness probe

Reason:

Not useful with S3, but can be restored if needed.

https://gerrit.wikimedia.org/r/1125093

Change #1162985 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] WIP: services/machinetranslation: adjust startup probe delays

https://gerrit.wikimedia.org/r/1162985

KartikMistry changed the task status from Stalled to In Progress. (Jun 24 2025, 3:13 AM)
Nikerabbit changed the task status from In Progress to Stalled. (Jul 14 2025, 8:36 AM)
Nikerabbit removed KartikMistry as the assignee of this task.
Nikerabbit moved this task from In Progress to Backlog on the LPL Essential (2025 Jul-Oct) board.

@klausman While deploying T335491, I didn't see any timeout for eqiad. Should we close this task?

Nikerabbit changed the task status from Stalled to In Progress. (Jul 16 2025, 7:20 AM)
Nikerabbit assigned this task to KartikMistry.
Nikerabbit moved this task from Backlog to In Progress on the LPL Essential (2025 Jul-Oct) board.

Marking as not stalled now that the S3 migration is done.

Should we go ahead?

Yes, please do! Sorry for the late response, I was out on PTO until today.

Nikerabbit moved this task from Backlog to Done on the LPL Essential (2025 Jul-Oct) board.
Nikerabbit removed a project: Patch-For-Review.

Change #1162985 abandoned by Nikerabbit:

[operations/deployment-charts@master] WIP: services/machinetranslation: adjust startup probe delays

Reason:

S3 migration makes this less/unnecessary.

https://gerrit.wikimedia.org/r/1162985