
Migrate ml-services to mw-api-int
Open, Medium · Public · 5 Estimated Story Points

Description

Because these services don't use the Envoy service mesh to reach MediaWiki, they were not taken into account during the migration of internal services.

The following services call the bare-metal cluster URL api-ro.discovery.wmnet directly:

service | done | CR
article-descriptions | | staging, prod
articletopic-outlink | | staging, prod
experimental | | prod
readability | | staging, prod
revertrisk | | staging, prod
revscoring-articlequality | | staging, prod
revscoring-articletopic | | staging, prod
revscoring-draftquality | | staging, prod
revscoring-drafttopic | | staging, prod
revscoring-editquality-damaging | | staging, prod
revscoring-editquality-goodfaith | | staging, prod
revscoring-editquality-reverted | | staging, prod
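
A list like the one above could be produced by scanning each service's configured MediaWiki URL for the old hostname. This is a minimal sketch, assuming the per-service settings have already been loaded into a plain dict. The `configs` mapping, its values, and the `services_on_old_endpoint` helper are made up for illustration and are not taken from the deployment-charts repository.

```python
from urllib.parse import urlsplit

OLD_HOST = "api-ro.discovery.wmnet"  # bare-metal endpoint named in the task


def services_on_old_endpoint(configs):
    """Return the sorted names of services whose URL still targets OLD_HOST.

    `configs` maps a service name to its configured MediaWiki URL, or to
    None when no explicit URL is set (hypothetical structure).
    """
    hits = []
    for name, url in configs.items():
        if url and urlsplit(url).hostname == OLD_HOST:
            hits.append(name)
    return sorted(hits)


# Illustrative values only, not real chart settings.
configs = {
    "readability": "http://api-ro.discovery.wmnet",
    "revertrisk": "http://api-ro.discovery.wmnet",
    "already-moved": "http://new-endpoint.example",
    "no-explicit-url": None,
}
print(services_on_old_endpoint(configs))
```

Comparing the parsed hostname rather than searching the raw string avoids false matches when the old name only appears in a path or query.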

Details

Subject | Repo | Branch | Lines +/-
 | operations/deployment-charts | master | +1 -1
 | operations/deployment-charts | master | +1 -3
 | operations/deployment-charts | master | +2 -3
 | operations/deployment-charts | master | +0 -2
 | operations/deployment-charts | master | +0 -2
 | operations/deployment-charts | master | +2 -2
 | operations/deployment-charts | master | +1 -1
 | operations/deployment-charts | master | +1 -2
 | operations/deployment-charts | master | +3 -4
 | operations/deployment-charts | master | +0 -2
 | operations/deployment-charts | master | +0 -2
 | operations/deployment-charts | master | +89 -154
 | operations/deployment-charts | master | +23 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +15 -0
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +11 -0
 | operations/deployment-charts | master | +4 -0
 | operations/deployment-charts | master | +19 -1
 | operations/deployment-charts | master | +9 -1
 | operations/deployment-charts | master | +92 -0
 | operations/deployment-charts | master | +52 -0

Event Timeline

Change #1018959 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] article-description: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018959

Change #1018960 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] article-description: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018960

Change #1018961 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] articletopic-outlink: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018961

Change #1018962 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] articletopic-outlink: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018962

Change #1018963 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] experimental: Switch to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018963

Change #1018964 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] readability: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018964

Change #1018965 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] readability: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018965

Change #1018986 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revertrisk: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018986

Change #1018987 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revertrisk: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018987

Change #1018988 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-articlequality: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018988

Change #1018989 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-articlequality: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018989

Change #1018990 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-articletopic: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018990

Change #1018991 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-articletopic: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018991

Change #1018992 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-draftquality: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018992

Change #1018993 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-draftquality: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018993

Change #1018994 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-drafttopic: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018994

Change #1018995 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-drafttopic: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018995

Change #1018996 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-editquality-damaging: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018996

Change #1018997 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-editquality-damaging: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018997

Change #1018998 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018998

Change #1018999 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018999

Change #1019000 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-editquality-reverted: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1019000

Change #1019001 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] revscoring-editquality-reverted: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1019001

Aaaand I just realized they all use http and not https, so now I can change them all.
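
If the services do share one scheme, as the comment above suggests, the per-service change comes down to swapping the hostname and leaving the rest of the URL alone. A minimal sketch of that swap follows. The hostname `mw-api-int-ro.discovery.wmnet` is an assumption derived from the patch titles (the thread only names the endpoint as mw-api-int-ro), and `switch_host` is a hypothetical helper, not code from the charts.

```python
from urllib.parse import urlsplit, urlunsplit


def switch_host(url, new_host, new_port=None):
    """Return `url` with its host (and optionally port) replaced.

    Scheme, path, query and fragment are preserved unchanged.
    """
    parts = urlsplit(url)
    netloc = new_host if new_port is None else f"{new_host}:{new_port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Assumed new hostname; the old one is the endpoint named in the task.
print(switch_host("http://api-ro.discovery.wmnet/w/api.php",
                  "mw-api-int-ro.discovery.wmnet"))
```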

Change #1019061 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] ml-serve: Add istio config for mw-api-int-ro

https://gerrit.wikimedia.org/r/1019061

Change #1019074 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] ml-staging-codfw: Override mediawiki-app-vs

https://gerrit.wikimedia.org/r/1019074

Change #1019061 merged by Elukey:

[operations/deployment-charts@master] ml-serve: Add istio config for mw-api-int-ro

https://gerrit.wikimedia.org/r/1019061

Change #1019074 merged by Elukey:

[operations/deployment-charts@master] ml-staging-codfw: Override mediawiki-app-vs

https://gerrit.wikimedia.org/r/1019074

Change #1018996 merged by Elukey:

[operations/deployment-charts@master] revscoring-editquality-damaging: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018996

Change #1019288 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: override port 80 with 4680 in istio configs for ml-staging

https://gerrit.wikimedia.org/r/1019288

Change #1018959 merged by Elukey:

[operations/deployment-charts@master] article-description: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018959

Change #1019288 merged by Elukey:

[operations/deployment-charts@master] admin_ng: add port 4680 in istio configs for ml-staging

https://gerrit.wikimedia.org/r/1019288

Change #1018961 merged by Elukey:

[operations/deployment-charts@master] articletopic-outlink: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018961

Change #1018988 merged by Elukey:

[operations/deployment-charts@master] revscoring-articlequality: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018988

Change #1018990 merged by Elukey:

[operations/deployment-charts@master] revscoring-articletopic: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018990

Change #1018992 merged by Elukey:

[operations/deployment-charts@master] revscoring-draftquality: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018992

Change #1018994 merged by Elukey:

[operations/deployment-charts@master] revscoring-drafttopic: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018994

Change #1018998 merged by Elukey:

[operations/deployment-charts@master] revscoring-editquality-goodfaith: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018998

Change #1019000 merged by Elukey:

[operations/deployment-charts@master] revscoring-editquality-reverted: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1019000

Change #1018964 merged by Elukey:

[operations/deployment-charts@master] readability: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018964

Change #1018963 merged by jenkins-bot:

[operations/deployment-charts@master] experimental: Switch to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018963

Change #1018986 merged by Elukey:

[operations/deployment-charts@master] revertrisk: Switch staging to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018986

Current status:

  • All services are deployed in ml-staging. I still need to double-check that all the pods are running, but so far I haven't noticed a problem.
  • Plan the work for prod (we'll likely need to depool one DC at a time when we do it).

Added some thoughts to T353622#9723070. I opened a big can of worms while testing staging :) The upgrade is more complex than anticipated, but we should be able to do it this week or next week at the latest.

Change #1021490 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: move Istio configs to mw-api-int-ro for ml-serve

https://gerrit.wikimedia.org/r/1021490

Change #1021490 merged by Elukey:

[operations/deployment-charts@master] admin_ng: move Istio configs to mw-api-int-ro for ml-serve

https://gerrit.wikimedia.org/r/1021490

After a lot of tests and config changes, we are almost ready to proceed with prod. Hopefully we'll get to it on May 2nd.

All changes rebased and ready to go (for prod). The main idea is the following:

  • Remove WIKI_URL for revscoring isvcs, so we'll use the transparent proxy functionality.
  • Move the rest to the new mw-api-int-ro discovery endpoint, and slowly migrate to transparent proxy later on.
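
The two cases above can be pictured as one lookup: an explicit `WIKI_URL` wins, and otherwise the service calls the wiki's own hostname and leaves routing to the mesh. A minimal sketch follows. It assumes that, with `WIKI_URL` unset, an isvc targets the public wiki hostname and the transparent proxy redirects that traffic internally, which is a guess at the behavior rather than something stated in this task. `mediawiki_base_url` is a hypothetical helper.

```python
import os


def mediawiki_base_url(wiki_host, env=None):
    """Pick the base URL a service would call for `wiki_host`.

    If WIKI_URL is set, use it (explicit discovery endpoint).
    Otherwise fall back to the wiki's own hostname, relying on the
    transparent proxy to route the request (assumed behavior).
    """
    env = os.environ if env is None else env
    override = env.get("WIKI_URL")
    if override:
        return override.rstrip("/")
    return f"https://{wiki_host}"


# Explicit endpoint (hypothetical value) versus transparent-proxy fallback.
print(mediawiki_base_url("en.wikipedia.org", {"WIKI_URL": "http://endpoint.example/"}))
print(mediawiki_base_url("en.wikipedia.org", {}))
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.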

Overall procedure to upgrade:

  1. Depool codfw from inference.discovery.wmnet
  2. Wait some time for traffic to drain, confirm via Grafana metrics before proceeding.
  3. Apply admin_ng changes for knative-serving (new Istio configs, already ready to go)
  4. Merge one of the revscoring prod endpoint changes (see task description), and test the isvc
  5. Merge one of the non-revscoring endpoint changes, and check the isvc.
  6. Merge and rollout all the changes.
  7. Use httpbb from deploy1002 to check all the isvcs
  8. Once ready, repool traffic.
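
The point of the ordering above is that each step gates the next, so a failed canary never reaches the repool. A minimal sketch of that gating logic follows, with every action stubbed out as a lambda. The step names paraphrase the list, and nothing here calls real tooling such as httpbb or the depool commands.

```python
def run_rollout(steps):
    """Run (name, action) pairs in order and stop at the first failure.

    Each action is a zero-argument callable returning True on success.
    Returns (completed_step_names, failed_step_name_or_None).
    """
    done = []
    for name, action in steps:
        if not action():
            return done, name
        done.append(name)
    return done, None


# Stubbed actions; the second canary is made to fail for illustration.
steps = [
    ("depool codfw", lambda: True),
    ("confirm traffic drained", lambda: True),
    ("apply admin_ng changes", lambda: True),
    ("canary: one revscoring isvc", lambda: True),
    ("canary: one non-revscoring isvc", lambda: False),
    ("roll out remaining changes", lambda: True),
    ("check all isvcs", lambda: True),
    ("repool codfw", lambda: True),
]
done, failed = run_rollout(steps)
print("stopped at:", failed)
```

Because the DC stays depooled when a step fails, there is time to investigate or revert before any user traffic returns.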

I'd suggest waiting a couple of days to watch for anomalies, errors, etc., and then proceeding with eqiad. The idea is that if anything is really wrong with codfw, we can depool very quickly and investigate.

Timeline:

  • Move codfw to the new endpoint on May 2nd
  • Move eqiad to the new endpoint on May 6th

Change #1018997 merged by Elukey:

[operations/deployment-charts@master] revscoring-editquality-damaging: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018997

Change #1018960 merged by Elukey:

[operations/deployment-charts@master] article-description: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018960

Change #1018965 merged by Elukey:

[operations/deployment-charts@master] readability: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018965

Change #1018987 merged by Elukey:

[operations/deployment-charts@master] revertrisk: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018987

Change #1018989 merged by Elukey:

[operations/deployment-charts@master] revscoring-articlequality: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018989

Change #1018991 merged by Elukey:

[operations/deployment-charts@master] revscoring-articletopic: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018991

Change #1018999 merged by Elukey:

[operations/deployment-charts@master] revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018999

Change #1019001 merged by Elukey:

[operations/deployment-charts@master] revscoring-editquality-reverted: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1019001

Change #1018962 merged by Elukey:

[operations/deployment-charts@master] articletopic-outlink: Switch prod to mw-api-int-ro

https://gerrit.wikimedia.org/r/1018962

Status: Lift Wing codfw has been migrated successfully; we are going to do eqiad on Monday the 6th.

And eqiad has been migrated as well, all done :)

elukey set the point value for this task to 5. (Mon, May 6, 1:30 PM)
elukey moved this task from In Progress to 2023-2024 Q4 Done on the Machine-Learning-Team board.