Create ORES migration endpoint (ORES/Liftwing translation)
Closed, Resolved, Public

Description

We would like to create a migration endpoint for ORES that translates ORES calls to the corresponding ones for LiftWing and returns their results.

In the process of migrating ORES we want everyone who uses ORES to redirect their traffic to LiftWing. However, we can't guarantee that everyone will switch before we start to decommission ORES.
A migration endpoint has two main benefits:

  • We respect the community and the users by not putting a hard stop on the usage of ORES, and we decommission it in a graceful and respectful manner.
  • By creating such an application we control when we make the switch, and the LiftWing launch is not blocked by lack of adoption/migration.

The initial proposal is to deploy a FastAPI application on our Kubernetes cluster (a deployment and a service) that will translate ORES calls to LiftWing.
The two main reasons to implement the above with FastAPI and k8s are:

  • Ease of Development: it is easy to create a microservice with FastAPI that can handle async connections.
  • Scalability: If this application receives more traffic we can scale it in k8s by adding more replicas.
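
As a rough illustration of the translation step, here is a minimal sketch that turns an ORES v3 score path into a Lift Wing call. It is not the actual ores-legacy code. The base URL, the `/v1/models/<name>:predict` path, the `<context>-<model>` naming scheme and the names `translate_ores_path` and `LiftWingCall` are assumptions made for this example.

```python
import re
from typing import NamedTuple

# Hypothetical base URL; a real deployment would read it from configuration.
LIFTWING_URL = "https://liftwing.example.invalid"

# Matches ORES paths such as /v3/scores/enwiki/123433/damaging
ORES_SCORE_PATH = re.compile(
    r"^/v3/scores/(?P<context>\w+)/(?P<rev_id>\d+)/(?P<model>\w+)/?$"
)


class LiftWingCall(NamedTuple):
    url: str
    payload: dict


def translate_ores_path(path: str) -> LiftWingCall:
    """Build the Lift Wing POST target and body for one ORES score path."""
    match = ORES_SCORE_PATH.match(path)
    if match is None:
        raise ValueError(f"not an ORES v3 score path: {path!r}")
    # Assumption: model servers are named "<context>-<model>".
    model_name = f"{match['context']}-{match['model']}"
    url = f"{LIFTWING_URL}/v1/models/{model_name}:predict"
    return LiftWingCall(url=url, payload={"rev_id": int(match["rev_id"])})
```

Keeping the translation a pure function (no network access) makes it easy to unit test separately from the async HTTP layer that would send the request.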

After we switch on LiftWing we can use ores.wikimedia.org as the endpoint of this application, monitor its traffic, and ask the applications/users that still use ORES to transition to LiftWing.
Once this is done, the application can be deprecated.

Details

Repo                                         Branch      Lines +/-
machinelearning/liftwing/inference-services  main        +11 -8
operations/dns                               master      +1 -0
operations/puppet                            production  +5 -0
operations/deployment-charts                 master      +4 -0
operations/deployment-charts                 master      +3 -0
machinelearning/liftwing/inference-services  main        +308 -9
machinelearning/liftwing/inference-services  main        +12 -2
machinelearning/liftwing/inference-services  main        +28 -7
operations/deployment-charts                 master      +2 -2
operations/deployment-charts                 master      +3 -3
operations/deployment-charts                 master      +1 -1
operations/deployment-charts                 master      +4 -4
machinelearning/liftwing/inference-services  main        +1 -1
machinelearning/liftwing/inference-services  main        +2 -2
machinelearning/liftwing/inference-services  main        +6 -1
machinelearning/liftwing/inference-services  main        +26 -2
operations/deployment-charts                 master      +4 -0
operations/deployment-charts                 master      +26 -0
operations/deployment-charts                 master      +2 -0
operations/deployment-charts                 master      +82 -0
operations/deployment-charts                 master      +4 -1
operations/deployment-charts                 master      +5 -3
operations/deployment-charts                 master      +80 -0
operations/deployment-charts                 master      +5 -0
operations/deployment-charts                 master      +371 -0
labs/private                                 master      +9 -0
operations/deployment-charts                 master      +1 K -0
machinelearning/liftwing/inference-services  main        +1 K -0
machinelearning/liftwing/inference-services  main        +86 -28
machinelearning/liftwing/inference-services  main        +94 -9
machinelearning/liftwing/inference-services  main        +17 -0
machinelearning/liftwing/inference-services  main        +22 -7
machinelearning/liftwing/inference-services  main        +135 -33
operations/puppet                            production  +4 -0
integration/config                           master      +15 -0

Event Timeline


Change 904178 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: create FastAPI app ofr ores legacy

https://gerrit.wikimedia.org/r/904178

Change 904777 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: FastAPI chart using sextant for ores-legacy service

https://gerrit.wikimedia.org/r/904777

Change 908191 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: deployment of ores-legacy app in staging

https://gerrit.wikimedia.org/r/908191

Change 904777 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: FastAPI chart using sextant for ores-legacy service

https://gerrit.wikimedia.org/r/904777

Change 909974 had a related patch set uploaded (by Klausman; author: Klausman):

[labs/private@master] hiera: Add faux secrets for ores-legacy service on Lift Wing

https://gerrit.wikimedia.org/r/909974

Change 909974 merged by Klausman:

[labs/private@master] hiera: Add faux secrets for ores-legacy service on Lift Wing

https://gerrit.wikimedia.org/r/909974

Change 909992 had a related patch set uploaded (by Klausman; author: Tobias Klausmann):

[operations/deployment-charts@master] Lift Wing: Add new namespace for ores-legacy service

https://gerrit.wikimedia.org/r/909992

Change 904178 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: create FastAPI app for ores legacy

Reason:

We decided to proceed with a chart built using the sextant module

https://gerrit.wikimedia.org/r/904178

Change 909993 had a related patch set uploaded (by Klausman; author: Tobias Klausmann):

[operations/deployment-charts@master] Lift Wing: Add new namespace for ores-legacy service

https://gerrit.wikimedia.org/r/909993

Change 909992 abandoned by Klausman:

[operations/deployment-charts@master] Lift Wing: Add new namespace for ores-legacy service

Reason:

Miscommunication within team

https://gerrit.wikimedia.org/r/909992

Change 909993 merged by Klausman:

[operations/deployment-charts@master] admin_ng: Add new namespace for the ores-legacy service on Lift Wing

https://gerrit.wikimedia.org/r/909993

Namespace has been created on staging, and is visible:

# kubectl get  namespace |grep -E '(NAME|ores)'
NAME                               STATUS   AGE
ores-legacy                        Active   2m10s
#

Change 908191 merged by Elukey:

[operations/deployment-charts@master] ml-services: deployment of ores-legacy app in staging

https://gerrit.wikimedia.org/r/908191

Change 910429 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set 'deploy' for ores-legacy in ml-serve's config

https://gerrit.wikimedia.org/r/910429

Change 910429 merged by Elukey:

[operations/deployment-charts@master] admin_ng: set 'deploy' for ores-legacy in ml-serve's config

https://gerrit.wikimedia.org/r/910429

Change 910442 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set deployTLSCertificate for ores-legacy in ml clusters

https://gerrit.wikimedia.org/r/910442

Change 910442 merged by Elukey:

[operations/deployment-charts@master] admin_ng: set deployTLSCertificate for ores-legacy in ml clusters

https://gerrit.wikimedia.org/r/910442

Change 914313 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add private secretes to the ores-legacy helmfile config

https://gerrit.wikimedia.org/r/914313

Change 914313 merged by Elukey:

[operations/deployment-charts@master] ml-services: add private secretes to the ores-legacy helmfile config

https://gerrit.wikimedia.org/r/914313

Change 914319 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add network policies for ores-legacy

https://gerrit.wikimedia.org/r/914319

Change 914319 merged by Elukey:

[operations/deployment-charts@master] ml-services: add network policies for ores-legacy

https://gerrit.wikimedia.org/r/914319

Change 914730 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add the VIP settings for the K8s ingress for ml-staging

https://gerrit.wikimedia.org/r/914730

Change 914735 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add conftool and service config for k8s-ingress-ml-staging

https://gerrit.wikimedia.org/r/914735

Change 914786 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] ores-legacy: switch app port to 8080

https://gerrit.wikimedia.org/r/914786

Change 915727 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add env variable to ores-legacy

https://gerrit.wikimedia.org/r/915727

Change 915727 merged by Elukey:

[operations/deployment-charts@master] ml-services: add env variable to ores-legacy

https://gerrit.wikimedia.org/r/915727

Change 917352 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] fix(ores-legacy): filter context based on request

https://gerrit.wikimedia.org/r/917352

Change 917352 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] fix(ores-legacy): filter context based on request

https://gerrit.wikimedia.org/r/917352

Change 918362 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] ores-migration: add more logging when Lift Wing calls fail

https://gerrit.wikimedia.org/r/918362

Change 918362 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ores-migration: add more logging when Lift Wing calls fail

https://gerrit.wikimedia.org/r/918362

Change 918386 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] ores-migration: remove port definition from LIFTWING_URL

https://gerrit.wikimedia.org/r/918386

Change 918386 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ores-migration: remove port definition from LIFTWING_URL

https://gerrit.wikimedia.org/r/918386

Update from T335756:

We can now test the ores-legacy staging endpoint via the following (from any stat100x host for example):

elukey@stat1004:~$ time curl "https://ores-legacy.k8s-ml-staging.discovery.wmnet:31443/v3/scores/enwiki/123433/damaging" -i --http1.1 --resolve ores-legacy.k8s-ml-staging.discovery.wmnet:31443:10.192.0.201
HTTP/1.1 200 OK
date: Wed, 10 May 2023 08:45:12 GMT
server: istio-envoy
content-length: 377
content-type: application/json
x-envoy-upstream-service-time: 186

{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.1"
      }
    }, 
    "scores": {
      "123433": {
        "damaging": {
          "score": {
            "prediction": false, 
            "probability": {
              "false": 0.9899875154315122, 
              "true": 0.010012484568487832
            }
          }
        }
      }
    }
  }
}
real	0m0.301s
user	0m0.017s
sys	0m0.004s

Change 914786 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ores-legacy: switch app port to 8080

https://gerrit.wikimedia.org/r/914786

Change 918415 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] fastapi-app: change the default port settings from 80 to 8080

https://gerrit.wikimedia.org/r/918415

Change 918416 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: bump docker image for ores-legacy

https://gerrit.wikimedia.org/r/918416

Change 918415 merged by Elukey:

[operations/deployment-charts@master] fastapi-app: change the default port settings from 80 to 8080

https://gerrit.wikimedia.org/r/918415

Change 918416 merged by Elukey:

[operations/deployment-charts@master] ml-services: bump docker image for ores-legacy

https://gerrit.wikimedia.org/r/918416

Change 918483 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] fastapi-app: use port 8080 for probes as well

https://gerrit.wikimedia.org/r/918483

Change 918483 merged by Elukey:

[operations/deployment-charts@master] fastapi-app: use port 8080 for probes as well

https://gerrit.wikimedia.org/r/918483

Change 918484 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] fastapi-app: change app's port as well to 8080

https://gerrit.wikimedia.org/r/918484

Change 918484 merged by Elukey:

[operations/deployment-charts@master] fastapi-app: change app's port as well to 8080

https://gerrit.wikimedia.org/r/918484

Changed all the settings to use port 8080 instead of 80, works fine again!

At the moment the error messages returned by ores and ores-legacy are inconsistent.
Example: a call for a non-existent revision id
ORES
curl https://ores.wikimedia.org/v3/scores/enwiki/1234123123132133/damaging -i --http1.1

{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.1"
      }
    },
    "scores": {
      "1234123123132133": {
        "damaging": {
          "error": {
            "message": "RevisionNotFound: Could not find revision ({revision}:1234123123132133)",
            "type": "RevisionNotFound"
          }
        }
      }
    }
  }
}

ores-legacy

{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.1"
      }
    },
    "scores": {
      "1234123123132133": {
        "damaging": {
          "error": {
            "message": "The MW API does not have any info related to the rev-id provided as input (1234123123132133), therefore it is not possible to extract features properly. One possible cause is the deletion of the page related to the revision id. Please contact the ML-Team if you need more info.",
            "type": "Bad Request"
          }
        }
      }
    }
  }
}

ores-legacy returns the message exactly as Lift Wing returns it. We can rewrite the Lift Wing response so that it returns the same message as ORES. At the very least, keeping the same error type (RevisionNotFound) is important for consumers.
Note that both APIs return a 200 response code in most of these cases, where a 404 would probably be more appropriate (although it becomes messy when one requests multiple revisions and only some of them exist).
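
Since the HTTP status is 200 even when a revision cannot be scored, a consumer has to inspect each entry in the body. The following is a minimal client-side sketch, based only on the response layout shown in the examples above; the function name `split_scores` is hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (context, rev_id, model)


def split_scores(response: dict) -> Tuple[Dict[Key, bool], Dict[Key, str]]:
    """Separate successful predictions from per-revision errors."""
    predictions: Dict[Key, bool] = {}
    errors: Dict[Key, str] = {}
    for context, body in response.items():
        for rev_id, models in body.get("scores", {}).items():
            for model, result in models.items():
                key = (context, rev_id, model)
                if "error" in result:
                    # Keep the error type, e.g. "RevisionNotFound".
                    errors[key] = result["error"]["type"]
                else:
                    predictions[key] = result["score"]["prediction"]
    return predictions, errors
```

This is also why a stable error type matters: a client that branches on `RevisionNotFound` would stop working if the type changed to `Bad Request`.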

Prod endpoints up!

elukey@stat1004:~$ time curl "https://ores-legacy.discovery.wmnet:31443/v3/scores/enwiki/123433/damaging" -i --http1.1 
HTTP/1.1 200 OK
date: Wed, 17 May 2023 15:54:01 GMT
server: istio-envoy
content-length: 377
content-type: application/json
x-envoy-upstream-service-time: 135

{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.1"
      }
    }, 
    "scores": {
      "123433": {
        "damaging": {
          "score": {
            "prediction": false, 
            "probability": {
              "false": 0.9899875154315122, 
              "true": 0.010012484568487832
            }
          }
        }
      }
    }
  }
}
real	0m0.168s
user	0m0.013s
sys	0m0.003s

@isarantopoulos I have a proposal to ease the transition from ORES to ORES legacy, lemme know what you think:

  • We add support in ores-legacy to fetch/set scores in the ORES Redis cache. This is the same functionality that ORES exposes: we check how it reads and sets the cache and we implement the same (I believe it should be really straightforward).
  • We set up another change prop stream, similar to revision-score, that hits ores-legacy, to see how cache priming works there. In theory we'll have two identical systems trying to set the cache at the same time, which shouldn't be a huge deal (given how it is implemented). The priming stream for ores-legacy will not generate any stream; it will only be used to see whether ores-legacy can keep up with timings.
  • Finally, when we are happy, we do two things:
      • We point change prop's revision-score config to ores-legacy.
      • We point ores.wikimedia.org to ores-legacy.discovery.wmnet.
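
The fetch/set behaviour proposed in the first bullet is essentially a cache-aside lookup, which can be sketched as follows. This is an illustration only: `DictCache` is an in-memory stand-in for a Redis client, the key layout in `cache_key` is hypothetical (the real ORES key format would need to be checked), and `fetch` stands for whatever function calls Lift Wing.

```python
import json
from typing import Callable, Dict, Optional


class DictCache:
    """In-memory stand-in for a Redis client (string get/set only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def cache_key(context: str, model: str, version: str, rev_id: int) -> str:
    # Hypothetical key layout, not the one ORES actually uses.
    return f"{context}:{model}:{version}:{rev_id}"


def get_score(
    cache: DictCache,
    context: str,
    model: str,
    version: str,
    rev_id: int,
    fetch: Callable[[str, str, int], dict],
) -> dict:
    """Return the cached score if present; otherwise fetch and store it."""
    key = cache_key(context, model, version, rev_id)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    score = fetch(context, model, rev_id)
    cache.set(key, json.dumps(score))
    return score
```

With two systems priming the same cache, the worst case in this scheme is that both compute a score and one overwrites the other with an equivalent value.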

I don't like the above much, but I think it is the only way to allow clients like Wikimedia Enterprise to prepare the migration to Lift Wing without impacting our September deadline. After September we'll follow up with Wikimedia Enterprise and other folks to migrate away from ores-legacy, but at a slower pace.

What do you think? Ideas?

I find the above idea the best compromise at the moment, as I can't think of another way to meet the deadline for ORES deprecation other than adding a cache in ores-legacy.
We have two options:

  • implement a global caching solution for Lift Wing

or

  • implement caching just for the models that WME uses, a temporary solution

Given that a global solution would require a lot of effort, I agree with the above suggestion as a starting point towards a viable solution.
Since we are adding hacks/ad-hoc stuff to ease the transition to Lift Wing, it is cleaner for these solutions to live under the ores-legacy umbrella than to inject tech debt into LW here and there.

Putting together a list of things still missing from the ores-legacy service compared to ORES:

  • Swagger UI: enrich the UI to bring it closer to https://ores.wikimedia.org/v3/
  • Return features in response: when a user makes a request such as https://ores.wikimedia.org/v3/scores/enwiki/123456/goodfaith?features, which asks for features, we will need to adjust the LW call by adding the extended_output flag to the POST body, e.g. {"rev_id": 123456, "extended_output": "True"}
  • model_info query parameters (stats): this relates to the discussion about thresholds. At the moment this is not intended to be supported, and the thresholds used by the extension are hardcoded in mediawiki-config (in an ugly way). If this ends up as a requirement we will need to modify the revscoring model servers accordingly. Apart from stats, we'll need to investigate what other model_info can be exposed and whether it is used.
  • injecting keyword parameter: this parameter allows injecting a JSON payload of pre-computed datasources/features to be used for scoring. No support for this has been discussed yet (and it also seems difficult to implement in LW).
  • LW exceptions and error handling: document LW errors and rewrite them to match the expected ORES errors (e.g. when searching for revisions that don't exist).
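
To illustrate the "return features" item, the sketch below builds the Lift Wing POST body from an ORES request URL and adds `extended_output` only when the `features` query parameter is present. The function name `build_payload` is hypothetical. The string value `"True"` follows the example body quoted in the list above.

```python
from urllib.parse import parse_qs, urlsplit


def build_payload(ores_url: str, rev_id: int) -> dict:
    """Build the Lift Wing POST body for an ORES request URL."""
    # keep_blank_values=True so that a bare "?features" (no "=value") is kept.
    query = parse_qs(urlsplit(ores_url).query, keep_blank_values=True)
    payload = {"rev_id": rev_id}
    if "features" in query:
        payload["extended_output"] = "True"
    return payload
```

Without `keep_blank_values=True`, `parse_qs` drops parameters that have no value, so a bare `?features` would go unnoticed.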

Change 928055 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] ores-legacy: return features in response

https://gerrit.wikimedia.org/r/928055

Change 928055 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ores-legacy: return features in response

https://gerrit.wikimedia.org/r/928055

Change 929743 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] feat: add Response Models in ores-legacy API

https://gerrit.wikimedia.org/r/929743

We'll need to check which of the following errors we need to support (possibly all of them):
https://github.com/wikimedia/revscoring/blob/master/revscoring/errors.py
For now I am adding a patch that transforms the LW response for a missing rev id to match that of ORES, which means changing this response

{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.1"
      }
    }, 
    "scores": {
      "123122329371": {
        "damaging": {
          "error": {
            "message": "The MW API does not have any info related to the rev-id provided as input (123122329371), therefore it is not possible to extract features properly. One possible cause is the deletion of the page related to the revision id. Please contact the ML-Team if you need more info.", 
            "type": "Bad Request"
          }
        }
      }
    }
  }
}

to this

{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.1"
      }
    }, 
    "scores": {
      "123122329371": {
        "damaging": {
          "error": {
            "message": "RevisionNotFound: Could not find revision ({revision}:123122329371)", 
            "type": "RevisionNotFound"
          }
        }
      }
    }
  }
}
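
The rewrite shown above can be sketched as a small function applied to each error object. This is an assumption-laden illustration rather than the merged patch: matching the Lift Wing error by a substring of its message (`LW_MISSING_REV_MARKER`) and the name `to_ores_error` are choices made for this example only.

```python
# Assumption: a missing revision is recognised by this fragment of the
# Lift Wing message quoted above.
LW_MISSING_REV_MARKER = "does not have any info related to the rev-id"


def to_ores_error(rev_id: str, lw_error: dict) -> dict:
    """Map a Lift Wing missing-revision error to the ORES error shape."""
    if LW_MISSING_REV_MARKER in lw_error.get("message", ""):
        return {
            # "{{revision}}" renders the literal "{revision}" that ORES returns.
            "message": f"RevisionNotFound: Could not find revision ({{revision}}:{rev_id})",
            "type": "RevisionNotFound",
        }
    # Any other error is passed through unchanged.
    return lw_error
```

Matching on message text is brittle; if Lift Wing exposed a structured error code, branching on that would be a safer choice.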

Change 930166 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] ores-legacy: Change message in RevisionNotFound error

https://gerrit.wikimedia.org/r/930166

I have also added some example responses in a json file. These are defined as examples in FastAPI endpoints and are used to show sample responses in the Swagger UI (under https://YOUR_CUSTOM_URL/docs)

Change 930166 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ores-legacy: Change message in RevisionNotFound error

https://gerrit.wikimedia.org/r/930166

Change 929743 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] feat: add Response Models in ores-legacy API

https://gerrit.wikimedia.org/r/929743

Change 933866 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::cache::text: add ores-legacy.wikimedia.org

https://gerrit.wikimedia.org/r/933866

Change 933869 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/dns@master] Add ores-legacy.wikimedia.org

https://gerrit.wikimedia.org/r/933869

Change 933870 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: add extra SANs to the ores-legacy's TLS config for ml-serve

https://gerrit.wikimedia.org/r/933870

Change 933870 merged by Elukey:

[operations/deployment-charts@master] admin_ng: add extra SANs to the ores-legacy's TLS config for ml-serve

https://gerrit.wikimedia.org/r/933870

Change 933874 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: add new FQDNs to ores-legacy's ingress config

https://gerrit.wikimedia.org/r/933874

Change 933874 merged by Elukey:

[operations/deployment-charts@master] ml-services: add new FQDNs to ores-legacy's ingress config

https://gerrit.wikimedia.org/r/933874

Change 933866 merged by Elukey:

[operations/puppet@production] role::cache::text: add ores-legacy.wikimedia.org

https://gerrit.wikimedia.org/r/933866

Change 933869 merged by Elukey:

[operations/dns@master] Add ores-legacy.wikimedia.org

https://gerrit.wikimedia.org/r/933869

Change 934343 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] ores-legacy: fix markdown in docs and remove root

https://gerrit.wikimedia.org/r/934343

Change 934343 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] ores-legacy: fix markdown in docs and remove root

https://gerrit.wikimedia.org/r/934343

We added https://ores-legacy.wikimedia.org/, and the idea is to eventually add a CNAME ores.wikimedia.org -> ores-legacy.wikimedia.org, which contains a big banner asking users to start using Lift Wing.