Merge articletopic outlink model transformer and predictor pods
Closed, Resolved, Public

Description

As an ML engineer,
I would like to merge the transformer and predictor steps that make up the outlink articletopic model, so that the service is declared in a single file like our other services.

In T287056: Deploy Outlinks topic model to production, we deployed the articletopic outlink model using the transformer-predictor paradigm from KServe. This paradigm works well when we want to generalize and reuse the same transformers across multiple services or models.
However, since there is no such reuse here, we would like to merge the two into one step. This would result in a single-file declaration of the KServe model, in the same way as in the other services that we run.
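
A minimal sketch of what the merged step might look like, assuming the model class follows the `preprocess`/`predict` method layout of KServe's Python `Model` class. The class below is plain Python rather than a `kserve.Model` subclass so that it runs standalone; `fetch_outlinks`, `score_topics`, the `page_title` request field, and the topic labels are hypothetical stand-ins, not the real feature lookup, model call, or request schema.

```python
from typing import Callable, Dict, List


class OutlinkTopicModel:
    """Single-pod sketch: preprocessing and prediction live in one class."""

    def __init__(
        self,
        fetch_outlinks: Callable[[str], List[str]],
        score_topics: Callable[[List[str]], Dict[str, float]],
        threshold: float = 0.5,
    ) -> None:
        # Both callables are hypothetical stand-ins injected for illustration.
        self.fetch_outlinks = fetch_outlinks
        self.score_topics = score_topics
        self.threshold = threshold

    def preprocess(self, request: Dict[str, str]) -> Dict[str, object]:
        # Formerly the transformer pod's job: turn the request into features.
        title = request["page_title"]
        return {"page_title": title, "outlinks": self.fetch_outlinks(title)}

    def predict(self, features: Dict[str, object]) -> Dict[str, object]:
        # Formerly the predictor pod's job: score the features.
        scores = self.score_topics(features["outlinks"])
        topics = sorted(t for t, s in scores.items() if s >= self.threshold)
        return {"page_title": features["page_title"], "topics": topics}

    def handle(self, request: Dict[str, str]) -> Dict[str, object]:
        # One in-process call chain replaces the pod-to-pod network hop.
        return self.predict(self.preprocess(request))


def fake_fetch(title: str) -> List[str]:
    # Hypothetical outlinks; a real service would look these up.
    return ["Physics", "Albert Einstein"]


def fake_score(outlinks: List[str]) -> Dict[str, float]:
    # Hypothetical scores; a real service would call the trained model.
    return {"STEM.Physics": 0.9, "Culture.Sports": 0.1}


model = OutlinkTopicModel(fake_fetch, fake_score)
result = model.handle({"page_title": "Relativity"})
print(result)
```

Keeping both stages as separate methods preserves the logical split even though they now share a process, which should make it straightforward to test each stage on its own.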

Event Timeline

BWojtowicz-WMF changed the task status from Open to In Progress. Sep 12 2025, 8:20 AM
BWojtowicz-WMF claimed this task.

Change #1187739 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[machinelearning/liftwing/inference-services@main] outlink-topic-model: Merge transformer and predictor pods.

https://gerrit.wikimedia.org/r/1187739

Change #1187752 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[integration/config@master] inference-services: Remove the outlink-transformer CI job.

https://gerrit.wikimedia.org/r/1187752

Sharing here a summary of recent IRC discussions about how this change will be rolled out.
The issue is that CI will fail if we remove the transformer's Blubber and code definitions, because the transformer image still has test and publish pipelines that depend on them.
So we decided to do the following (pasted directly from IRC):

  1. leave the transformer as is and add the preprocessing functionality in the predictor part
  2. deploy it and test it using only the predictor pod
  3. once we know everything is ok, clean up the transformer image along with the CI jobs

Change #1187739 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] outlink-topic-model: Merge transformer and predictor pods.

https://gerrit.wikimedia.org/r/1187739

Change #1189839 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Update the articletopic model on staging.

https://gerrit.wikimedia.org/r/1189839

Change #1189839 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Update the articletopic model on staging.

https://gerrit.wikimedia.org/r/1189839

Change #1189850 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Remove the transformer pod for articletopic on staging.

https://gerrit.wikimedia.org/r/1189850

Change #1189850 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Remove the transformer pod for articletopic on staging.

https://gerrit.wikimedia.org/r/1189850

Change #1189871 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update articletopic in prod and remove trasnsformer

https://gerrit.wikimedia.org/r/1189871

Change #1189871 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update articletopic in prod and remove trasnsformer

https://gerrit.wikimedia.org/r/1189871

Change #1190185 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[operations/deployment-charts@master] ml-services: Remove the transformer config from articletopic staging.

https://gerrit.wikimedia.org/r/1190185

Change #1190185 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: Remove the transformer config from articletopic staging.

https://gerrit.wikimedia.org/r/1190185

In https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1187739, we've combined the transformer and predictor logic into a single pod. Now, the full processing is done by a single predictor pod.

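We've deployed the change on the staging cluster and run a load test against the new deployment. The results show lower median and average response times than the previous setup (120 vs. 160 and about 163 vs. 206), although the maximum response time was higher in the new run (1066 vs. 479):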

Load Test on new deployment

   Type Name                 Request Count Failure Count Median Response Time Average Response Ti… Min Response Time Max Response Time Average Content Size Requests/s Failures/s 50% 66% 75% 80% 90% 95% 98% 99%  99.9% 99.99% 100%
1  POST /v1/models/outlink-… 74            0             120                  163.                 70.1              1066.             282.                 0.64       0          120 150 180 220 310 340 460 1100 1100  1100   1100
2  NA   Aggregated           74            0             120                  163.                 70.1              1066.             282.                 0.64       0          120 150 180 220 310 340 460 1100 1100  1100   1100

Load Test on old deployment

   Type Name                 Request Count Failure Count Median Response Time Average Response Ti… Min Response Time Max Response Time Average Content Size Requests/s Failures/s 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100%
1  POST /v1/models/outlink-… 74            0             160                  206.                 83.0              479.              277.                 0.625      0          170 220 290 310 350 450 470 480 480   480    480
2  NA   Aggregated           74            0             160                  206.                 83.0              479.              277.                 0.625      0          170 220 290 310 350 450 470 480 480   480    480
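
The size of the change can be worked out from the two tables above. The sketch below computes the relative change for a few columns; the figures are copied from the tables as displayed (the averages appear truncated to three significant digits), so the percentages are approximate.

```python
# Response-time figures copied from the "Aggregated" rows of the two tables.
old = {"median": 160.0, "average": 206.0, "p95": 450.0, "max": 479.0}
new = {"median": 120.0, "average": 163.0, "p95": 340.0, "max": 1066.0}


def relative_change(before: float, after: float) -> float:
    """Percent change from before to after; negative means faster."""
    return (after - before) / before * 100.0


changes = {name: round(relative_change(old[name], new[name]), 1) for name in old}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

With only 74 requests per run, the 99th percentile and the maximum each rest on a single sample, so the median and average are probably the more reliable basis for comparison.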

The next step is deployment to production.


We've also experienced an issue with KServe. After we removed the transformer component from the values.yaml file of the articletopic model deployment, the KServe controller kept recreating the latest available revision of the transformer component.
For now, we were able to fix it by re-creating the InferenceService resource entirely. However, more in-depth exploration is needed to fully understand why KServe does not delete the removed component.
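
To make the intended end state concrete, here is a minimal sketch of the spec change, modelling the InferenceService manifest as a Python dict. `predictor` and `transformer` are the KServe component keys under `spec`; the resource name and image strings are hypothetical placeholders, and the sketch does not reproduce the controller behaviour described above.

```python
import copy
from typing import Any, Dict, List


def without_transformer(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the manifest with spec.transformer removed."""
    merged = copy.deepcopy(manifest)
    merged["spec"].pop("transformer", None)
    return merged


def components(manifest: Dict[str, Any]) -> List[str]:
    """List the component keys declared under spec."""
    return sorted(manifest["spec"].keys())


old_manifest = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "example-articletopic"},  # hypothetical name
    "spec": {
        # Image names below are placeholders, not the real images.
        "predictor": {"containers": [{"name": "kserve-container", "image": "example/predictor"}]},
        "transformer": {"containers": [{"name": "kserve-container", "image": "example/transformer"}]},
    },
}

new_manifest = without_transformer(old_manifest)
print(components(old_manifest), "->", components(new_manifest))
```

The expectation is that applying the second manifest leaves only predictor pods running; in our case that held only after the resource was re-created.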

Change #1190570 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[machinelearning/liftwing/inference-services@main] articletopic: Remove the transformer code.

https://gerrit.wikimedia.org/r/1190570

Change #1190571 had a related patch set uploaded (by Bartosz Wójtowicz; author: Bartosz Wójtowicz):

[machinelearning/liftwing/inference-services@main] articletopic: Update locust test results.

https://gerrit.wikimedia.org/r/1190571

The merged architecture has been deployed on both staging and production clusters. It's also been tested by sending requests manually and verifying the responses are correct.

The remaining work includes:

  1. Cleaning up the transformer CI jobs from the integration/config repo: https://gerrit.wikimedia.org/r/c/integration/config/+/1187752
  2. Removing the transformer code from the inference-services repo: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1190570
  3. Updating the locust results for the model: https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/1190571

Change #1187752 merged by jenkins-bot:

[integration/config@master] inference-services: Remove the outlink-transformer CI job.

https://gerrit.wikimedia.org/r/1187752

Change #1190570 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] articletopic: Remove the transformer code.

https://gerrit.wikimedia.org/r/1190570

Change #1190571 abandoned by Bartosz Wójtowicz:

[machinelearning/liftwing/inference-services@main] articletopic: Update locust test results.

Reason:

New locust tests have already been uploaded when adding `page_id` to the model.

https://gerrit.wikimedia.org/r/1190571