
Migrate Machine-generated Article Descriptions from Toolforge to Lift Wing.
Open, Low, Public

Description

The Apps team would like to migrate https://ml-article-descriptions.toolforge.org/ off of Toolforge into a more production-grade setting, and it's been suggested that Lift Wing should be the destination.

What use case is the model going to support/resolve?
Android app users currently have an entry point for adding Wikidata descriptions to Wikipedia articles that are missing them. This model provides recommended descriptions, simplifying the process and helping codify norms around what these descriptions should look like. More background: https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/Android/Machine_Assisted_Article_Descriptions

Do you have a model card? If you don't know what it is, please check https://meta.wikimedia.org/wiki/Machine_learning_models.
Yes: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Article_descriptions

What team created/trained/etc.. the model? What tools and frameworks have you used?
A group of external researchers and frequent collaborators from EPFL trained the model. More details in the paper (arxiv). Essentially they're using a modified transformers library to merge an mBART model that takes article paragraphs as input with an mBERT model that takes existing article descriptions from other languages as input (paper overview). NOTE: some details in the paper, such as the Wikidata knowledge graph embeddings, were not used in the deployed model.

Here's their raw code: https://github.com/epfl-dlab/transformers-modified (and a more general repo that also includes code for training)
And the API code that uses the model and has a copy of the modified transformers library: https://github.com/geohci/transformers-modified/blob/main/artdescapi/wsgi_template.py
The raw PyTorch model binary and supporting config etc. can be found here: https://drive.google.com/file/d/1bhn5O2WW6uXo4UvKDFoHqQnc0ozCCXmi/view?usp=sharing

What kind of data was the model trained with, and what kind of data the model is going to need in production (for example, calls to internal/external services, special datasources for features, etc..) ?

  • Input: Wikipedia article
  • Features: first paragraph of that article (if it exists -- not strictly necessary), first paragraph of the article in any of the 24 other languages supported by the model, existing article descriptions in any of the other 24 languages.
  • Output: k suggested article descriptions, where k is an adjustable parameter; we've found that generating the top 3 and recommending the top 2 of those best balances utility, quality, and diversity (a request/response sketch follows below).
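For reference, here is a minimal sketch of the request/response contract over HTTP. The URL and field names mirror the curl examples further down this task and should be treated as illustrative rather than a stable API.

# Minimal sketch of the request/response contract, mirroring the curl examples
# further down this task. The URL and field names are illustrative assumptions,
# not a stable API.
import requests

API_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/article-descriptions:predict"

payload = {
    "lang": "en",           # Wikipedia language code supported by the model
    "title": "Clandonald",  # article title on that wiki
    "num_beams": 3,         # k: number of candidate descriptions to generate
}

resp = requests.post(API_URL, json=payload, timeout=60)
resp.raise_for_status()
data = resp.json()

# "prediction" holds the suggested descriptions in ranked order; "blp" flags
# biographies of living people (used for the guardrails mentioned below).
print(data["prediction"], data["blp"])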

If you have a minimal codebase that you used to run the first tests with the model, could you please share it?

State what team will own the model and please share the main points of contact (see more info in Ownership of a model).
Joseph Seddon

What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc.. to respond to queries? How does it react when 1/10/20/etc.. requests in parallel are made? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss next steps!
Generally 3-4 seconds, depending on the input. You can test it with various inputs at this UI for the Cloud VPS endpoint and see some latency stats in this notebook. Latency is highly dependent on how much content exists -- e.g., articles with many language editions and existing article descriptions are a good bit slower (upwards of 10+ seconds) than articles that only exist in a single language (1-2 seconds). The UI tolerates a second or two of latency, but any speed-ups would be welcome, and there might be ways to downsample the languages considered for highly multilingual topics to reduce some of the worst-case latencies (a rough sketch of that idea follows below). A GPU would obviously speed things up, but it's also possible that something like CTranslate2 could be applied, or simply that Lift Wing has better hardware for this use case than Cloud Services and it won't be an issue. These highly multilingual articles often already have article descriptions, so they are less likely to need the tool anyway.
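To illustrate the downsampling idea (the function name, the cap, and the "most content first" heuristic below are hypothetical, not part of the deployed code):

# Hypothetical sketch of capping the number of languages fed to the model for
# highly multilingual articles, to bound worst-case latency. The function name,
# the cap, and the "most content first" heuristic are illustrative assumptions,
# not the tool's actual behavior.
def downsample_languages(first_paragraphs, descriptions, target_lang, max_langs=8):
    """Keep at most max_langs languages, always including the target language."""
    candidates = sorted(
        set(first_paragraphs) | set(descriptions),
        key=lambda lang: len(first_paragraphs.get(lang, "")) + len(descriptions.get(lang, "")),
        reverse=True,
    )
    keep = set([target_lang] + [l for l in candidates if l != target_lang][:max_langs - 1])
    return (
        {l: p for l, p in first_paragraphs.items() if l in keep},
        {l: d for l, d in descriptions.items() if l in keep},
    )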

Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model and what was the dataset size?
At this point, there hasn't been discussion around re-training. Fine-tuning is certainly possible but I think not too urgent given that there's a pretty large dataset of existing Wikidata descriptions that the model was trained on and I don't foresee a massive amount of data drift. This repo has some details on training.

Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
A few analyses:

  • Initial potential harm exploration that led us to institute guardrails around which editors will have access to recommendations for biographies of living people.
  • The tool was piloted for several weeks in early 2023 and the edits made with the tool were evaluated by a number of volunteers across different languages (slide deck).

Anything else that is relevant in your opinion :)

  • If I foresee a potential engineering challenge, it's that the model inference code currently depends on a modified, static snapshot of the transformers repo. Long-term, that might not be desirable as it would be hard to keep it up-to-date with improvements made to the broader transformers library. I'm not sure how feasible it is, but it might be worth considering whether it's possible to convert it from a modified snapshot of the transformers library to something more like a wrapper around it.
  • An additional issue discovered during this testing is that the model will occasionally "hallucinate" dates of birth for people. This stems from a difficulty these language models have in handling numbers (they tend to tokenize them as single digits and so have trouble reasoning across things like dates) as well as the source of training data (TextExtracts), which removes dates from the article lead and thus likely makes it more difficult for the model to learn how to handle them. I'm open to a discussion on how to handle this, but after chatting with the EPFL researchers about it, I personally think it'd be easy and reasonable to add a simple filter that removes any recommendation containing a date that is not seen in the input paragraph / description data (a rough sketch of such a filter follows below).
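To make that concrete, here is a rough sketch of what such a filter could look like. The function name, the regex, and the "years only" scope are illustrative assumptions, not existing code; a real filter might cover full dates as well.

# Rough sketch of the proposed post-processing filter: drop any recommended
# description that mentions a 4-digit year never seen in the source paragraphs
# or descriptions. The regex and the "years only" scope are simplifying
# assumptions.
import re

YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def filter_hallucinated_dates(recommendations, source_texts):
    seen_years = set()
    for text in source_texts:
        seen_years.update(YEAR_RE.findall(text))
    return [
        rec for rec in recommendations
        if all(year in seen_years for year in YEAR_RE.findall(rec))
    ]

# Example: "French painter (born 1962)" is dropped if 1962 never occurs in the
# source texts, while "French painter" is kept.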

Details

Repo | Branch | Lines +/-
operations/deployment-charts | master | +1 -3
machinelearning/liftwing/inference-services | main | +6 -0
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +1 -1
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +0 -1
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +2 -1
operations/deployment-charts | master | +9 -1
operations/deployment-charts | master | +65 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +11 -6
operations/deployment-charts | master | +19 -0
machinelearning/liftwing/inference-services | main | +35 -32
operations/deployment-charts | master | +19 -0
machinelearning/liftwing/inference-services | main | +530 -0
integration/config | master | +15 -0

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 970831 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: add article-descriptions model server

https://gerrit.wikimedia.org/r/970831

Change 975929 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add article-descriptions isvc to experimental namespace

https://gerrit.wikimedia.org/r/975929

Change 975929 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add article-descriptions isvc to experimental namespace

https://gerrit.wikimedia.org/r/975929

Change 975936 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: update model-server to use local files only

https://gerrit.wikimedia.org/r/975936

Change 975936 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: update model-server to use local files only

https://gerrit.wikimedia.org/r/975936

Change 976960 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/976960

Change 976960 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/976960

Change 976965 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: update wikidata host header in model-server

https://gerrit.wikimedia.org/r/976965

Change 977226 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: update wiki host headers in model-server

https://gerrit.wikimedia.org/r/977226

Change 977226 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: update wiki host headers in model-server

https://gerrit.wikimedia.org/r/977226

Change 977234 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977234

Change 977234 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977234

Change 977714 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977714

Change 977714 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977714

Change 978059 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-serve/istio: Add Restbase as a handled destination for requests from LW

https://gerrit.wikimedia.org/r/978059

Change 978059 merged by jenkins-bot:

[operations/deployment-charts@master] ml-serve/istio: Add Restbase as a handled destination for requests from LW

https://gerrit.wikimedia.org/r/978059

Change 978098 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-serve/istio: fix wrong port in destination rule for restgw

https://gerrit.wikimedia.org/r/978098

Change 978098 merged by jenkins-bot:

[operations/deployment-charts@master] ml-serve/istio: fix wrong port in destination rule for restgw

https://gerrit.wikimedia.org/r/978098

Change 976965 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: fix wikipedia api summary endpoint

https://gerrit.wikimedia.org/r/976965

Change 978168 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978168

Change 978168 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978168

Change 978170 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: remove host header from rest-gateway endpoint

https://gerrit.wikimedia.org/r/978170

Change 978170 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: remove host header from rest-gateway endpoint

https://gerrit.wikimedia.org/r/978170

Change 978171 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978171

Change 978171 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978171

Change 978542 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] article-descriptions: use a dedicated aiohttp session for rest-gateway

https://gerrit.wikimedia.org/r/978542

Change 978542 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: use a dedicated aiohttp session for rest-gateway

https://gerrit.wikimedia.org/r/978542

Change 978651 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978651

Change 978651 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978651

Change 979042 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-services/artfcle-description: set OMP_NUM_THREADS=1

https://gerrit.wikimedia.org/r/979042

Change 979042 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: article-description set OMP_NUM_THREADS=1

https://gerrit.wikimedia.org/r/979042

The article-descriptions model-server has been deployed in the LiftWing experimental namespace. It is currently available through an internal endpoint that can only be accessed by tools that run within the WMF infrastructure:

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.038863420486450195,"total network (s)":0.367002010345459,"model (s)":14.40719723701477,"total (s)":14.774216890335083},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real	0m14.815s
user	0m0.013s
sys	0m0.000s

@Isaac and @Seddon please test it and let us know of any edge cases you may come across. Once you have confirmed that there are none, we shall prepare to move it to production and provide an external endpoint.

Thanks @kevinbazira ! Awesome to see this working! A bug I uncovered is described below, followed by a few thoughts:

Bug

Any idea what's going on with this one? It seems to work fine with my Cloud VPS hosted API: https://ml-article-descriptions.toolforge.org/?lang=fr&title=Pachystomias%20microdon

isaacj@stat1008:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "fr", "title": "Pachystomias microdon", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"error":"AttributeError : 'NoneType' object has no attribute 'shape'"}
real	0m0.728s
user	0m0.030s
sys	0m0.005s

If you need another example, I also saw it with an Arabic article (these were all just randomly chosen, so it's happening semi-frequently, at least outside of English, where I haven't observed it yet). Here's a link to the expected output.

isaacj@stat1008:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "ar", "title": "نفيع (النادرة)", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"error":"AttributeError : 'NoneType' object has no attribute 'shape'"}
real	0m0.369s
user	0m0.023s
sys	0m0.013s

Thoughts

  • We'll want to use three beams (but only use the first two outputs), as we've been doing in the pilot.
  • Latency seems to vary from 10-20 seconds from what I've observed with my examples (but up to 45 seconds when I tried en:Philosophy), which I presume is going to be too slow (@Seddon maybe you have thoughts on what it should be). I generally see latency on the order of 2-3 seconds when I use my Cloud VPS hosted model, which has 8 VCPUs and 16GB RAM (not sure how much of that is required, but it's certainly processing the model much faster). This was fine for our pilot but I know it's still not ideal because it required some waiting on the user end. What options do we have for speeding up? More CPU/RAM, or is there a GPU available?
  • We're returning a fairly verbose response right now because it was useful for debugging etc. It shouldn't really affect latency and it'll max out at like 30KB probably but it's more info than is needed by default. @Seddon perhaps you could advise us on what the Android app actually wants and then we only include the rest if a debug parameter is explicitly passed?

As far as latency, our goal for features in the app is 500 milliseconds. I'd say anything over 3 seconds is too slow. When we ran the experiment, the introduction of machine edits increased time spent by 2.4 seconds. In the experiment it took folks 24 seconds to gain context on the feature, choose a machine suggestion, review, and publish. 20 seconds just to load is pretty slow.

We're returning a fairly verbose response right now because it was useful for debugging etc. It shouldn't really affect latency and it'll max out at like 30KB probably but it's more info than is needed by default. @Seddon perhaps you could advise us on what the Android app actually wants and then we only include the rest if a debug parameter is explicitly passed?

The Android app only makes use of the prediction and blp fields from the response. I think we should still keep the lang and title fields (in case the API returns an array of such responses in the future), but all the other fields can be considered debugging-only.
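For illustration, stripping the debugging-only fields server-side could look roughly like this. The default field set below is an assumption based on this comment, not the final implementation.

# Rough sketch of stripping the debugging-only fields from the model-server
# response unless the client asks for them. The default field set is an
# assumption based on this comment.
DEFAULT_FIELDS = {"lang", "title", "blp", "prediction"}

def format_response(full_response: dict, debug: bool = False) -> dict:
    """Return everything in debug mode, otherwise only the fields the app needs."""
    if debug:
        return full_response
    return {k: v for k, v in full_response.items() if k in DEFAULT_FIELDS}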

Regarding latency: as things are at the moment, dropping down to ms-level latency is not possible for a model of this size. The base model used is ~5GB. We can revisit and try to hit this kind of latency when we have GPUs in Lift Wing production in some months.
For the time being we can try to reduce serving times with the following, or a combination of them:

  • Increase CPU and memory resources for the current deployment.
  • Load a quantized version of the model, using 8-bit integers instead of 16-bit floats, in order to cut the model size roughly in half.
  • Try translating the model for optimized inference using a library like CTranslate2 or something similar.
  • Find a smaller model than mbart-large.

For options 2 and 4 we need to validate that any drop in output quality is acceptable (a rough sketch of option 2 follows below).
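As a rough illustration of option 2, dynamic int8 quantization with stock PyTorch could look like the following. This assumes loading a standard Hugging Face MBart checkpoint; the deployed service uses a modified transformers snapshot, so the load step would differ.

# Rough sketch of option 2: dynamically quantize the Linear layers to int8 to
# shrink the model and speed up CPU inference. Loading via stock transformers
# and the checkpoint name are assumptions.
import torch
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# quantized_model.generate(...) is a drop-in replacement, but its outputs need
# to be re-validated against the unquantized model (see the caveat above).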

Change 982407 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] article-descriptions: set OMP_NUM_THREADS automatically

https://gerrit.wikimedia.org/r/982407


Parity is definitely a minimum goal here for the time being, as mentioned by Jazmin above. It would be good to understand what is causing the performance gap between Lift Wing and Cloud VPS.

Change 982407 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: set OMP_NUM_THREADS automatically

https://gerrit.wikimedia.org/r/982407

Change 982803 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-service: deploy new Docker image for article-descriptions

https://gerrit.wikimedia.org/r/982803

Change 982803 merged by Elukey:

[operations/deployment-charts@master] ml-services: deploy new Docker image for article-descriptions

https://gerrit.wikimedia.org/r/982803

@Seddon What is the anticipated load/traffic for this model? It is important for us to know some ballpark numbers in order to craft a deployment strategy. Even some high level statistics would do at this moment.

Thank you all for sharing your feedback. I'd like to provide an update on the progress made in addressing the requests and issues raised by the Research and Android teams during testing of the article-descriptions inference service. This service is currently hosted on staging in the experimental namespace on LiftWing.

1. Prediction Bug Fixed:

@Isaac reported a prediction bug in T343123#9380779. The ML team investigated the cause of this issue, the Research team provided a fix, and we deployed it in: T352750.

2. API Response Fields Reduced:

@Dbrant specified the fields used by the Android app and said the others could be reserved for a debug mode (T343123#9384441). We added a debug mode to the article-descriptions model-server: when it is activated by setting the debug flag to 1 (i.e., {"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}), all 8 API response fields are returned; otherwise, only 5 of the 8 are returned, as detailed in: T352959.

3. Model-Server Response Performance Optimized:

@JTannerWMF specified that latency should be between 500ms-3s in T343123#9381050 and @Seddon wanted to understand the cause of the performance gap between LiftWing and the CloudVPS instance in T343123#9399941.

We worked on optimizing the response performance and reduced the response time for the original request from ~14s to ~4s without affecting prediction quality, as shown in: T353127.

In T353127#9416933, @Isaac clarified that despite LiftWing matching the env resources of the CloudVPS instance, the latency is not the same because the CloudVPS API is running an older version of the Descartes library.

4. Load Test Results Shared:

After optimizing the response time for a single request, we ran load tests to measure the number of multiple parallel requests the article-descriptions inference service could handle effectively. With the current setup where the model-server is hosted on 1 pod in the experimental namespace, it can handle a maximum of 20 requests per second as shown in T353952.

5. Next Steps:

We asked the Android team for an estimate of the expected load in T343123#9420718. Once we receive the estimate, we will allocate the necessary resources to meet it. If there are no further issues to address, we will then move the model-server to production.

@kevinbazira & @Isaac - We are still outside the latency bounds set by @JTannerWMF and so I'm left wondering if there is anything that can be done to reduce response latency any further?

@Seddon, in T353127 we were able to make significant improvements in response latency. For example, in T353127#9398823, there was a request that initially had a 14s response time. With subsequent optimization efforts, we managed to reduce this to 4s, as seen in T353127#9421055. This reduction was achieved by exceeding the Cloud VPS instance's CPU and memory resources and by using CPU core pinning. Neither of these methods affected prediction quality.

In T353127#9412764, we used a different method called dynamic quantization, which reduced the response time to 2s. However, this method also slightly changed the prediction output. @Isaac's input in T353127#9412894 emphasized the importance of maintaining consistency with the outputs validated during the human evaluation. Given that dynamic quantization would keep us within the latency bounds set by @JTannerWMF, we need to weigh the trade-off between latency and predictions matching Cloud VPS. Would you like us to proceed with dynamic quantization?

As part of the Lift Wing expansion we are on track to have GPUs installed in the next months (I'd say 2-3 months, depending on how procurement goes).
Following @Isaac's suggestion, I ran the model on an MI100 AMD GPU we already have on ml-staging, and latency drops under 2s (under 1s in many cases) for the requests mentioned earlier in this task. This is just from utilizing the GPU, without any of the further inference optimizations that could bring latencies even lower.
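For reference, the only code-level change needed to use the GPU is moving the model and inputs to the device; PyTorch ROCm builds expose AMD GPUs like the MI100 through the usual "cuda" device. The checkpoint and input text below are placeholders, not the deployed service's code.

# Illustrative sketch of GPU inference on a ROCm build of PyTorch. The AMD GPU
# is addressed through the usual "cuda" device; checkpoint and input text are
# placeholders only.
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50").to(device)
model.eval()

inputs = tokenizer("Clandonald is a hamlet in central Alberta, Canada.", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, num_beams=3, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))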

I ran a load test on the samples previously gathered by Kevin. Keep in mind that these results are not to be compared with previous test results, but they are comparable with each other:

Type | Name | # reqs | # fails | Avg (ms) | Min (ms) | Max (ms) | Med (ms) | req/s | failures/s
POST | /v1/models/article-descriptions:predict GPU | 20 | 0 (0.00%) | 2845 | 957 | 4982 | 2700 | 0.17 | 0.00
POST | /v1/models/article-descriptions:predict CPU | 11 | 0 (0.00%) | 7453 | 4937 | 11474 | 7200 | 0.10 | 0.00
GET | /article Cloud VPS (https://ml-article-description-api.wmcloud.org) | 14 | 0 (0.00%) | 5473 | 4320 | 7639 | 5000 | 0.12 | 0.00

The following screenshots are taken from the inference services Grafana dashboard for the CPU and GPU, respectively.

Results with the CPU LiftWing version:

CPU_latencies.png (1×2 px, 298 KB)

Results with the GPU LiftWing version:

GPU_latencies.png (1×2 px, 287 KB)

Some thoughts:

  • The 75th and 99th percentiles in the predict phase demonstrate a ~10x speed-up with the use of the GPU, which can be further improved
  • the main bottleneck is the preprocess step (fetching all the data from mediawiki & wikidata etc.) so we can focus on improving this step.

@Isaac shall I use the test set you previously used to establish a standard procedure for load testing?

This is very useful (and exciting) data -- thank you @isarantopoulos !

shall I use the test set you previously used to establish a standard procedure for load testing?

Looking at Kevin's set, it seems like a reasonable random sample that covers the languages. For the purpose of guiding the conversation, I think that set should be just fine.

the main bottleneck is the preprocess step (fetching all the data from mediawiki & wikidata etc.) so we can focus on improving this step.

I'm curious about this: I never see it go above 0.5 seconds with my sample and the Cloud VPS endpoint. Is it much different on LiftWing?

@kevinbazira and @isarantopoulos

The Apps team would like to move forward with going to production, with the understanding that we would look to leverage GPU time once the cards are available.

I would appreciate hearing more about @Isaac's query about the origin of the disparity in the preprocessing step between Cloud VPS and Lift Wing, and what you think might be achievable there.

We're looking into the increased preprocessing times that we reported to check whether this has to do with specific load requests, a Lift Wing network issue when connecting to the APIs, or something else. Then we can continue to push this into production and do some further load testing over there.

After making a change to the way the service does preprocessing (increasing concurrency), we see that performance is now on par with the Cloud VPS instance (if not slightly better); a rough sketch of the kind of change follows after the table and screenshot below.

Type | Name | # reqs | # fails | Avg (ms) | Min (ms) | Max (ms) | Med (ms) | req/s | failures/s
POST | /v1/models/article-descriptions:predict CPU | 16 | 0 (0.00%) | 4616 | 3280 | 7050 | 4500 | 0.13 | 0.00
GET | /article Cloud VPS (https://ml-article-description-api.wmcloud.org) | 14 | 0 (0.00%) | 5473 | 4320 | 7639 | 5000 | 0.12 | 0.00

Screenshot 2024-02-23 at 10.59.14 PM.png (1×4 px, 400 KB)
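For illustration, the kind of concurrency change described above could look roughly like this; the URLs and helper names are illustrative, not the actual inference-service code.

# Rough sketch of concurrent preprocessing: fetch the per-language paragraphs
# and Wikidata descriptions in parallel instead of one after another, so total
# wall time is bounded by the slowest call rather than the sum of all calls.
# URLs and helper names are illustrative assumptions.
import asyncio
import aiohttp

async def fetch_json(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def preprocess(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_json(session, u) for u in urls))

# Example (hypothetical URLs):
# asyncio.run(preprocess([
#     "https://en.wikipedia.org/api/rest_v1/page/summary/Clandonald",
#     "https://fr.wikipedia.org/api/rest_v1/page/summary/Clandonald",
# ]))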

@Seddon and @Isaac, the article-descriptions inference service is now live in LiftWing production. It can be accessed through:
1. External endpoint:

curl "https://api.wikimedia.org/service/lw/inference/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}'

2. Internal endpoint:

curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org"

3. Documentation:

Please let us know in case there are any edge cases we may have missed. :)

This is really wonderful news! Thanks @kevinbazira for slogging through this with us, and @isarantopoulos for your support as well! Those endpoints were working for me too, so I'll let Android indicate what the next steps are.