
Migrate Machine-generated Article Descriptions from Toolforge to Lift Wing.
Open, Low, Public

Description

The Apps team would like to migrate https://ml-article-descriptions.toolforge.org/ off of Toolforge into a more production-grade setting, and it's been suggested that Lift Wing should be the destination.

What use case is the model going to support/resolve?
Android app users currently have an entry point for adding Wikidata descriptions to Wikipedia articles that are missing them. This model provides recommended descriptions, simplifying the process and helping codify norms around what these descriptions should look like. More background: https://www.mediawiki.org/wiki/Wikimedia_Apps/Team/Android/Machine_Assisted_Article_Descriptions

Do you have a model card? If you don't know what it is, please check https://meta.wikimedia.org/wiki/Machine_learning_models.
Yes: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Article_descriptions

What team created/trained/etc.. the model? What tools and frameworks have you used?
A group of external researchers and frequent collaborators from EPFL trained the model. More details in the paper (arxiv). Essentially they're using a modified transformers library to merge an mBART model that takes article paragraphs as input with an mBERT model that takes existing article descriptions from other languages as input (paper overview). NOTE: some details in the paper, such as the Wikidata knowledge graph embeddings, were not used in the deployed model.

Here's their raw code: https://github.com/epfl-dlab/transformers-modified (and a more general repo that also includes code for training)
And the API code that uses the model and has a copy of the modified transformers library: https://github.com/geohci/transformers-modified/blob/main/artdescapi/wsgi_template.py
The raw PyTorch model binary and supporting config etc. can be found here: https://drive.google.com/file/d/1bhn5O2WW6uXo4UvKDFoHqQnc0ozCCXmi/view?usp=sharing

What kind of data was the model trained with, and what kind of data the model is going to need in production (for example, calls to internal/external services, special datasources for features, etc..) ?

  • Input: Wikipedia article
  • Features: first paragraph of that article (if it exists -- not strictly necessary), first paragraph of the article in any of the 24 other languages supported by the model, existing article descriptions in any of the other 24 languages.
  • Output: k suggested article descriptions, where k is an adjustable parameter; we've found that generating the top 3 and recommending the top 2 of those best balances utility, quality, and diversity (a request/response sketch follows below).
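For reference, here is a minimal sketch of the request/response contract over HTTP. The URL and field names mirror the curl examples further down this task and should be treated as illustrative rather than a stable API.

# Minimal sketch of the request/response contract, mirroring the curl examples
# further down this task. The URL and field names are illustrative assumptions,
# not a stable API.
import requests

API_URL = "https://api.wikimedia.org/service/lw/inference/v1/models/article-descriptions:predict"

payload = {
    "lang": "en",           # Wikipedia language code supported by the model
    "title": "Clandonald",  # article title on that wiki
    "num_beams": 3,         # k: number of candidate descriptions to generate
}

resp = requests.post(API_URL, json=payload, timeout=60)
resp.raise_for_status()
data = resp.json()

# "prediction" holds the suggested descriptions in ranked order; "blp" flags
# biographies of living people (used for the guardrails mentioned below).
print(data["prediction"], data["blp"])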

If you have a minimal codebase that you used to run the first tests with the model, could you please share it?

State what team will own the model and please share the main points of contact (see more info in Ownership of a model).
Joseph Seddon

What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc.. to respond to queries? How does it react when 1/10/20/etc.. requests in parallel are made? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss next steps!
Generally 3-4 seconds, depending on the input. You can test it with various inputs at this UI for the Cloud VPS endpoint and see some latency stats in this notebook. Latency is highly dependent on how much content exists -- e.g., articles with many language editions and existing article descriptions are a good bit slower (upwards of 10+ seconds) than articles that only exist in a single language (1-2 seconds). The UI tolerates a second or two of latency, but any speed-ups would be welcome, and there might be ways to downsample the languages considered for highly multilingual topics to reduce some of the worst-case latencies (a rough sketch of that idea follows below). A GPU would obviously speed things up, but it's also possible that something like CTranslate2 could be applied, or simply that Lift Wing has better hardware for this use case than Cloud Services and it won't be an issue. These highly multilingual articles often already have article descriptions, so they are less likely to need the tool anyway.
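To illustrate the downsampling idea (the function name, the cap, and the "most content first" heuristic below are hypothetical, not part of the deployed code):

# Hypothetical sketch of capping the number of languages fed to the model for
# highly multilingual articles, to bound worst-case latency. The function name,
# the cap, and the "most content first" heuristic are illustrative assumptions,
# not the tool's actual behavior.
def downsample_languages(first_paragraphs, descriptions, target_lang, max_langs=8):
    """Keep at most max_langs languages, always including the target language."""
    candidates = sorted(
        set(first_paragraphs) | set(descriptions),
        key=lambda lang: len(first_paragraphs.get(lang, "")) + len(descriptions.get(lang, "")),
        reverse=True,
    )
    keep = set([target_lang] + [l for l in candidates if l != target_lang][:max_langs - 1])
    return (
        {l: p for l, p in first_paragraphs.items() if l in keep},
        {l: d for l, d in descriptions.items() if l in keep},
    )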

Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model and what was the dataset size?
At this point, there hasn't been discussion around re-training. Fine-tuning is certainly possible but I think not too urgent given that there's a pretty large dataset of existing Wikidata descriptions that the model was trained on and I don't foresee a massive amount of data drift. This repo has some details on training.

Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
A few analyses:

  • Initial potential harm exploration that led us to institute guardrails around which editors will have access to recommendations for biographies of living people.
  • The tool was piloted for several weeks in early 2023 and the edits made with the tool were evaluated by a number of volunteers across different languages (slide deck).

Anything else that is relevant in your opinion :)

  • If I foresee a potential engineering challenge, it's that the model inference code currently depends on a modified, static snapshot of the transformers repo. Long-term, that might not be desirable as it would be hard to keep it up-to-date with improvements made to the broader transformers library. I'm not sure how feasible it is, but it might be worth considering whether it's possible to convert it from a modified snapshot of the transformers library to something more like a wrapper around it.
  • An additional issue discovered during this testing is that the model will occasionally "hallucinate" dates of birth for people. This stems from a difficulty these language models have in handling numbers (they tend to tokenize them as single digits and so have trouble reasoning across things like dates) as well as the source of training data (TextExtracts), which removes dates from the article lead and thus likely makes it more difficult for the model to learn how to handle them. I'm open to a discussion on how to handle this, but after chatting with the EPFL researchers about it, I personally think it'd be easy and reasonable to add a simple filter that removes any recommendation containing a date that is not seen in the input paragraph / description data (a rough sketch of such a filter follows below).
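To make that concrete, here is a rough sketch of what such a filter could look like. The function name, the regex, and the "years only" scope are illustrative assumptions, not existing code; a real filter might cover full dates as well.

# Rough sketch of the proposed post-processing filter: drop any recommended
# description that mentions a 4-digit year never seen in the source paragraphs
# or descriptions. The regex and the "years only" scope are simplifying
# assumptions.
import re

YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def filter_hallucinated_dates(recommendations, source_texts):
    seen_years = set()
    for text in source_texts:
        seen_years.update(YEAR_RE.findall(text))
    return [
        rec for rec in recommendations
        if all(year in seen_years for year in YEAR_RE.findall(rec))
    ]

# Example: "French painter (born 1962)" is dropped if 1962 never occurs in the
# source texts, while "French painter" is kept.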

Details

Repo | Branch | Lines +/-
operations/deployment-charts | master | +1 -3
machinelearning/liftwing/inference-services | main | +6 -0
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +1 -1
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +0 -1
operations/deployment-charts | master | +3 -1
machinelearning/liftwing/inference-services | main | +2 -1
operations/deployment-charts | master | +9 -1
operations/deployment-charts | master | +65 -0
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
machinelearning/liftwing/inference-services | main | +11 -6
operations/deployment-charts | master | +19 -0
machinelearning/liftwing/inference-services | main | +35 -32
operations/deployment-charts | master | +19 -0
machinelearning/liftwing/inference-services | main | +530 -0
integration/config | master | +15 -0

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 970831 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: add article-descriptions model server

https://gerrit.wikimedia.org/r/970831

Change 975929 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add article-descriptions isvc to experimental namespace

https://gerrit.wikimedia.org/r/975929

Change 975929 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add article-descriptions isvc to experimental namespace

https://gerrit.wikimedia.org/r/975929

Change 975936 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: update model-server to use local files only

https://gerrit.wikimedia.org/r/975936

Change 975936 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: update model-server to use local files only

https://gerrit.wikimedia.org/r/975936

Change 976960 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/976960

Change 976960 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/976960

Change 976965 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: update wikidata host header in model-server

https://gerrit.wikimedia.org/r/976965

Change 977226 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: update wiki host headers in model-server

https://gerrit.wikimedia.org/r/977226

Change 977226 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: update wiki host headers in model-server

https://gerrit.wikimedia.org/r/977226

Change 977234 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977234

Change 977234 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977234

Change 977714 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977714

Change 977714 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/977714

Change 978059 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-serve/istio: Add Restbase as a handled destination for requests from LW

https://gerrit.wikimedia.org/r/978059

Change 978059 merged by jenkins-bot:

[operations/deployment-charts@master] ml-serve/istio: Add Restbase as a handled destination for requests from LW

https://gerrit.wikimedia.org/r/978059

Change 978098 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-serve/istio: fix wrong port in destination rule for restgw

https://gerrit.wikimedia.org/r/978098

Change 978098 merged by jenkins-bot:

[operations/deployment-charts@master] ml-serve/istio: fix wrong port in destination rule for restgw

https://gerrit.wikimedia.org/r/978098

Change 976965 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: fix wikipedia api summary endpoint

https://gerrit.wikimedia.org/r/976965

Change 978168 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978168

Change 978168 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978168

Change 978170 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-descriptions: remove host header from rest-gateway endpoint

https://gerrit.wikimedia.org/r/978170

Change 978170 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: remove host header from rest-gateway endpoint

https://gerrit.wikimedia.org/r/978170

Change 978171 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978171

Change 978171 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978171

Change 978542 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] article-descriptions: use a dedicated aiohttp session for rest-gateway

https://gerrit.wikimedia.org/r/978542

Change 978542 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: use a dedicated aiohttp session for rest-gateway

https://gerrit.wikimedia.org/r/978542

Change 978651 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978651

Change 978651 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-descriptions isvc image in the experimental namespace

https://gerrit.wikimedia.org/r/978651

Change 979042 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-services/artfcle-description: set OMP_NUM_THREADS=1

https://gerrit.wikimedia.org/r/979042

Change 979042 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: article-description set OMP_NUM_THREADS=1

https://gerrit.wikimedia.org/r/979042

The article-descriptions model-server has been deployed in the LiftWing experimental namespace. It is currently available through an internal endpoint that can only be accessed by tools that run within the WMF infrastructure:

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.038863420486450195,"total network (s)":0.367002010345459,"model (s)":14.40719723701477,"total (s)":14.774216890335083},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
real	0m14.815s
user	0m0.013s
sys	0m0.000s

@Isaac and @Seddon please test it and let us know of any edge cases you may come across. Once you have confirmed that there are none, we shall prepare to move it to production and provide an external endpoint.

Thanks @kevinbazira ! Awesome to see this working! A bug I uncovered is described below, followed by a few thoughts:

Bug

Any idea what's going on with this one? It seems to work fine with my Cloud VPS hosted API: https://ml-article-descriptions.toolforge.org/?lang=fr&title=Pachystomias%20microdon

isaacj@stat1008:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "fr", "title": "Pachystomias microdon", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"error":"AttributeError : 'NoneType' object has no attribute 'shape'"}
real	0m0.728s
user	0m0.030s
sys	0m0.005s

If you need another example, I also saw it with an Arabic article (these were all just randomly chosen, so it's happening semi-frequently, at least outside of English, where I haven't observed it yet). Here's a link to the expected output.

isaacj@stat1008:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "ar", "title": "نفيع (النادرة)", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"error":"AttributeError : 'NoneType' object has no attribute 'shape'"}
real	0m0.369s
user	0m0.023s
sys	0m0.013s

Thoughts

  • We'll want to use three beams (but only use the first two outputs), as we've been doing in the pilot.
  • Latency seems to vary from 10-20 seconds from what I've observed with my examples (but up to 45 seconds when I tried en:Philosophy), which I presume is going to be too slow (@Seddon maybe you have thoughts on what it should be). I generally see latency on the order of 2-3 seconds when I use my Cloud VPS hosted model, which has 8 VCPUs and 16GB RAM (not sure how much of that is required, but it's certainly processing the model much faster). This was fine for our pilot but I know it's still not ideal because it required some waiting on the user end. What options do we have for speeding up? More CPU/RAM, or is there a GPU available?
  • We're returning a fairly verbose response right now because it was useful for debugging etc. It shouldn't really affect latency and it'll max out at like 30KB probably but it's more info than is needed by default. @Seddon perhaps you could advise us on what the Android app actually wants and then we only include the rest if a debug parameter is explicitly passed?

As far as latency, our goal for features in the app is 500 milliseconds. I'd say anything over 3 seconds is too slow. When we ran the experiment, the introduction of machine edits increased time spent by 2.4 seconds. In the experiment it took folks 24 seconds to gain context on the feature, choose a machine suggestion, review, and publish. 20 seconds just to load is pretty slow.

We're returning a fairly verbose response right now because it was useful for debugging etc. It shouldn't really affect latency and it'll max out at like 30KB probably but it's more info than is needed by default. @Seddon perhaps you could advise us on what the Android app actually wants and then we only include the rest if a debug parameter is explicitly passed?

The Android app only makes use of the prediction and blp fields from the response. I think we should still keep the lang and title fields (in case the API returns an array of such responses in the future), but all the other fields can be considered debugging-only.
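For illustration, stripping the debugging-only fields server-side could look roughly like this. The default field set below is an assumption based on this comment, not the final implementation.

# Rough sketch of stripping the debugging-only fields from the model-server
# response unless the client asks for them. The default field set is an
# assumption based on this comment.
DEFAULT_FIELDS = {"lang", "title", "blp", "prediction"}

def format_response(full_response: dict, debug: bool = False) -> dict:
    """Return everything in debug mode, otherwise only the fields the app needs."""
    if debug:
        return full_response
    return {k: v for k, v in full_response.items() if k in DEFAULT_FIELDS}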

Regarding latency: as things are at the moment, dropping down to ms-level latency is not possible for a model of this size. The base model used is ~5GB. We can revisit and try to hit this kind of latency when we have GPUs in Lift Wing production in some months.
For the time being we can try to reduce serving times with the following, or a combination of them:

  • Increase CPU and memory resources for the current deployment.
  • Load a quantized version of the model, using 8-bit integers instead of 16-bit floats, in order to cut the model size roughly in half.
  • Try translating the model for optimized inference using a library like CTranslate2 or something similar.
  • Find a smaller model than mbart-large.

For options 2 and 4 we need to validate that any drop in output quality is acceptable (a rough sketch of option 2 follows below).
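As a rough illustration of option 2, dynamic int8 quantization with stock PyTorch could look like the following. This assumes loading a standard Hugging Face MBart checkpoint; the deployed service uses a modified transformers snapshot, so the load step would differ.

# Rough sketch of option 2: dynamically quantize the Linear layers to int8 to
# shrink the model and speed up CPU inference. Loading via stock transformers
# and the checkpoint name are assumptions.
import torch
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# quantized_model.generate(...) is a drop-in replacement, but its outputs need
# to be re-validated against the unquantized model (see the caveat above).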

Change 982407 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] article-descriptions: set OMP_NUM_THREADS automatically

https://gerrit.wikimedia.org/r/982407


Parity is definitely a minimum goal here for the time being, as mentioned by Jazmin above. It would be good to understand what is causing the performance gap between Lift Wing and Cloud VPS.

Change 982407 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: set OMP_NUM_THREADS automatically

https://gerrit.wikimedia.org/r/982407

Change 982803 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-service: deploy new Docker image for article-descriptions

https://gerrit.wikimedia.org/r/982803

Change 982803 merged by Elukey:

[operations/deployment-charts@master] ml-services: deploy new Docker image for article-descriptions

https://gerrit.wikimedia.org/r/982803

@Seddon What is the anticipated load/traffic for this model? It is important for us to know some ballpark numbers in order to craft a deployment strategy. Even some high level statistics would do at this moment.

Thank you all for sharing your feedback. I'd like to provide an update on the progress made in addressing the requests and issues raised by the Research and Android teams during testing of the article-descriptions inference service. This service is currently hosted on staging in the experimental namespace on LiftWing.

1. Prediction Bug Fixed:

@Isaac reported a prediction bug in T343123#9380779. The ML team investigated the cause of this issue, the Research team provided a fix, and we deployed it in: T352750.

2. API Response Fields Reduced:

@Dbrant specified the fields used by the Android app and said the others could be reserved for a debug mode (T343123#9384441). We added a debug mode to the article-descriptions model-server: when it is activated by setting the debug flag to 1 (i.e., {"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}), all 8 API response fields are returned; otherwise, only 5 of the 8 are returned, as detailed in: T352959.

3. Model-Server Response Performance Optimized:

@JTannerWMF specified that latency should be between 500ms-3s in T343123#9381050 and @Seddon wanted to understand the cause of the performance gap between LiftWing and the CloudVPS instance in T343123#9399941.

We worked on optimizing the response performance and reduced the response time for the original request from ~14s to ~4s without affecting prediction quality, as shown in: T353127.

In T353127#9416933, @Isaac clarified that despite LiftWing matching the env resources of the CloudVPS instance, the latency is not the same because the CloudVPS API is running an older version of the Descartes library.

4. Load Test Results Shared:

After optimizing the response time for a single request, we ran load tests to measure the number of multiple parallel requests the article-descriptions inference service could handle effectively. With the current setup where the model-server is hosted on 1 pod in the experimental namespace, it can handle a maximum of 20 requests per second as shown in T353952.

5. Next Steps:

We asked the Android team for an estimate of the expected load in T343123#9420718. Once we receive the estimate, we will allocate the necessary resources to meet it. If there are no further issues to address, we will then move the model-server to production.

@kevinbazira & @Isaac - We are still outside the latency bounds set by @JTannerWMF and so I'm left wondering if there is anything that can be done to reduce response latency any further?

@Seddon, in T353127 we were able to make significant improvements in response latency. For example, in T353127#9398823, there was a request that initially had a 14s response time. With subsequent optimization efforts, we managed to reduce this to 4s, as seen in T353127#9421055. This reduction was achieved by exceeding the Cloud VPS instance's CPU and memory resources and by using CPU core pinning. Neither of these methods affected prediction quality.

In T353127#9412764, we used a different method called dynamic quantization, which reduced the response time to 2s. However, this method also slightly changed the prediction output. @Isaac's input in T353127#9412894 emphasized the importance of maintaining consistency with the outputs validated during the human evaluation. Given that dynamic quantization would keep us within the latency bounds set by @JTannerWMF, we need to weigh the trade-off between latency and predictions matching Cloud VPS. Would you like us to proceed with dynamic quantization?

As part of the Lift Wing expansion we are on track to have GPUs installed in the next months (I'd say 2-3 months, depending on how procurement goes).
Following @Isaac's suggestion, I ran the model on an MI100 AMD GPU we already have on ml-staging, and latency drops under 2s (under 1s in many cases) for the requests mentioned earlier in this task. This is just from utilizing the GPU, without any of the further inference optimizations that could bring latencies even lower.
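For reference, the only code-level change needed to use the GPU is moving the model and inputs to the device; PyTorch ROCm builds expose AMD GPUs like the MI100 through the usual "cuda" device. The checkpoint and input text below are placeholders, not the deployed service's code.

# Illustrative sketch of GPU inference on a ROCm build of PyTorch. The AMD GPU
# is addressed through the usual "cuda" device; checkpoint and input text are
# placeholders only.
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50").to(device)
model.eval()

inputs = tokenizer("Clandonald is a hamlet in central Alberta, Canada.", return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(**inputs, num_beams=3, max_length=32)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))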

I ran a load test on the samples previously gathered by Kevin. Keep in mind that these results are not to be compared with previous test results, but they are comparable with each other:

Type | Name | # reqs | # fails | Avg (ms) | Min (ms) | Max (ms) | Med (ms) | req/s | failures/s
POST | /v1/models/article-descriptions:predict GPU | 20 | 0 (0.00%) | 2845 | 957 | 4982 | 2700 | 0.17 | 0.00
POST | /v1/models/article-descriptions:predict CPU | 11 | 0 (0.00%) | 7453 | 4937 | 11474 | 7200 | 0.10 | 0.00
GET | /article Cloud VPS (https://ml-article-description-api.wmcloud.org) | 14 | 0 (0.00%) | 5473 | 4320 | 7639 | 5000 | 0.12 | 0.00

The following screenshots are taken from the inference services Grafana dashboard for the CPU and GPU, respectively.

Results with the CPU LiftWing version:

CPU_latencies.png (1×2 px, 298 KB)

Results with the GPU LiftWing version:

GPU_latencies.png (1×2 px, 287 KB)

Some thoughts:

  • The 75th and 99th percentiles in the predict phase demonstrate a ~10x speed-up with the use of the GPU, which can be further improved
  • the main bottleneck is the preprocess step (fetching all the data from mediawiki & wikidata etc.) so we can focus on improving this step.

@Isaac shall I use the test set you previously used to establish a standard procedure for load testing?

This is very useful (and exciting) data -- thank you @isarantopoulos !

shall I use the test set you previously used to establish a standard procedure for load testing?

Looking at Kevin's set, it seems like a reasonable random sample that covers the languages. For the purpose of guiding the conversation, I think that set should be just fine.

the main bottleneck is the preprocess step (fetching all the data from mediawiki & wikidata etc.) so we can focus on improving this step.

I'm curious about this: I never see it go above 0.5 seconds with my sample and the Cloud VPS endpoint. Is it much different on LiftWing?

@kevinbazira and @isarantopoulos

The Apps team would like to move forward with going to production, with the understanding that we would look to leverage GPU time once the cards are available.

I would appreciate hearing more about @Isaac's query about the origin of the disparity in the preprocessing step between Cloud VPS and Lift Wing, and what you think might be achievable there.

We're looking into the increased preprocessing times that we reported to check whether this has to do with specific load requests, a Lift Wing network issue when connecting to the APIs, or something else. Then we can continue to push this into production and do some further load testing over there.

After making a change to the way the service does preprocessing (increasing concurrency), we see that performance is now on par with the Cloud VPS instance (if not slightly better); a rough sketch of the kind of change follows after the table and screenshot below.

Type | Name | # reqs | # fails | Avg (ms) | Min (ms) | Max (ms) | Med (ms) | req/s | failures/s
POST | /v1/models/article-descriptions:predict CPU | 16 | 0 (0.00%) | 4616 | 3280 | 7050 | 4500 | 0.13 | 0.00
GET | /article Cloud VPS (https://ml-article-description-api.wmcloud.org) | 14 | 0 (0.00%) | 5473 | 4320 | 7639 | 5000 | 0.12 | 0.00

Screenshot 2024-02-23 at 10.59.14 PM.png (1×4 px, 400 KB)
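For illustration, the kind of concurrency change described above could look roughly like this; the URLs and helper names are illustrative, not the actual inference-service code.

# Rough sketch of concurrent preprocessing: fetch the per-language paragraphs
# and Wikidata descriptions in parallel instead of one after another, so total
# wall time is bounded by the slowest call rather than the sum of all calls.
# URLs and helper names are illustrative assumptions.
import asyncio
import aiohttp

async def fetch_json(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()

async def preprocess(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_json(session, u) for u in urls))

# Example (hypothetical URLs):
# asyncio.run(preprocess([
#     "https://en.wikipedia.org/api/rest_v1/page/summary/Clandonald",
#     "https://fr.wikipedia.org/api/rest_v1/page/summary/Clandonald",
# ]))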

@Seddon and @Isaac, the article-descriptions inference service is now live in LiftWing production. It can be accessed through:
1. External endpoint:

curl "https://api.wikimedia.org/service/lw/inference/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}'

2. Internal endpoint:

curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.article-descriptions.wikimedia.org"

3. Documentation:

Please let us know in case there are any edge cases we may have missed. :)

This is really wonderful news! Thanks @kevinbazira for slogging through this with us, and @isarantopoulos for your support as well! Those endpoints were working for me too, so I'll let Android indicate what the next steps are.