
Investigate increased preprocessing latencies on LW of article-descriptions model
Closed, ResolvedPublic

Description

After the latest load testing on the article-descriptions model server we discovered a discrepancy in preprocessing latency between running the service on Cloud VPS and running it on Lift Wing.

After further investigation by @kevinbazira we found out that the model server behaves differently when run from localhost or ml-sandbox. The difference is in the way we access the REST API: local runs use RESTBase, while Lift Wing uses the REST Gateway.
Given that RESTBase is being deprecated, the correct way to access the API is through the Gateway.
In this task we'll investigate the issue to find the root cause of this and answer the following questions:

  • Check whether the issue is caused by misconfiguration on the istio side
  • Check whether the issue is caused by a wrong request (e.g. wrong header etc)
  • The model server makes multiple requests to the REST endpoint. Given that RESTBase has a cache while the Gateway doesn't, is the increased total latency explained by the cumulative latency of the underlying requests? If so, how could we overcome this?
  • If none of the above yield a solution, investigate what is the deprecation timeline for RESTBase and check if we could use it on the LW service.

Event Timeline

Change 1005720 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] article-descriptions: add more logging for preprocess requests

https://gerrit.wikimedia.org/r/1005720

In yesterday's meeting IIRC @klausman mentioned testing direct API calls to confirm whether the Rest Gateway endpoint used by LiftWing is slower. Here are some API calls we could use to test this:
1. Run within the article-descriptions model server hosted on LiftWing in the experimental namespace:

time curl http://rest-gateway.discovery.wmnet:4111/es.wikipedia.org/v1/page/summary/Madrid

2. Run within the article-descriptions model server hosted outside LiftWing:

time curl https://es.wikipedia.org/api/rest_v1/page/summary/Madrid

@klausman when you get a minute, could you please help run the first API call on LiftWing?

I am going to run the one outside LiftWing.

PS: This API call was chosen for testing because on LW it was part of a preprocess step that had the longest time in P57453#232415

This is run from within the container article-descriptions-predictor-default-00025-deployment-5czmjql currently running on staging:

>>> import requests
>>> import time
>>> for x in range(10):
...  start=time.time()
...  requests.get('http://rest-gateway.discovery.wmnet:4111/es.wikipedia.org/v1/page/summary/Madrid')
...  print(time.time()-start)
... 
<Response [200]>
0.4481840133666992
<Response [200]>
0.5158281326293945
<Response [200]>
0.4289090633392334
<Response [200]>
0.4010586738586426
<Response [200]>
0.30234456062316895
<Response [200]>
0.4509139060974121
<Response [200]>
0.3973238468170166
<Response [200]>
0.4100210666656494
<Response [200]>
0.4207029342651367
<Response [200]>
0.4079725742340088
>>>

Most requests are under 500 ms, and none is faster than roughly 300 ms (the quickest here was ~302 ms).


And with a variety of pages requested:

>>> for p in ["Madrid", "Spaghetti", "Coal", "Light", "Poetry"]:
...  tt=0
...  for x in range(10):
...   start=time.time()
...   _=requests.get(f'http://rest-gateway.discovery.wmnet:4111/es.wikipedia.org/v1/page/summary/{p}')
...   tt += time.time()-start
...  print(p, tt/10)
... 
Madrid 0.3972129821777344
Spaghetti 0.18262295722961425
Coal 0.11828050613403321
Light 0.19698092937469483
Poetry 1.2573089838027953

Note the wide variety of latencies, spanning from 118 ms for "Coal" to more than 10x that for "Poetry". This indicates to me that any rigorous latency testing has to use a wide dataset of pages for which the summaries are requested.
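
For reference, a rough sketch of what a wider benchmark could look like (the titles below are purely illustrative; a rigorous test should sample from a much larger and more varied set of pages, and the endpoint is the same rest-gateway one used above):

import statistics
import time

import requests

# Illustrative titles only -- a real test should draw from a larger, more varied dataset.
titles = ["Madrid", "Spaghetti", "Coal", "Light", "Poetry", "Jazz", "Roma", "Sol"]
base = "http://rest-gateway.discovery.wmnet:4111/es.wikipedia.org/v1/page/summary"

latencies = []
for title in titles:
    for _ in range(10):
        start = time.time()
        requests.get(f"{base}/{title}")
        latencies.append(time.time() - start)

latencies.sort()
print(f"mean={statistics.mean(latencies):.3f}s "
      f"p50={latencies[len(latencies) // 2]:.3f}s "
      f"p95={latencies[int(len(latencies) * 0.95)]:.3f}s")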

I have run the second API call using the same local server on the ml-sandbox (as used in P57453#232415). Below are the results:

$ time curl https://es.wikipedia.org/api/rest_v1/page/summary/Madrid
{"type":"standard","title":"Madrid","displaytitle":"<span class=\"mw-page-title-main\">Madrid</span>","namespace":{"id":0,"text":""},"wikibase_item":"Q2807","titles":{"canonical":"Madrid","normalized":"Madrid","display":"<span class=\"mw-page-title-main\">Madrid</span>"},"pageid":1791,"thumbnail":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Bandera_de_la_ciudad_de_Madrid.svg/langes-320px-Bandera_de_la_ciudad_de_Madrid.svg.png","width":320,"height":213},"originalimage":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Bandera_de_la_ciudad_de_Madrid.svg/langes-1500px-Bandera_de_la_ciudad_de_Madrid.svg.png","width":1500,"height":1000},"lang":"es","dir":"ltr","revision":"158343028","tid":"8ade1a10-d0ba-11ee-b43f-d31ec09a3bed","timestamp":"2024-02-21T13:09:02Z","description":"capital y municipio más poblado de España","description_source":"central","coordinates":{"lat":40.41694444,"lon":-3.70333333},"content_urls":{"desktop":{"page":"https://es.wikipedia.org/wiki/Madrid","revisions":"https://es.wikipedia.org/wiki/Madrid?action=history","edit":"https://es.wikipedia.org/wiki/Madrid?action=edit","talk":"https://es.wikipedia.org/wiki/Discusi%C3%B3n:Madrid"},"mobile":{"page":"https://es.m.wikipedia.org/wiki/Madrid","revisions":"https://es.m.wikipedia.org/wiki/Special:History/Madrid","edit":"https://es.m.wikipedia.org/wiki/Madrid?action=edit","talk":"https://es.m.wikipedia.org/wiki/Discusi%C3%B3n:Madrid"}},"extract":"Madrid es un municipio y una ciudad de España. La localidad, con categoría histórica de villa, es la capital del Estado y de la Comunidad de Madrid. En su término municipal, el más poblado de España, están empadronadas 3 280 782 personas, constituyéndose como la segunda ciudad más poblada de la Unión Europea, así como su área metropolitana, con 6 779 888 habitantes empadronados.","extract_html":"<p><b>Madrid</b> es un municipio y una ciudad de España. La localidad, con categoría histórica de villa, es la capital del Estado y de la Comunidad de Madrid. En su término municipal, el más poblado de España, están empadronadas <span>3 280 782 personas</span>, constituyéndose como la segunda ciudad más poblada de la Unión Europea, así como su área metropolitana, con <span>6 779 888 habitantes</span> empadronados.</p>"}
real	0m0.047s
user	0m0.025s
sys	0m0.017s

Another option is using something like https://mobileapps.discovery.wmnet:4102/es.wikipedia.org/v1/page/summary/Madrid, so there would be neither RESTbase nor the REST API in the path, but I am seeing similar latencies there.

Change 1005720 merged by Ilias Sarantopoulos:

[machinelearning/liftwing/inference-services@main] article-descriptions: add more logging for preprocess requests

https://gerrit.wikimedia.org/r/1005720

I added some more logging so that we can get a better breakdown of where the time is being spent, and it turns out that it is the MWAPI that is taking too long.

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "fr", "title": "Tour_Eiffel", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"fr","title":"Tour_Eiffel","blp":false,"num_beams":2,"groundtruth":"monument de Paris, France","latency":{"wikidata-info (s)":0.0849905014038086,"mwapi - first paragraphs (s)":6.222133636474609,"total network (s)":6.26397967338562,"model (s)":4.969624042510986,"total (s)":11.23367714881897},"features"

In this specific example 16 requests are made and all 16 extracts are returned to be processed by the model. Although the requests seem to be made asynchronously, they take approximately 6 seconds to complete.
Regarding these requests, I think we should also be specifying the Host header when making requests to the REST Gateway here
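
For illustration, the added logging boils down to timing each external call and collecting the results in a dict like the one in the response above. A rough, runnable sketch (not the actual patch; the get_wikidata_info/get_first_paragraphs functions below are dummy stand-ins for the real calls):

import asyncio
import time


async def timed(label: str, coro, latencies: dict):
    """Await a coroutine and record how long it took under `label`."""
    start = time.time()
    result = await coro
    latencies[label] = time.time() - start
    return result


# Dummy stand-ins for the real Wikidata / MWAPI calls, only here to make the sketch runnable.
async def get_wikidata_info(lang: str, title: str):
    await asyncio.sleep(0.1)
    return {}, {lang: title}, False


async def get_first_paragraphs(sitelinks: dict):
    await asyncio.sleep(0.5)
    return ["..." for _ in sitelinks]


async def preprocess(lang: str, title: str) -> dict:
    latencies = {}
    descriptions, sitelinks, blp = await timed(
        "wikidata-info (s)", get_wikidata_info(lang, title), latencies
    )
    paragraphs = await timed(
        "mwapi - first paragraphs (s)", get_first_paragraphs(sitelinks), latencies
    )
    latencies["total network (s)"] = sum(latencies.values())
    return latencies


print(asyncio.run(preprocess("fr", "Tour_Eiffel")))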

Regarding these requests, I think we should also be specifying the Host header when making requests to the REST Gateway here

The Host header isn't mandatory for these requests; it will always be taken from the path after the /v1/ for all current services.

Just to clarify: external requests against the APIs will benefit from multiple caches, so they will almost always be faster for material that has been previously accessed. Direct requests to RESTBase don't benefit from the edge caches, but they will still be faster because of RESTBase's own storage. Mobileapps will be moving the storage behaviour out of RESTBase and into itself in anticipation of the deprecation sometime soon (T348995), which will give some speedups when it is accessed directly.

The following isn't related to the REST Gateway (Lift Wing) vs. RESTBase (localhost) discrepancy, but it is a supplementary change that will speed up requests.
After digging a bit into how the model works as well as the code, I found the following:

  • the model takes into account articles (the first paragraphs in our case) and short descriptions in all languages where the article is available.

This means that for articles that exist in multiple wikis, a bunch of requests are made. An example is the request {"lang": "fr", "title": "Tour_Eiffel", "num_beams": 3} (also referred to in previous messages): the article exists in 16 languages, so 16 requests are made towards the REST Gateway.
The issue is that these requests are made sequentially; we should switch to concurrent execution to benefit from it. @Isaac had done so in his implementation using threads.
I suggest doing it with asyncio.gather. Some local tests I ran cut preprocessing latency roughly in half for this request.
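
A minimal sketch of that asyncio.gather approach (aiohttp is assumed for the HTTP client, and the function names are illustrative rather than the actual patch; the endpoint is the rest-gateway one used above):

import asyncio

import aiohttp

REST_GATEWAY = "http://rest-gateway.discovery.wmnet:4111"


async def fetch_summary(session: aiohttp.ClientSession, lang: str, title: str) -> dict:
    # Fetch the page summary for one language edition.
    url = f"{REST_GATEWAY}/{lang}.wikipedia.org/v1/page/summary/{title}"
    async with session.get(url) as resp:
        return await resp.json()


async def fetch_all_summaries(sitelinks: dict) -> list:
    # Fire all per-language requests concurrently instead of one after the other.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_summary(session, lang, title) for lang, title in sitelinks.items()]
        return await asyncio.gather(*tasks)


# e.g. summaries = asyncio.run(fetch_all_summaries({"fr": "Tour_Eiffel", "en": "Eiffel_Tower"}))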

Change 1006008 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] article-descriptions: make concurrent requests to REST gateway

https://gerrit.wikimedia.org/r/1006008

  • the model takes into account articles (the first paragraphs in our case) and short descriptions in all languages where the article is available.

This puzzles me. Is that really necessary for the model to work?

At any rate, it probably explains the wide variance of latencies between different articles.

This puzzles me. Is that really necessary for the model to work?

Yes and no -- the model is multilingual, so you can think of it as doing a mixture of finding the right phrase from the first paragraph of the article in the target language, translating over descriptions from other languages, and translating+extracting phrases from articles in other language editions too. In reality, I suspect we could come up with a simple but smart way to cap how many languages it queries without actually reducing output quality (because I'd guess that the model has everything it really needs after at most 5 or so languages). But also in reality, most missing article descriptions will be for articles that exist in at most only a few languages, so this sort of optimization hasn't been tested because I think it'd be triggered pretty rarely. If we feel this is important, I can do some tests to see how much this changes the output. It's very simple code-wise at least: you'd still gather all the possible sitelinks but then cap them with something like the following (here we keep the five largest language editions):

descriptions, sitelinks, blp = await self.get_wikidata_info(lang, title)
# new code - excuse its hackiness
# rank languages by (rough) size of their Wikipedia edition
lang_by_size = {l: i for i, l in enumerate(['en', 'de', 'nl', 'es', 'it', 'ru', 'fr', 'zh', 'ar', 'vi', 'ja', 'fi', 'ko', 'tr', 'ro', 'cs', 'et', 'lt', 'kk', 'lv', 'hi', 'ne', 'my', 'si', 'gu'])}
# sort sitelinks by that ranking (unlisted languages go last) and keep the five largest
sitelinks = {l: sitelinks[l] for i, l in enumerate(sorted(sitelinks, key=lambda l: lang_by_size.get(l, len(lang_by_size)))) if i < 5}

This puzzles me. Is that really necessary for the model to work?

If the way the model does inference were tweaked we could get past that, but according to the paper all languages are used. This is described as "article fusion" in Figure 2 on page 3 (screenshot below).

Screenshot 2024-02-23 at 6.10.29 PM.png (1 MB)

Change 1006008 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-descriptions: make concurrent requests to REST gateway

https://gerrit.wikimedia.org/r/1006008

After testing the above change in ml-staging, preprocessing latency for the above request ({"lang": "fr", "title": "Tour_Eiffel", "num_beams": 3}) drops from 6.26 s to 0.789 s (8x faster!), with all the improvement coming from the asynchronous requests to get the first paragraphs.

time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "fr", "title": "Tour_Eiffel", "num_beams": 2, "debug": 1}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"lang":"fr","title":"Tour_Eiffel","blp":false,"num_beams":2,"groundtruth":"monument de Paris, France","latency":{"wikidata-info (s)":0.07444643974304199,"mwapi - first paragraphs (s)":0.7440714836120605,"total network (s)":0.7892155647277832,"model (s)":5.636499643325806,"total (s)":6.425840139389038},

Although this is a great improvement, it still doesn't explain the differences in request time demonstrated between the local run and LW.

A new load test will tell us more though.

Change 1006074 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update article descriptions image

https://gerrit.wikimedia.org/r/1006074

Change 1006074 merged by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: update article descriptions image and add GPU

https://gerrit.wikimedia.org/r/1006074

Can we investigate reducing the computational need to just the language requested?

calbon updated Other Assignee, added: kevinbazira.
calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board.
calbon moved this task from Ready To Go to In Progress on the Machine-Learning-Team board.

Can we investigate reducing the computational need to just the language requested?

The model definitely benefits from some translations, so I would say "just the language requested" is not the right approach. Are you suggesting capping it at 5, for example? With Ilias' fix it no longer seems to be an issue from the pre-processing perspective, so hopefully this is not a blocker at this point and capping would just be a bonus for model latency. That said, if there's a desire to constrain it further, I can look into it in the next few weeks, but please don't let it be a blocker given that Android has said they're comfortable moving forward per T343123#9558740.

Can we investigate reducing the computational need to just the language requested?

The model needs the input from other languages (this is how it is trained), so any additional limitations will likely hurt its quality.
Since we don't see a performance penalty at the moment, we can proceed as is and revisit capping it to a specific number if we have issues in the future.