article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing
ActivePublic
Actions

Authored by kevinbazira on Feb 21 2024, 8:15 AM.

Tags

Referenced Files

	F42022960: LiftWing article-descriptions api-ro.discovery.wmnet latency (Screenshot from 2024-02-21 17-25-36).png
	Feb 21 2024, 3:18 PM

	F42022948: LiftWing article-descriptions rest-gateway.discovery.wmnet latency (Screenshot from 2024-02-21 17-25-12).png
	Feb 21 2024, 3:18 PM

	F42022927: LiftWing article-descriptions InboundPassthroughClusterIpv4 latency (Screenshot from 2024-02-21 17-26-00).png
	Feb 21 2024, 3:18 PM

	F42022901: LiftWing article-descriptions article-descriptions-predictor-default-00025-private latency (Screenshot from 2024-02-21 17-29-13).png
	Feb 21 2024, 3:18 PM

	F42016896: article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing
	Feb 21 2024, 8:15 AM

Subscribers

Isaac

isarantopoulos

	isvc request in debug mode to return latency field
	```
	$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2, "debug": 1}' -H "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
	{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.04478883743286133,"total network (s)":0.4124155044555664,"model (s)":2.252171277999878,"total (s)":2.664602756500244},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}
	real 0m2.699s
	user 0m0.005s
	sys 0m0.008s
	```

	formatted JSON response for easier reading
	```
	{
	"lang": "en",
	"title": "Clandonald",
	"blp": false,
	"num_beams": 2,
	"groundtruth": "Hamlet in Alberta, Canada",
	"latency": {
	"wikidata-info (s)": 0.04478883743286133,
	"total network (s)": 0.4124155044555664,
	"model (s)": 2.252171277999878,
	"total (s)": 2.664602756500244
	},
	"features": {
	"descriptions": {
	"fr": "hameau d'Alberta",
	"en": "hamlet in central Alberta, Canada"
	},
	"first-paragraphs": {
	"en": "Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.",
	"fr": "Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."
	}
	},
	"prediction": [
	"Hamlet in Alberta, Canada",
	"human settlement in Alberta, Canada"
	]
	}
	```

Event Timeline

kevinbazira created this paste.Feb 21 2024, 8:15 AM

The preprocess() method calculates its runtime using execution_times["total network (s)"] as shown here. Based on the formatted JSON response shown above, the preprocess runtime is about 0.4s. These results match @Isaac's comment in T343123#9527462.

According to T343123#9520331, the bottleneck is the preprocess step. However, the formatted JSON response shown above and the code profiling we did in T353127#9399942 indicate that the bottleneck is the predict step.

After having a chat with @isarantopoulos on IRC, I ran a comparison between the preprocess() and predict() methods. I used the sample inputs in P54507 to determine if there was a discrepancy between LiftWing and a LocalServer running on the ml-sandbox.

Based on the results shown in the table below, the preprocess step on LiftWing takes significantly longer compared to the preprocess step on LocalServer, even when LocalServer is running on fewer compute resources. Although preprocess on LiftWing is slower than preprocess on LocalServer, predict takes much longer time than preprocess on both LiftWing and LocalServer. This makes predict the main bottleneck.

Request Payload	LiftWing		LocalServer
	preprocess()	predict()	preprocess()	predict()
{"lang": "en", "title": "Clandonald", "num_beams": 3, "debug": 1}	0.393696547	3.089810848	0.116380453	4.053587675
{"lang": "fr", "title": "Tour_Eiffel", "num_beams": 3, "debug": 1}	5.164959669	4.941433907	0.358972549	6.980854988
{"lang": "it", "title": "Ravenna", "num_beams": 3, "debug": 1}	3.588263035	3.494236469	0.300351143	4.699475765
{"lang": "es", "title": "Madrid", "num_beams": 3, "debug": 1}	5.514351368	5.462801933	0.422098398	7.945872545
{"lang": "de", "title": "Kölner_Dom", "num_beams": 3, "debug": 1}	3.92395401	3.972553015	0.315475464	5.495382547
{"lang": "nl", "title": "Delfts_blauw", "num_beams": 3, "debug": 1}	1.895589828	3.903458357	0.18949461	5.860885143
{"lang": "ja", "title": "伏見稲荷大社", "num_beams": 3, "debug": 1}	2.373053789	5.152765036	0.267624855	6.812711
{"lang": "zh", "title": "大熊猫", "num_beams": 3, "debug": 1}	4.939737797	6.26681304	0.343618155	8.47557354
{"lang": "ko", "title": "K-pop", "num_beams": 3, "debug": 1}	4.305467129	5.383946896	0.271742821	7.498938084
{"lang": "vi", "title": "Áo_dài", "num_beams": 3, "debug": 1}	2.402287245	2.414685726	0.261023283	3.141326427
{"lang": "ru", "title": "Московский_Кремль", "num_beams": 3, "debug": 1}	3.180792332	4.239145279	0.316680908	5.497658491
{"lang": "cs", "title": "Martina_Navrátilová", "num_beams": 3, "debug": 1}	4.523844719	5.620767593	0.3497262	7.412033796
{"lang": "fi", "title": "Muumilaakson_tarinoita", "num_beams": 3, "debug": 1}	1.890240908	7.186655521	0.501304865	9.384730816
{"lang": "lt", "title": "Vilnius", "num_beams": 3, "debug": 1}	4.773990393	3.593998194	0.352459431	4.712741375
{"lang": "lv", "title": "Kristaps_Porziņģis", "num_beams": 3, "debug": 1}	2.902407169	3.347532988	0.282046556	4.313842058
{"lang": "et", "title": "Naisekandmine", "num_beams": 3, "debug": 1}	2.09082222	4.651262283	0.237919807	6.104430437
{"lang": "tr", "title": "Lokum", "num_beams": 3, "debug": 1}	2.831333876	2.376792192	0.255112171	2.86655283
{"lang": "ro", "title": "Țara_Românească", "num_beams": 3, "debug": 1}	2.077201366	2.840348244	0.249173641	4.163508892
{"lang": "kk", "title": "Қазақ_тілі", "num_beams": 3, "debug": 1}	3.741629601	2.808483124	0.370487928	3.881841183
{"lang": "gu", "title": "પટોળા", "num_beams": 3, "debug": 1}	0.588500261	4.91782403	0.135048151	6.921203136
{"lang": "hi", "title": "बुद्ध_(उपाधि)", "num_beams": 3, "debug": 1}	3.818928719	4.605582476	0.314588547	5.683538198
{"lang": "si", "title": "නල්ලූර්_කන්දස්වාමි_කෝවල", "num_beams": 3, "debug": 1}	0.649841785	6.321661711	0.232743979	7.659542799
{"lang": "my", "title": "သနပ်ခါး", "num_beams": 3, "debug": 1}	1.551228046	5.641949654	0.189154863	6.954239845
{"lang": "ne", "title": "दसैँ", "num_beams": 3, "debug": 1}	1.86123848	2.844894886	0.17886591	4.162917376

Average	2.957640012	4.378308475	0.283837279	5.861807873

To further understand the latency on LiftWing, I looked at backends and p0.99 in this grafana dashboard: https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?from=now-3h&orgId=1&to=now&var-backend=rest-gateway.discovery.wmnet&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-namespace=experimental&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&var-response_code=All

1.article-descriptions-predictor-default-00025-private has the highest latency spike at 13.7s

LiftWing article-descriptions article-descriptions-predictor-default-00025-private latency (Screenshot from 2024-02-21 17-29-13).png (741×1 px, 155 KB)

2.InboundPassthroughClusterIpv4 has the second highest latency spike at 4.8s

LiftWing article-descriptions InboundPassthroughClusterIpv4 latency (Screenshot from 2024-02-21 17-26-00).png (741×1 px, 124 KB)

3.following closely behind is rest-gateway.discovery.wmnet with a spike of 4.65s

LiftWing article-descriptions rest-gateway.discovery.wmnet latency (Screenshot from 2024-02-21 17-25-12).png (741×1 px, 131 KB)

4.lastly, api-ro.discovery.wmnet has the lowest spike at 246ms

LiftWing article-descriptions api-ro.discovery.wmnet latency (Screenshot from 2024-02-21 17-25-36).png (741×1 px, 120 KB)

If we skip article-descriptions-predictor-default-00025-private because when utilizing the GPU it stops being the bottleneck as shown here, then we'll have to look at optimizing InboundPassthroughClusterIpv4 and rest-gateway.discovery.wmnet.

Oh wow good sleuthing. To help me understand, does this sound right: the challenge with preprocess is that it for every language (up to 25) that the article exists in, an API call has to be made to that language edition's page summary REST endpoint to get the first paragraph (enwiki endpoint). And as we can see, this normally is still under half a second because it's a pretty quick API and we do the calls async. But presumably something on LiftWing is preventing the up-to-25 API calls from being made async/simultaneously?

kevinbazira mentioned this in T358195: Investigate increased preprocessing latencies on LW of article-descriptions model.Feb 22 2024, 10:53 AM

@Isaac, you're spot on! The main difference to note is that when we're running the model server on LiftWing, it accesses the REST endpoint via the Rest Gateway using http://rest-gateway.discovery.wmnet:4111/{lang}.wikipedia.org/v1/page/summary/{title}. However, when we're running the model server locally, it accesses the REST endpoint via the Wikimedia REST API using https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}.

@kevinbazira thanks for explaining -- I was unaware of the Rest Gateway etc. stuff so assumed LiftWing was using the same entrypoints to the APIs. I saw the other ticket where you're working through possibilities. I'll monitor but sounds like you all have it and thanks for digging into this!

article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWingActivePublicActions

Event Timeline

article-descriptions: evaluate `preprocess()` and `predict()` runtime on LiftWing
ActivePublic
Actions