
Optimize response performance for the article-descriptions model-server
Closed, Resolved · Public · 3 Estimated Story Points

Description

The Android team is currently testing the article-descriptions model-server hosted in the experimental namespace on LiftWing. They have reported that the response latency is quite slow, ranging between 10s and 20s, and would like the response performance optimized so that latency falls in the range of 500ms-3s.

Event Timeline

The article-descriptions model-server container hosted on LiftWing uses 1 CPU and 4GB of memory. Here are the results showing the response performance of requests in 2 languages:

  1. English request:
kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"],"blp":false,"lang":"en","title":"Clandonald","num_beams":2}
real	0m14.194s
user	0m0.004s
sys	0m0.010s
  2. French request:
kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "fr", "title": "Pachystomias microdon", "num_beams": 2}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["espèce de poissons","espèce poissons"],"blp":false,"lang":"fr","title":"Pachystomias microdon","num_beams":2}
real	0m11.191s
user	0m0.011s
sys	0m0.003s

The same container hosted in the ML sandbox uses 8 CPUs and 4GB of memory. Here are the results showing the response performance of the 2 requests from above:

  1. English request:
somebody@864ffd7c9089:~$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H "Content-type: application/json"
{"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"],"blp":false,"lang":"en","title":"Clandonald","num_beams":2}
real	0m3.341s
user	0m0.008s
sys	0m0.016s
  2. French request:
somebody@864ffd7c9089:~$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "fr", "title": "Pachystomias microdon", "num_beams": 2}' -H "Content-type: application/json"
{"prediction":["espèce de poissons","espèce poissons"],"blp":false,"lang":"fr","title":"Pachystomias microdon","num_beams":2}
real	0m2.621s
user	0m0.013s
sys	0m0.009s

Below are performance metrics showing the number of CPUs vs. response time for the English request on the ML sandbox with 4GB of memory:

| CPUs | Response Time |
| ---- | ------------- |
| 1    | 1m10.185s     |
| 2    | 0m27.351s     |
| 4    | 0m9.258s      |
| 8    | 0m3.446s      |

Based on the above tests, the configuration with 8 CPUs comes closest to achieving our goal of a response time of ~3s. However, before implementing this configuration in the experimental namespace on LiftWing, I am going to explore other options to optimize the response performance.

The next optimization option I explored after increasing CPU and memory resources in the ML sandbox was code profiling, to figure out which parts of the code take the longest to execute so that we can improve them. The profiling was done using the lowest-performing container, with 1 CPU and 4GB of memory, and it focused on the key methods (load, preprocess, predict) that a request goes through when it is sent to the model-server, before a response is returned. Below are the profiling results:

  1. load()
import cProfile
import model
import pstats
profile = cProfile.Profile()
# Load Model
profile.run("model.ArticleDescriptionsModel('article_descriptions')")
results = pstats.Stats(profile)
results.sort_stats('time')
results.print_stats(20)

>>>

         3742905 function calls (3697658 primitive calls) in 5.854 seconds

   Ordered by: internal time
   List reduced from 750 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1476    2.751    0.002    2.774    0.002 /opt/lib/python/site-packages/torch/serialization.py:1109(load_tensor)
        1    0.579    0.579    0.579    0.579 {built-in method sentencepiece._sentencepiece.SentencePieceProcessor_LoadFromFile}
        2    0.270    0.135    0.270    0.135 {built-in method gc.collect}
        1    0.127    0.127    0.960    0.960 /opt/lib/python/site-packages/transformers/models/mbart/tokenization_mbart.py:284(<dictcomp>)
   249997    0.127    0.000    0.493    0.000 /opt/lib/python/site-packages/sentencepiece/__init__.py:492(_func)
   249997    0.120    0.000    0.120    0.000 {built-in method sentencepiece._sentencepiece.SentencePieceProcessor_IdToPiece}
   250027    0.120    0.000    0.833    0.000 /opt/lib/python/site-packages/transformers/tokenization_utils.py:953(convert_ids_to_tokens)
   250027    0.115    0.000    0.691    0.000 /opt/lib/python/site-packages/transformers/models/mbart/tokenization_mbart.py:300(_convert_id_to_token)
      593    0.107    0.000    0.156    0.000 /opt/lib/python/site-packages/transformers/modeling_utils.py:513(<listcomp>)
   249997    0.082    0.000    0.575    0.000 /opt/lib/python/site-packages/sentencepiece/__init__.py:497(_batched_func)
   249997    0.070    0.000    0.188    0.000 /opt/lib/python/site-packages/sentencepiece/__init__.py:330(piece_size)
   250000    0.063    0.000    0.118    0.000 /opt/lib/python/site-packages/sentencepiece/__init__.py:131(GetPieceSize)
   249997    0.058    0.000    0.178    0.000 /opt/lib/python/site-packages/sentencepiece/__init__.py:137(IdToPiece)
   250000    0.055    0.000    0.055    0.000 {built-in method sentencepiece._sentencepiece.SentencePieceProcessor_GetPieceSize}
        1    0.054    0.054    5.864    5.864 <string>:1(<module>)
413708/402545    0.049    0.000    0.065    0.000 {built-in method builtins.isinstance}
        1    0.048    0.048    0.088    0.088 /opt/lib/python/site-packages/transformers/models/bert/tokenization_bert.py:117(load_vocab)
       10    0.048    0.005    0.048    0.005 /usr/lib/python3.9/json/decoder.py:343(raw_decode)
   318483    0.048    0.000    0.048    0.000 {method 'startswith' of 'str' objects}
        1    0.043    0.043    0.043    0.043 /opt/lib/python/site-packages/transformers/models/bert/tokenization_bert.py:205(<listcomp>)
  2. preprocess()
import asyncio
import cProfile
import model
import pstats
# Load Model
model_server = model.ArticleDescriptionsModel("article_descriptions")
profile = cProfile.Profile()
input = {"lang": "en", "title": "Clandonald", "num_beams": 2}
# Preprocess
profile.run("asyncio.run(model_server.preprocess(input))")
results = pstats.Stats(profile)
results.sort_stats('time')
results.print_stats(20)

>>>
         23701 function calls (22483 primitive calls) in 0.135 seconds

   Ordered by: internal time
   List reduced from 853 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       75    0.087    0.001    0.087    0.001 {method 'poll' of 'select.epoll' objects}
       11    0.005    0.000    0.005    0.000 {method 'do_handshake' of '_ssl._SSLSocket' objects}
    898/8    0.003    0.000    0.008    0.001 /usr/lib/python3.9/copy.py:128(deepcopy)
      149    0.002    0.000    0.002    0.000 {built-in method __new__ of type object at 0x906da0}
       75    0.001    0.000    0.134    0.002 /usr/lib/python3.9/asyncio/base_events.py:1815(_run_once)
       13    0.001    0.000    0.001    0.000 {method 'acquire' of '_thread.lock' objects}
    120/8    0.001    0.000    0.007    0.001 /usr/lib/python3.9/copy.py:258(_reconstruct)
       16    0.001    0.000    0.002    0.000 /usr/lib/python3.9/http/cookies.py:539(__parse_string)
       14    0.001    0.000    0.001    0.000 {method 'send' of '_socket.socket' objects}
       27    0.001    0.000    0.001    0.000 {method 'recv' of '_socket.socket' objects}
     96/8    0.001    0.000    0.007    0.001 /usr/lib/python3.9/copy.py:226(_deepcopy_dict)
      236    0.001    0.000    0.001    0.000 {method 'match' of 're.Pattern' objects}
        4    0.001    0.000    0.001    0.000 /opt/lib/python/site-packages/aiohttp/client_proto.py:201(data_received)
     2020    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
       75    0.001    0.000    0.088    0.001 /usr/lib/python3.9/selectors.py:452(select)
       16    0.001    0.000    0.001    0.000 {method 'read' of '_ssl._SSLSocket' objects}
      248    0.000    0.000    0.001    0.000 /usr/lib/python3.9/copy.py:242(_keep_alive)
       20    0.000    0.000    0.014    0.001 /opt/lib/python/site-packages/aiohttp/client.py:379(_request)
      120    0.000    0.000    0.001    0.000 {method '__reduce_ex__' of 'object' objects}
     1386    0.000    0.000    0.000    0.000 {built-in method builtins.id}
  3. predict()
import asyncio
import cProfile
import model
import pstats
# Load Model
model_server = model.ArticleDescriptionsModel("article_descriptions")
# Preprocess
input = {"lang": "en", "title": "Clandonald", "num_beams": 2}
preprocessed_data = asyncio.run(model_server.preprocess(input))
profile = cProfile.Profile()
# Predict
profile.run("model_server.predict(preprocessed_data)")
results = pstats.Stats(profile)
results.sort_stats('time')
results.print_stats(20)

>>>
         800038 function calls (791313 primitive calls) in 72.600 seconds

   Ordered by: internal time
   List reduced from 1085 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2748   59.691    0.022   59.691    0.022 {built-in method torch._C._nn.linear}
        9    2.407    0.267   62.770    6.974 /srv/article_descriptions/model_server/descartes/src/models/descartes_mbart.py:51(forward)
      381    1.943    0.005    1.943    0.005 {built-in method torch._C._nn.gelu}
      465    1.030    0.002   22.212    0.048 /opt/lib/python/site-packages/transformers/models/mbart/modeling_mbart.py:163(forward)
      930    0.888    0.001    0.888    0.001 {built-in method torch.bmm}
     1419    0.785    0.001    0.785    0.001 {method 'contiguous' of 'torch._C._TensorBase' objects}
      894    0.715    0.001    0.715    0.001 {built-in method torch.layer_norm}
      489    0.542    0.001    0.542    0.001 {method 'softmax' of 'torch._C._TensorBase' objects}
  4905/14    0.515    0.000   71.717    5.123 /opt/lib/python/site-packages/torch/nn/modules/module.py:1494(_call_impl)
        9    0.453    0.050    0.453    0.050 {method 'log_softmax' of 'torch._C._TensorBase' objects}
     1395    0.409    0.000    0.409    0.000 {method 'reshape' of 'torch._C._TensorBase' objects}
      108    0.389    0.004   13.802    0.128 /opt/lib/python/site-packages/transformers/models/mbart/modeling_mbart.py:595(forward)
     2935    0.307    0.000    0.307    0.000 {method 'view' of 'torch._C._TensorBase' objects}
        9    0.206    0.023    0.206    0.023 {built-in method torch.topk}
     2349    0.204    0.000    0.204    0.000 {method 'transpose' of 'torch._C._TensorBase' objects}
     2748    0.204    0.000   59.903    0.022 /opt/lib/python/site-packages/torch/nn/modules/linear.py:113(forward)
       10    0.193    0.019    0.193    0.019 {built-in method torch.stack}
     1504    0.123    0.000    0.329    0.000 /opt/lib/python/site-packages/torch/nn/functional.py:1235(dropout)
     4935    0.113    0.000    0.113    0.000 {built-in method torch._C._get_tracing_state}
     1504    0.104    0.000    0.104    0.000 {built-in method torch.dropout}

The profiling data shows significant performance disparities between the three methods. While preprocess() is efficient at ~0.14s, both load() (~5.85s) and predict() (~72.60s) have much longer execution times. I am going to explore ways to improve the performance of these two methods.

isarantopoulos moved this task from Unsorted to In Progress on the Machine-Learning-Team board.
isarantopoulos set the point value for this task to 3.

Great work Kevin and thorough results!
The profiling helps a lot to understand what is going on. So the issue is in the predict function. For things that are internal to torch and transformers there isn't much we can do (of course a GPU would really help with that).
Perhaps the only thing worth checking is whether there is anything that could be optimized in the descartes package, but I would suggest not going down that road.
For the time being we can continue with testing some resource configurations and try to use lower precision and quantization, if possible, to make the model size smaller.
I tried using torch.half() but had no luck, as I was missing some tensor conversions. With this approach both the model and the inputs should be converted to float16 instead of the default float32.
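For reference, the float16 idea boils down to something like the following minimal sketch (using a stand-in nn.Linear rather than the actual model; the point is only that weights and inputs must be converted together):

import torch
import torch.nn as nn

# Minimal sketch of the float16 approach: both the model weights and any
# floating-point inputs have to be converted, otherwise torch raises a
# dtype-mismatch error at the first matmul. A stand-in Linear layer is used
# here instead of the actual mBART-based model.
model = nn.Linear(16, 4).eval()
model = model.half()                      # float32 -> float16 weights
inputs = torch.randn(1, 16)
inputs = inputs.half()                    # inputs must match the weight dtype
print(model.weight.dtype, inputs.dtype)   # torch.float16 torch.float16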

@isarantopoulos, thank you for the recommendations. I agree with you regarding things internal to torch, transformers and not going down the road of trying to optimize the descartes package. I'll continue looking into other optimization options.

@elukey suggested that we try adjusting the OMP_NUM_THREADS environment variable to match the number of CPUs. Here are the performance results after testing this option on the ML sandbox:

| CPUs | OMP_NUM_THREADS | No. of Threads | Response Time |
| ---- | --------------- | -------------- | ------------- |
| 1    | 1               | 22             | 0m15.051s     |
| 2    | 2               | 24             | 0m8.783s      |
| 4    | 4               | 28             | 0m5.106s      |
| 8    | 8               | 36             | 0m3.707s      |

The results we got after setting the OMP_NUM_THREADS option are much better compared to the ones we had before in T353127#9398823.
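For reference, the setting boils down to something like this minimal sketch (illustrative only; the actual place where it is configured for the LiftWing container is not shown here):

import os

# OMP_NUM_THREADS must be set before torch is imported for OpenMP to pick it up.
# "8" matches the number of CPUs allocated to the container in the last row above.
os.environ["OMP_NUM_THREADS"] = "8"

import torch

# Equivalent knob from inside the process: cap torch's intra-op thread pool.
torch.set_num_threads(int(os.environ["OMP_NUM_THREADS"]))
print(torch.get_num_threads())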

Thanks @kevinbazira for these tests! Does bumping the RAM help at all or is that largely out of the question for other reasons?

@Isaac, bumping the RAM does not change much, as shown in the request below that was run with 1 CPU and 8GB of memory; it returns results similar to the 1 CPU and 4GB of memory run in T353127#9406015.

somebody@eacb54ccb3b3:/srv/article_descriptions$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H "Content-type: application/json"
{"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"],"blp":false,"lang":"en","title":"Clandonald","num_beams":2}
real	0m15.311s
user	0m0.018s
sys	0m0.009s

A few days ago, we looked at the maximum memory usage of the article-descriptions model-server and it was ~3.45GB, as shown below:

kevinbazira@ml-sandbox:~$ cat /sys/fs/cgroup/memory/docker/864ffd7c908992bd20ae5f8618f4618b4227e0d9e02e761d9667d8f951f7cb70/memory.max_usage_in_bytes
3447660544
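Converting that byte count, for reference:

# The cgroup value above is reported in bytes.
max_usage_bytes = 3447660544
print(f"{max_usage_bytes / 10**9:.2f} GB")   # 3.45 GB (decimal)
print(f"{max_usage_bytes / 2**30:.2f} GiB")  # 3.21 GiB (binary)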

I have explored dynamic quantization on the ML sandbox and this almost halved the response time we had in T353127#9406015:

| CPUs | Response Time |
| ---- | ------------- |
| 1    | 0m6.397s      |
| 2    | 0m4.019s      |
| 4    | 0m2.497s      |
| 8    | 0m1.859s      |

Although this gives a much better response time, it uses 8GB of memory and the prediction has changed. In T353127#9398823 the prediction was ["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]; now it is ["Town in Alberta, Canada","Hamlet in Alberta, Canada"], as shown below:

somebody@c4c12ad0eaaf:/srv/article_descriptions$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H "Content-type: application/json"
{"prediction":["Town in Alberta, Canada","Hamlet in Alberta, Canada"],"blp":false,"lang":"en","title":"Clandonald","num_beams":2}
real	0m1.859s
user	0m0.006s
sys	0m0.010s
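The quantization used here is PyTorch's built-in dynamic (post-training) quantization; a minimal sketch with a stand-in model (the actual mBART-based model and module selection are assumptions, not the model-server code):

import torch
import torch.nn as nn

# Minimal sketch of dynamic quantization: weights of the selected module types
# are stored as int8 and dequantized on the fly at inference time. Linear layers
# are targeted because torch._C._nn.linear dominates predict() in the profile above.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 16)).eval()

quantized = torch.quantization.quantize_dynamic(
    model,              # the float32 model
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,  # 8-bit integer weights
)

with torch.no_grad():
    print(quantized(torch.randn(1, 128)).shape)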

@Isaac, could you confirm whether these changes in prediction might have a negative impact on quality and reliability? It's important for us to weigh the performance gains against any potential compromises in prediction accuracy.

@kevinbazira thanks for checking on the RAM and these other experiments!

bumping the RAM does not change much

Bummer but in some ways I'm glad the RAM usage is relatively low.

@Isaac, could you confirm whether these changes in prediction might have a negative impact on quality and reliability? It's important for us to weigh the performance gains against any potential compromises in prediction accuracy.

Looping in @JTannerWMF as Android should weigh in. My personal feeling is that we strongly want to avoid changing outputs at this point given that we did a fair bit of human evaluation of the non-quantized model to feel comfortable with moving forward with it to production. It looks like you were able to get it to the ~2sec range too which is great to see and brings us to parity with the Cloud VPS instance at least. Would it be possible to try CTranslate instead (or something similar that can optimize without quantizing)? It looks like they're able to achieve significant speed-ups even when they don't enable quantization, and I know that Santhosh has been pretty happy with it in his work (I've personally tried it on my local computer and get segmentation faults when trying to convert models, but I'm hopeful that's not the general experience).

It looks like you were able to get it to the ~2sec range too which is great to see and brings us to parity with the Cloud VPS instance at least.

I am very curious about this point, since without quantization it seems that we are doubling the time that it takes to respond. I tried comparing the following two requests, and the running times are not very different:

$ time curl -i "https://ml-article-description-api.wmcloud.org/article?lang=en&title=Clandonald&num_beams=2"
HTTP/2 200 
server: nginx/1.18.0
date: Mon, 18 Dec 2023 15:15:19 GMT
content-type: application/json
content-length: 758
access-control-allow-origin: *
strict-transport-security: max-age=31622400
x-clacks-overhead: GNU Terry Pratchett

{"lang":"en","title":"Clandonald","blp":false,"num_beams":2,"groundtruth":"Hamlet in Alberta, Canada","latency":{"wikidata-info (s)":0.0836176872253418,"total network (s)":0.15259981155395508,"total (s)":2.874441146850586},"features":{"descriptions":{"fr":"hameau d'Alberta","en":"hamlet in central Alberta, Canada"},"first-paragraphs":{"en":"Clandonald is a hamlet in central Alberta, Canada within the County of Vermilion River. It is located approximately 28 kilometres (17 mi) north of Highway 16 and 58 kilometres (36 mi) northwest of Lloydminster.","fr":"Clandonald est un hameau (hamlet) du Comté de Vermilion River, situé dans la province canadienne d'Alberta."}},"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"]}

real	0m3.305s
user	0m0.074s
sys	0m0.031s
elukey@stat1004:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"],"blp":false,"lang":"en","title":"Clandonald","num_beams":2}
real	0m3.951s
user	0m0.022s
sys	0m0.005s

Both of them are running on 8 virtual cores (I manually modified the allocated CPUs in staging, since it is currently set to only 4), and of course the last one misses the Internet latency. The Lift Wing version is not that far from the Cloud VPS one, so I am wondering if we need to agree on which URLs are part of the testing set, since we may end up setting different expectations. Does that make sense?

@Isaac CTranslate2 supports specific models out of the box (NLLB is one of them). Although mBART is supported and I was able to convert it, there are a lot of custom functions (forward passes and sample generation) related to the Descartes model which make it difficult to apply CTranslate2 easily.
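For context, the conversion step itself is short; a sketch using CTranslate2's Transformers converter (the checkpoint name and output directory are placeholders, and this does not cover the Descartes-specific forward passes or sample generation):

import ctranslate2

# Sketch: convert a Hugging Face mBART checkpoint to the CTranslate2 format.
# The checkpoint name and output directory are illustrative, not the actual model.
converter = ctranslate2.converters.TransformersConverter("facebook/mbart-large-cc25")
converter.convert("mbart_ct2", quantization=None)  # quantization can stay disabled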

CTranslate2 supports specific models out of the box (NLLB is one of them). Although mBART is supported and I was able to convert it, there are a lot of custom functions (forward passes and sample generation) related to the Descartes model which make it difficult to apply CTranslate2 easily.

Ahh - thanks for that clarification @isarantopoulos! Then it's maybe not a great use of time, given that we're hoping to have GPUs eventually solve this problem for us.

The Lift Wing version is not that far from the cloud VPS one, so I am wondering if we need to agree on what URLs are part of the testing set since we may end up setting different expectations. Does it make sense?

Good point @elukey: I don't have an established testing set for latency but I did some explorations a while back that I just refreshed (PAWS notebook). The latency seems to be a function of the number of input tokens to the model and the number of output tokens (more tokens = more latency). I tested 75 random examples and got a best case of 1 second, a median of 3 seconds, and a worst case of 13 seconds. These are randomly selected, so presumably there's a worst case out there that's significantly worse, but the median should be accurate. A few examples from that notebook to help guide us (note I'm using 3 beams, which is the intention for production). I didn't want to publish the LiftWing numbers here because I don't know if they're still using the 8 cores and don't want to make an improper comparison.

Median (standard)

$ time curl -i "https://ml-article-description-api.wmcloud.org/article?lang=my&title=ပုံစားရွာ&num_beams=3"

{"lang":"my","title":"ပုံစားရွာ","blp":false,"num_beams":3,"groundtruth":null,"latency":{"wikidata-info (s)":0.10079646110534668,"total network (s)":0.17826342582702637,"total (s)":3.071535348892212},"features":{"descriptions":{"nl":"nederzetting in Myanmar","ar":"مستوطنة في ميانمار","en":"human settlement in Myanmar","de":"Siedlung in Myanmar"},"first-paragraphs":{"my":"ပုံစားရွာ"}},"prediction":["မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အ","မြန်မာနိုင်ငံ၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်","မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊"]}

real	0m3.322s
user	0m0.017s
sys	0m0.021s

Lift-Wing call:

isaacj@stat1008:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "my", "title":"ပုံစားရွာ", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1

Worst case

$ time curl -i "https://ml-article-description-api.wmcloud.org/article?lang=ja&title=%E5%9B%BD%E5%AE%B6%E3%81%AB%E3%82%88%E3%82%8B%E8%87%AA%E7%94%B1&num_beams=3"

{"lang":"ja","title":"国家による自由","blp":false,"num_beams":3,"groundtruth":null,"latency":{"wikidata-info (s)":0.08899879455566406,"total network (s)":0.16272592544555664,"total (s)":13.645478010177612},"features":{"descriptions":{},"first-paragraphs":{"ja":"国家による自由(こっかによるじゆう)とは、国家が介入することにより、国民が得られる自由ないしは権利のことを一般にいう。「国家からの自由」と対にして用いられることもあり、一般に自由権を指す「国家からの自由」に対し、国家による自由には、社会権が主に含まれる。社会権は、規制対象として\nむしろ国家に対し、特定の政策目標達成のための施策を行うことを求める権利である。当該政策目標が達成されることで、国民が取りうる活動の範囲が広がり、または、生活の基盤が確立される可能性から、「自由」として捉えるのが、国家による自由の場合の自由の語義となる。政府のサポートに基づいて得られる積極的自由として、自由権における消極的自由と対比して用いられることもある。国家による自由は、概念的に実体規定と手続規定を包摂するが、受動主体による手続形態は国家の公的概念によって遮断される。手続形態は行政概念にかかるものであり、受動主体は実体規定への自由が認められるものである。私的な手続形態は、公正概念および公的評価になじまないからである。"}},"prediction":["国家が介入することにより、国民が得られる自由","国家による自由がないしは権利のことを一般にいう","国家による自由"]}

real	0m13.890s
user	0m0.019s
sys	0m0.023s

Lift-Wing call:

isaacj@stat1008:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "ja", "title": "国家による自由", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
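For reference, the kind of per-request timing loop behind these best/median/worst numbers can be sketched like this (illustrative only; the sample titles are just the examples from this thread, not the actual 75-article test set or the notebook's code):

import statistics
import time

import requests

# Sketch of the latency benchmark: time a set of (lang, title) requests against
# the Cloud VPS API and report best / median / worst case.
SAMPLE = [("en", "Clandonald"), ("my", "ပုံစားရွာ"), ("ja", "国家による自由")]
URL = "https://ml-article-description-api.wmcloud.org/article"

latencies = []
for lang, title in SAMPLE:
    start = time.time()
    r = requests.get(URL, params={"lang": lang, "title": title, "num_beams": 3}, timeout=60)
    r.raise_for_status()
    latencies.append(time.time() - start)

print(f"best:   {min(latencies):.1f}s")
print(f"median: {statistics.median(latencies):.1f}s")
print(f"worst:  {max(latencies):.1f}s")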

Thank you for sharing these benchmarks, @Isaac. I have made the same requests in the ML sandbox with 8 CPUs and here are the results:

Median (standard)

Without quantization:

somebody@c4c12ad0eaaf:/srv/article_descriptions$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "my", "title":"ပုံစားရွာ", "num_beams": 3}' -H "Content-type: application/json"
{"prediction":["မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အ","မြန်မာနိုင်ငံ၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်","မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊"],"blp":false,"lang":"my","title":"ပုံစားရွာ","num_beams":3}
real	0m4.321s
user	0m0.014s
sys	0m0.011s

Quantization applied:

somebody@c4c12ad0eaaf:/srv/article_descriptions$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "my", "title":"ပုံစားရွာ", "num_beams": 3}' -H "Content-type: application/json"
{"prediction":["မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အ","မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်ရှိ အ","မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်မြို့ရှိ"],"blp":false,"lang":"my","title":"ပုံစားရွာ","num_beams":3}
real	0m2.509s
user	0m0.016s
sys	0m0.009s

Worst case

Without quantization:

somebody@c4c12ad0eaaf:/srv/article_descriptions$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "ja", "title": "国家による自由", "num_beams": 3}' -H "Content-type: application/json"
{"prediction":["国家が介入することにより、国民が得られる自由","国家による自由がないしは権利のことを一般にいう","国家による自由"],"blp":false,"lang":"ja","title":"国家による自由","num_beams":3}
real	0m17.429s
user	0m0.018s
sys	0m0.009s

Quantization applied:

somebody@c4c12ad0eaaf:/srv/article_descriptions$ time curl localhost:8080/v1/models/article-descriptions:predict -X POST -d '{"lang": "ja", "title": "国家による自由", "num_beams": 3}' -H "Content-type: application/json"
{"prediction":["国家が介入することにより、国民が得られる自由や思想の総称","国家が保有することにより、国民が得られる自由の権利","国家が保有することにより、国民が得られる自由の権利の一つ"],"blp":false,"lang":"ja","title":"国家による自由","num_beams":3}
real	0m17.205s
user	0m0.018s
sys	0m0.009s

I am going to push a patch to increase the CPUs on LiftWing to 8 so that we can run the same tests there.

Change 983244 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: bump CPUs to compare with Research team benchmarks

https://gerrit.wikimedia.org/r/983244

Change 983244 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: bump CPUs to compare with Research team benchmarks

https://gerrit.wikimedia.org/r/983244

I have run the same tests on LiftWing with 8 CPUs and no quantization. Here are the results:

Median (standard)

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "my", "title":"ပုံစားရွာ", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1

{"prediction":["မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အ","မြန်မာနိုင်ငံ၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်","မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊"],"blp":false,"lang":"my","title":"ပုံစားရွာ","num_beams":3}
real	0m5.557s
user	0m0.008s
sys	0m0.004s

Worst case

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "ja", "title": "国家による自由", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["国家が介入することにより、国民が得られる自由","国家による自由がないしは権利のことを一般にいう","国家による自由"],"blp":false,"lang":"ja","title":"国家による自由","num_beams":3}
real	0m16.371s
user	0m0.010s
sys	0m0.004s

Both the LiftWing and ML sandbox requests without quantization applied are slower than the Cloud VPS instance.

@kevinbazira thanks for the updates and additional test points!

Update re Cloud VPS API: I realized that the wmcloud API is running an older version of the library. The newer version (which was adjusted to reduce local dependencies and thus be a lot simpler to maintain long-term) does seem to run about half a second or so slower. Apologies for not thinking to check that sooner. This fact doesn't help with achieving the desired latency, but it does help with understanding why equal CPU/RAM resources might be resulting in a slower LiftWing response.

An attempt at summarizing where we're at, assuming these examples generalize:

  • LiftWing is probably at a median latency of ~4 seconds with 8 CPUs and 4GB RAM. Additional RAM doesn't seem to have an effect.
  • GPUs are still a number of months out so that's not a short-term solution for achieving equal quality with lower latency.
  • CTranslate would be difficult to apply so isn't a good fit (custom transformations to again aim for equal quality at lower latency).
  • Quantization can lower median latency to say ~2 seconds but comes at the cost of some unknown change in quality (in practice generally small but we wouldn't know for sure without further evaluation that takes a while to gather thoroughly).

Are there pieces I'm missing or other ways to get to lower latency with equal quality that we haven't tested?

@Isaac, thank you for letting us know about the Cloud VPS API using an older library version. We had been wondering why the LiftWing latency was not matching that of the Cloud VPS instance even though we matched the environment resources, but this helps to explain it.

Regarding other ways to get to lower latency with equal quality: today we tried using less aggressive quantization with float16 instead of the default qint8 that we had used earlier. We managed to get predictions of equal quality. Unfortunately, this ran at ~4s, which is no different from the non-quantized option, so not much is gained with this option.
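For reference, the float16 variant only changes the dtype passed to the same dynamic-quantization call sketched earlier (again with a stand-in model rather than the actual one):

import torch
import torch.nn as nn

# Same dynamic-quantization call as in the earlier sketch, but storing Linear
# weights as float16 instead of qint8. A stand-in model is used for illustration.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 16)).eval()
quantized_fp16 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.float16)

with torch.no_grad():
    print(quantized_fp16(torch.randn(1, 128)).shape)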

thank you for letting us know about the Cloud VPS API using an older library version. We had been wondering why the LiftWing latency was not matching that of the Cloud VPS instance even though we matched the environment resources, but this helps to explain it.

Yeah, sorry again about that. I didn't think the rewrite would affect latency, so I didn't think to test it. I'm leaving the wmcloud API on the old version for now because the newer version doesn't have all the corresponding nginx etc. config that I'd need to set it up as a full wmcloud API, but I installed a local Flask version on wmcloud that I can test as needed.

Regarding other ways to get to lower latency with equal quality: today we tried using less aggressive quantization with float16 instead of the default qint8 that we had used earlier. We managed to get predictions of equal quality. Unfortunately, this ran at ~4s, which is no different from the non-quantized option, so not much is gained with this option.

Yeah, agreed then that this isn't worth the switch even if quality doesn't seem to shift much. I'm mostly out of ideas so unless folks have other thoughts, I'll let Android and ML Platform work out what this might mean. It does seem that we've gotten closer to what users would have experienced in the pilot but are probably still at least a second behind that due to the code changes.

Change 984236 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: bump CPUs to improve model-server performance

https://gerrit.wikimedia.org/r/984236

Change 984236 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: bump CPUs to improve model-server performance

https://gerrit.wikimedia.org/r/984236

After trying different optimization options (T353127#9416933) to meet the Android team's requirement of a 500ms-3s response time, we were able to improve the response time for the original request from ~14s to ~4s without affecting the prediction quality. The option that gave us consistent improvement was increasing CPUs: we went from 1 to 8 to match the Cloud VPS instance. As a final attempt, today we tested 16 CPUs, and below are the results:

Original request

Improved from the ~4s we had in T353127#9406015 to ~3s:

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 2}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["Hamlet in Alberta, Canada","human settlement in Alberta, Canada"],"blp":false,"lang":"en","title":"Clandonald","num_beams":2}
real	0m2.584s
user	0m0.006s
sys	0m0.008s

Median (standard)

Improved from the ~6s we had in T353127#9415033 to ~4s:

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "my", "title":"ပုံစားရွာ", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အ","မြန်မာနိုင်ငံ၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်ပြည်နယ်၊ အင်္ဂန်","မြန်မာနိုင်ငံ၊ အင်္ဂလန်၊ အင်္ဂလန်၊ အင်္ဂလန်၊"],"blp":false,"lang":"my","title":"ပုံစားရွာ","num_beams":3}
real	0m4.173s
user	0m0.013s
sys	0m0.001s

Worst case

Improved from the ~16s we had in T353127#9415033 to ~11s:

kevinbazira@deploy2002:~$ time curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "ja", "title": "国家による自由", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
{"prediction":["国家が介入することにより、国民が得られる自由","国家による自由がないしは権利のことを一般にいう","国家による自由"],"blp":false,"lang":"ja","title":"国家による自由","num_beams":3}
real	0m10.873s
user	0m0.013s
sys	0m0.000s

The results show that the original request reached the ~3s goal. SREs have advised that, for now, 16 CPUs should be the upper limit for compute resource allocation.

@Isaac is there a dataset/set of requests we can use to run load tests? I'm mostly asking whether there is a set that was used during the pilot; otherwise we can grab a set of random articles.
We would like to run load tests to see what kind of traffic we can accommodate.

is there a dataset/set of requests we can use to run load tests?

@isarantopoulos I didn't do anything special in selecting articles for load-testing beyond choosing them at random. You can use the set I tested in https://public-paws.wmcloud.org/User:Isaac%20(WMF)/Article%20Descriptions/API_testing-updated.ipynb#Results (you can download the results TSV here with article titles / metadata) or just copy that Python code to generate your own test set. FYI, the API will error when you request a description for an article that doesn't have a Wikidata item, which is something you might run into on languages like my where it seems to be more common.
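For reference, a random test set can also be sampled directly from the wikis; a minimal sketch (the languages, sample size, and User-Agent are arbitrary, and this is not the notebook's code):

import requests

# Sketch: sample random main-namespace article titles per language to use as
# load-test inputs. Articles without a Wikidata item will make the
# article-descriptions API return an error, so expect some failures for
# languages like "my".
def random_titles(lang: str, n: int = 25) -> list[str]:
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "random",
            "rnnamespace": 0,   # main (article) namespace only
            "rnlimit": n,
            "format": "json",
        },
        headers={"User-Agent": "article-descriptions-load-test (sketch)"},
        timeout=30,
    )
    resp.raise_for_status()
    return [page["title"] for page in resp.json()["query"]["random"]]

test_set = {lang: random_titles(lang) for lang in ("en", "fr", "my", "ja")}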