
Use non-blocking HTTP calls to get outlinks for Outlinks topic model
Closed, Resolved · Public

Description

Similar to T309623, but this task is for the Outlinks topic model (a non-revscoring-based model). We should test whether we can reduce latency by using non-blocking HTTP calls in KServe to fetch outlinks from an article, together with their associated Wikidata IDs, for predicting article topics. Currently the outlink_transformer calls the MW API with blocking code.
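
As a rough illustration of the idea (not the actual patch), the blocking call in the transformer's preprocess() can be replaced with Tornado's AsyncHTTPClient so the event loop keeps serving other requests while waiting on the MW API. Class, method, and field names below are illustrative, and the input schema is assumed:

# Minimal sketch of a non-blocking transformer, assuming KServe's Model class
# and Tornado's AsyncHTTPClient. OutlinkTransformer/get_outlinks are
# illustrative names, not the code from the patch.
import json
from urllib.parse import quote

import kserve
from tornado.httpclient import AsyncHTTPClient


class OutlinkTransformer(kserve.Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host
        self.http = AsyncHTTPClient()

    async def get_outlinks(self, lang: str, title: str) -> list:
        """Fetch outlinks and their Wikidata IDs without blocking the event loop."""
        url = (
            f"https://{lang}.wikipedia.org/w/api.php"
            f"?action=query&format=json&generator=pagelinks"
            f"&titles={quote(title)}&gpllimit=50"
            f"&prop=pageprops&ppprop=wikibase_item"
        )
        resp = await self.http.fetch(url)  # yields control while waiting on I/O
        pages = json.loads(resp.body).get("query", {}).get("pages", {})
        return [
            p["pageprops"]["wikibase_item"]
            for p in pages.values()
            if "pageprops" in p
        ]

    async def preprocess(self, inputs: dict) -> dict:
        # Input schema assumed here for illustration.
        outlinks = await self.get_outlinks(inputs["lang"], inputs["page_title"])
        return {"features": outlinks}

With a coroutine-based preprocess(), concurrent requests overlap their MW API waits instead of queueing behind a single blocking call, which is where the latency win should come from.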

Event Timeline

Change 807135 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] outlink: use tornado async http client to fetch outlinks

https://gerrit.wikimedia.org/r/807135

Some test results for the model using async HTTP calls:

aikochou@ml-sandbox:~/isvcs/outlink$ wrk -c 1 -t 1 --timeout 10s -s inference.lua http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict --latency
Running 10s test @ http://192.168.49.2:30066/v1/models/outlink-topic-model:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.55s   335.17ms   2.75s    66.67%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    2.73s 
     75%    2.75s 
     90%    2.75s 
     99%    2.75s 
  3 requests in 10.02s, 1.83KB read
Requests/sec:      0.30
Transfer/sec:     187.17B

aikochou@ml-sandbox:~/isvcs/outlink$ wrk -c 4 -t 2 --timeout 10s -s inference.lua http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict --latency
Running 10s test @ http://192.168.49.2:30066/v1/models/outlink-topic-model:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.82s   641.08ms   4.17s    58.33%
    Req/Sec     0.75      1.14     3.00     83.33%
  Latency Distribution
     50%    2.85s 
     75%    3.15s 
     90%    3.54s 
     99%    4.17s 
  12 requests in 10.02s, 7.32KB read
Requests/sec:      1.20
Transfer/sec:     748.68B

aikochou@ml-sandbox:~/isvcs/outlink$ wrk --timeout 10s -s inference.lua http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict --latency
Running 10s test @ http://192.168.49.2:30066/v1/models/outlink-topic-model:predict
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.34s   222.14ms   2.98s    63.16%
    Req/Sec     5.23      4.73    20.00     48.39%
  Latency Distribution
     50%    2.30s 
     75%    2.54s 
     90%    2.67s 
     99%    2.98s 
  38 requests in 10.02s, 23.19KB read
Requests/sec:      3.79
Transfer/sec:      2.32KB

For the model using a blocking mwapi session:

aikochou@ml-sandbox:~/isvcs/outlink$ wrk -c 1 -t 1 --timeout 10s -s inference.lua http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict --latency
Running 10s test @ http://192.168.49.2:30066/v1/models/outlink-topic-model:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.09s   101.48ms   2.24s    75.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    2.07s 
     75%    2.24s 
     90%    2.24s 
     99%    2.24s 
  4 requests in 10.02s, 2.44KB read
Requests/sec:      0.40
Transfer/sec:     249.62B
aikochou@ml-sandbox:~/isvcs/outlink$ wrk -c 4 -t 2 --timeout 10s -s inference.lua http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict --latency
Running 10s test @ http://192.168.49.2:30066/v1/models/outlink-topic-model:predict
  2 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.50s     1.00s   10.00s    75.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%   10.00s 
     75%   10.00s 
     90%   10.00s 
     99%   10.00s 
  4 requests in 10.02s, 2.44KB read
Requests/sec:      0.40
Transfer/sec:     249.58B
aikochou@ml-sandbox:~/isvcs/outlink$ wrk --timeout 10s -s inference.lua http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict --latency
Running 10s test @ http://192.168.49.2:30066/v1/models/outlink-topic-model:predict
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us    -nan%
    Req/Sec     0.00      0.00     0.00      -nan%
  Latency Distribution
     50%    0.00us
     75%    0.00us
     90%    0.00us
     99%    0.00us
  0 requests in 10.02s, 0.00B read
Requests/sec:      0.00
Transfer/sec:       0.00B
achou changed the task status from Open to In Progress.Jun 21 2022, 3:17 PM
achou moved this task from Parked to In Progress on the Machine-Learning-Team (Active Tasks) board.
achou renamed this task from Use async http client of Tornado to get outlinks from the article to Use non-blocking HTTP calls to get outlinks for Outlinks topic model.Jul 19 2022, 3:15 PM
achou changed the task status from In Progress to Open.
achou updated the task description. (Show Details)

Change 807135 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] outlink: use async HTTP calls to fetch data

https://gerrit.wikimedia.org/r/807135

Change 818052 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] outlink: allow accessing MediaWiki API through internal endpoint

https://gerrit.wikimedia.org/r/818052

Change 818052 merged by Elukey:

[machinelearning/liftwing/inference-services@main] outlink: allow accessing MediaWiki API through internal endpoint

https://gerrit.wikimedia.org/r/818052
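
For context: a common pattern for this kind of change (an assumption on my part, not a description of the patch) is to send requests to an internal service address while overriding the Host header so MediaWiki still routes them to the right wiki. The endpoint and variable names below are illustrative:

# Hypothetical sketch of calling the MW API through an internal endpoint.
# MW_API_ENDPOINT / MW_API_HOST are illustrative names, not the real config.
import os
import aiohttp

MW_API_ENDPOINT = os.environ.get("MW_API_ENDPOINT", "https://en.wikipedia.org")
MW_API_HOST = os.environ.get("MW_API_HOST", "en.wikipedia.org")

async def mw_api_get(params: dict) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.get(
            f"{MW_API_ENDPOINT}/w/api.php",
            params=params,
            headers={"Host": MW_API_HOST},  # route to the right wiki internally
        ) as resp:
            return await resp.json()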

I found that the response time of the outlink model depends heavily on the input article.

When the queried article is long and has many wikilinks (the feature the outlink model uses to infer article topics), for instance the article Toni Morrison, it will probably take multiple continuing queries to fetch all the wikilinks (see API:Query#Example_4:_Continuing_queries; a sketch of this loop follows below). Currently the MW API returns 50 links per call, determined by the gpllimit parameter; I'm not sure what the maximum value we can set is. The response time for this query in prod is 2.939s.

But when the queried article is shorter or has fewer wikilinks, for instance the article Wings of Fire (novel series), the response time for the query in prod is only 0.330s.
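
To make the continuation behavior concrete, here is a rough sketch of the query loop (using the requests library for brevity; the parameter values are illustrative, not the service's actual code). At gpllimit=50, an article with around a thousand outlinks costs roughly twenty round trips, which accounts for most of the 2.939s above:

# Sketch of MW API "continuing queries": each response may carry a
# "continue" block that must be echoed back until all results are returned.
import requests

params = {
    "action": "query",
    "format": "json",
    "generator": "pagelinks",
    "titles": "Toni Morrison",
    "gpllimit": 50,             # links returned per call: the limiting factor
    "prop": "pageprops",
    "ppprop": "wikibase_item",  # Wikidata ID for each outlinked page
}
outlinks = []
while True:
    data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    for page in data.get("query", {}).get("pages", {}).values():
        if "pageprops" in page:
            outlinks.append(page["pageprops"]["wikibase_item"])
    if "continue" not in data:
        break                        # all links fetched
    params.update(data["continue"])  # carry the continuation token forward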

If we look at the logs for the predictor pod, we see the model takes only a few milliseconds to run inference in both cases:

[I 220801 10:32:29 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 4.51ms
[I 220801 11:15:21 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 2.85ms

If we look at the logs for the transformer pod, there is a big difference:

[I 220801 10:32:29 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 2702.58ms
[I 220801 11:15:21 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 156.97ms

Performance test results

Test article: Toni Morrison

aikochou@deploy1002:~$ wrk -c 1 -t 1 --timeout 10s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.79s    81.66ms   1.88s    60.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    1.77s 
     75%    1.86s 
     90%    1.88s 
     99%    1.88s 
  5 requests in 10.02s, 3.05KB read
Requests/sec:      0.50
Transfer/sec:     311.94B
aikochou@deploy1002:~$ wrk -c 3 -t 3 --timeout 10s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  3 threads and 3 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.83s   123.87ms   2.08s    66.67%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    1.82s 
     75%    1.91s 
     90%    1.99s 
     99%    2.08s 
  15 requests in 10.02s, 9.16KB read
Requests/sec:      1.50
Transfer/sec:      0.91KB
aikochou@deploy1002:~$ wrk -c 5 -t 5 --timeout 10s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  5 threads and 5 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.77s   113.23ms   1.95s    64.00%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    1.79s 
     75%    1.83s 
     90%    1.92s 
     99%    1.95s 
  25 requests in 10.02s, 15.26KB read
Requests/sec:      2.50
Transfer/sec:      1.52KB
aikochou@deploy1002:~$ wrk -c 10 -t 10 --timeout 10s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.78s   200.52ms   2.53s    89.74%
    Req/Sec     0.00      0.00     0.00    100.00%
  Latency Distribution
     50%    1.75s 
     75%    1.81s 
     90%    1.97s 
     99%    2.53s 
  39 requests in 10.02s, 23.80KB read
  Socket errors: connect 2, read 0, write 0, timeout 0
Requests/sec:      3.89
Transfer/sec:      2.38KB

Test article: Wings of Fire (novel series)

aikochou@deploy1002:~$ wrk -c 1 -t 1 --timeout 2s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   112.51ms   35.21ms 244.94ms   85.71%
    Req/Sec     9.18      3.80    20.00     70.59%
  Latency Distribution
     50%   99.00ms
     75%  114.81ms
     90%  161.80ms
     99%  244.94ms
  36 requests in 10.02s, 18.89KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:      3.59
Transfer/sec:      1.89KB
aikochou@deploy1002:~$ wrk -c 3 -t 3 --timeout 2s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  3 threads and 3 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   148.69ms  143.21ms   1.05s    89.95%
    Req/Sec     9.71      3.19    20.00     84.04%
  Latency Distribution
     50%   96.93ms
     75%  114.16ms
     90%  299.94ms
     99%  755.43ms
  198 requests in 10.02s, 103.90KB read
Requests/sec:     19.76
Transfer/sec:     10.37KB
aikochou@deploy1002:~$ wrk -c 5 -t 5 --timeout 2s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  5 threads and 5 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   165.28ms  178.87ms   1.21s    90.00%
    Req/Sec     9.59      3.14    20.00     84.06%
  Latency Distribution
     50%   97.19ms
     75%  121.95ms
     90%  358.32ms
     99%  956.80ms
  336 requests in 10.02s, 176.31KB read
  Socket errors: connect 0, read 0, write 0, timeout 1
Requests/sec:     33.53
Transfer/sec:     17.59KB
aikochou@deploy1002:~$ wrk -c 10 -t 10 --timeout 2s -s inference.lua https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict --latency
Running 10s test @ https://inference.svc.eqiad.wmnet:30443/v1/models/outlink-topic-model:predict
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   168.71ms  192.80ms   1.50s    89.47%
    Req/Sec     9.48      3.08    20.00     84.41%
  Latency Distribution
     50%   99.48ms
     75%  121.08ms
     90%  380.11ms
     99%    1.04s 
  554 requests in 10.02s, 290.74KB read
  Socket errors: connect 2, read 0, write 0, timeout 1
Requests/sec:     55.29
Transfer/sec:     29.02KB

Overall, we see a performance improvement using the async preprocess(). :)

> Currently the MW API returns 50 links per call, determined by the gpllimit parameter; I'm not sure what the maximum value we can set is.

The "gpllimit" has a max value of 500, so I changed it to 500 to improve the MW API call performance.
https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/837642
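
Relative to the continuation sketch earlier in this task, the change is essentially a one-parameter bump (illustrative, not the literal diff):

params["gpllimit"] = 500  # was 50; Toni Morrison now needs ~2 round trips instead of ~20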

Tested in staging. For the article Toni Morrison, the response time for the MW API call was reduced from 2702.58ms to 490.53ms.

[I 221010 09:25:29 web:2243] 200 POST /v1/models/outlink-topic-model:predict (127.0.0.1) 490.53ms

That's super nice.

But I also observed some warnings in the logs:

[W 221010 09:25:29 async_session:98] 	- main -- {'warnings': 'HTTP used when HTTPS was expected.\nSubscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application.'}
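
The warning indicates the internal call is being made over plain HTTP. If the internal endpoint terminates TLS, pointing the session at an https:// URL should silence it; the endpoint name below is an assumption, not the actual config:

# Assumed fix: use an https:// scheme for the internal endpoint.
import os

MW_API_ENDPOINT = os.environ.get("MW_API_ENDPOINT", "https://api-ro.discovery.wmnet")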