
Document load test results
Open, Medium, Public, 3 Estimated Story Points

Description

As an engineer,
I would like to have all load test results for each model server in one place so that I have a reference for comparing results whenever I make changes and need to check whether latencies have increased.

A good place could be a Markdown file committed to the inference-services repository along with the corresponding input data and commands. That way it would also act as a runbook one can follow before/after deploying a model server if needed.

Event Timeline

calbon triaged this task as Medium priority. Nov 28 2023, 3:52 PM

I think we can close this task as it can be covered by the work done in https://phabricator.wikimedia.org/T348850

@isarantopoulos, I've been looking into locust as we discussed and was going to share my findings here, but noticed you were suggesting to close it.

This is how I was thinking about it:

  1. T348850 is used to choose the load test framework
  2. T351939 works on using the chosen framework to document the load test results and can be made a sub-task of T348850

Should I proceed to make T351939 a sub-task of T348850 and share my results here?

Sure! I just suggested closing it because it was going to be covered by the other task. But making it a subtask is even better!

Great! My exploration of locust mainly focused on how this tool would enable us to store and compare the results of our previous load tests. Here is the procedure I used to store the results of a load test:

1. Environment Setup:

Unlike wrk, which exists on the deploy2002 machine and has allowed us all to use the same test environment, locust is not installed on deploy2002. To keep our test results consistent, the team needs to settle on a single test machine (either stat1008 or ml-sandbox) and install locust in a Python venv. This uniformity will help us avoid variance caused by differences in environment setup.

I set up my testing environment on stat1008 as its response times were comparable to those of deploy2002:

$ python3 -m venv load_test_venv
$ source load_test_venv/bin/activate
$ pip install locust==2.17.0

2. Create Load Test Scripts:

I created article_descriptions.py, which uses locust to simulate the curl request below:

$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
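
For reference, a minimal locustfile along these lines reproduces that request; this is a sketch rather than the exact contents of article_descriptions.py (the user/task structure is an assumption):

from locust import HttpUser, task


class ArticleDescriptionsUser(HttpUser):
    # Base host of the LiftWing staging cluster, taken from the curl command above.
    host = "https://inference-staging.svc.codfw.wmnet:30443"

    @task
    def predict(self):
        # POST the same payload and headers as the curl command.
        self.client.post(
            "/v1/models/article-descriptions:predict",
            json={"lang": "en", "title": "Clandonald", "num_beams": 3},
            headers={
                "Host": "article-descriptions.experimental.wikimedia.org",
                "Content-Type": "application/json",
            },
        )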

3. Run Load Tests & Save Results:

locust comes with a convenient web UI that would let us run tests and download the results easily. Since we work primarily in the terminal and have no browser access on the ml-sandbox, we will mainly use the locust CLI in headless mode.

There are locust plugins available that facilitate result storage in various formats, some of which could even enable us to use these results in our Grafana dashboards.

locust also has a nifty --csv flag that makes it easy to save the results in CSV format. Here is how I used it:

  • created a locust.conf file containing all the flags, so they don't have to be specified on the command line (see the sketch below this list)
  • ran the load test by executing $ locust from the same directory as locust.conf
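
For illustration, a locust.conf along the following lines produces the behaviour described above; the specific values (users, spawn rate, run time) are illustrative assumptions, not necessarily the exact flags I used:

# locust.conf -- read automatically when `locust` is run from this directory
locustfile = article_descriptions.py
headless = true
users = 1
spawn-rate = 1
run-time = 10s
# Prefix for the generated CSV files (article_descriptions_stats.csv, etc.)
csv = article_descriptions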

The files produced were: article_descriptions_exceptions.csv, article_descriptions_stats_history.csv, article_descriptions_failures.csv, and article_descriptions_stats.csv

4. Next Steps:

The next step will be to work on comparing results from previous load tests.

I have been studying the files produced in T351939#9466020 and it looks like article_descriptions_stats.csv contains the fields that we'll want to compare:

Type,Name,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
POST,/v1/models/article-descriptions:predict,2,0,3505,3570.5,3505,3636,167.0,0.2448751660661948,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
,Aggregated,2,0,3505,3570.5,3505,3636,167.0,0.2448751660661948,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600

However, this file is regenerated every time we run a test, so we lose the data from the previous test.

To avoid losing this data, we might have to extract it and save it in a separate file along with a timestamp. This way, we can maintain a record of historical data and perform comparative analysis on it later.

locust has a test_stop event listener that can be used to read article_descriptions_stats.csv once it has been generated at the end of a load test run. I have updated article_descriptions.py to use this event hook to extract the "Aggregated" row from article_descriptions_stats.csv (as shown in T351939#9468383) and append it to lw_stats_history.csv; a sketch of the hook follows the table below. Here is what the contents of lw_stats_history.csv look like after running 3 load tests:

Timestamp,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
20240118151948,2,0,3528,3618.5,3528,3709,167.0,0.22888733537397923,0.0,3700,3700,3700,3700,3700,3700,3700,3700,3700,3700,3700
20240118152039,2,0,3500,3567.0,3490,3644,167.0,0.2291656581522369,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
20240118152356,2,0,3439,3772.0,3439,4105,167.0,0.2307884342326772,0.0,4100,4100,4100,4100,4100,4100,4100,4100,4100,4100,4100
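
For reference, the event hook follows roughly this pattern; this is a simplified sketch (using the csv module, as in the version described above) rather than the exact code committed to article_descriptions.py:

import csv
from datetime import datetime

from locust import events


@events.test_stop.add_listener
def lw_stats_history(environment, **kwargs):
    """Append the 'Aggregated' row of the per-run stats CSV to a history file."""
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    with open("article_descriptions_stats.csv", newline="") as stats_file:
        rows = list(csv.reader(stats_file))
    header, data = rows[0], rows[1:]
    aggregated = next(row for row in data if "Aggregated" in row)
    with open("lw_stats_history.csv", "a", newline="") as history_file:
        writer = csv.writer(history_file)
        if history_file.tell() == 0:
            # First run: write the header, replacing Type/Name with a Timestamp column.
            writer.writerow(["Timestamp"] + header[2:])
        writer.writerow([timestamp] + aggregated[2:])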

The next step will be to run a comparative analysis on this data.

In order to compare the historical data from T351939#9469592, I updated article_descriptions.py with lw_stats_analysis() and changed lw_stats_history() to use pandas instead of the csv module (a sketch of the analysis function appears after the output below). Below is what the comparison report looks like for a given lw_stats_history.csv.

1. Comparison Report: generated by lw_stats_analysis() to add an easily digestible, LiftWing-specific analysis when a locust load test run has completed.

(load_test_venv) kevinbazira@stat1008:~/liftwing_load_tests/article_descriptions$ locust
[2024-01-19 11:41:49,562] stat1008/INFO/locust.main: Python 3.7 support is deprecated and will be removed soon
[2024-01-19 11:41:49,563] stat1008/INFO/locust.main: Run time limit set to 10 seconds
[2024-01-19 11:41:49,563] stat1008/INFO/locust.main: Starting Locust 2.17.0
[2024-01-19 11:41:49,564] stat1008/INFO/locust.runners: Ramping to 1 users at a rate of 1.00 per second
[2024-01-19 11:41:49,565] stat1008/INFO/locust.runners: All users spawned: {"ArticleDescriptionsUser": 1} (1 total users)
[2024-01-19 11:41:59,085] stat1008/INFO/locust.main: --run-time limit reached, shutting down
    
    LIFTWING LOAD TEST COMPLETED
    The most recent load test for this isvc has been analyzed and presents improvement in performance:
    Average Response Time   -33.5
    Failure Count             0.0
    Compared with the prior load test, the performance is better.
    Detailed results of the current and past load tests are located in: lw_stats_history.csv.

[2024-01-19 11:41:59,109] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                        # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/article-descriptions:predict                          2     0(0.00%) |   3472    3376    3568   3400 |    0.23        0.00
--------|----------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                       2     0(0.00%) |   3472    3376    3568   3400 |    0.23        0.00

Response time percentiles (approximated)
Type     Name                                                                50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/article-descriptions:predict                            3600   3600   3600   3600   3600   3600   3600   3600   3600   3600   3600      2
--------|--------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                         3600   3600   3600   3600   3600   3600   3600   3600   3600   3600   3600      2

2. lw_stats_history.csv: generated by lw_stats_history()

Timestamp,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
20240119114127,2,0,3437,3505.5,3437,3574,167.0,0.2337489125667437,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
20240119114159,2,0,3400,3472.0,3376,3568,167.0,0.2324304732612528,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
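
For illustration, lw_stats_analysis() can be approximated with pandas as below; the metrics compared and the report wording are inferred from the sample output above, so treat this as a sketch rather than the committed implementation:

import pandas as pd


def lw_stats_analysis(history_csv="lw_stats_history.csv"):
    """Compare the two most recent runs recorded in the history file and print a report."""
    history = pd.read_csv(history_csv)
    if len(history) < 2:
        print("Not enough load test runs in the history file to compare.")
        return
    current, previous = history.iloc[-1], history.iloc[-2]
    diffs = pd.Series({
        "Average Response Time": current["Average Response Time"] - previous["Average Response Time"],
        "Failure Count": float(current["Failure Count"] - previous["Failure Count"]),
    })
    improved = (diffs <= 0).all()
    print("\nLIFTWING LOAD TEST COMPLETED")
    print("The most recent load test for this isvc has been analyzed and presents "
          f"{'improvement' if improved else 'a regression'} in performance:")
    print(diffs.to_string())
    print(f"Compared with the prior load test, the performance is {'better' if improved else 'worse'}.")
    print(f"Detailed results of the current and past load tests are located in: {history_csv}.")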

I am going to discuss this prototype with the team so we can agree on what modifications are needed to fit our load testing workflow and analysis requirements.

Until now, this locust prototype has been using the same payload to run a load test against the article-descriptions model server. Since with wrk we were using multiple payloads read from an input file, I have updated article_descriptions.py to replicate this functionality using process_payload().

The prototype now uses the CSV reader locust plugin to iterate through the input data and pass each row to process_payload(), which generates a payload for every user request.
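
For reference, the updated script follows roughly this shape; the input file name (article_titles.csv) and its column layout are assumptions for the sketch, and it presumes the CSVReader from the locust-plugins package is installed in the venv (pip install locust-plugins):

from locust import HttpUser, task
from locust_plugins.csvreader import CSVReader

# Hypothetical input file where each row holds one language/title pair, e.g. "en,Clandonald".
article_reader = CSVReader("article_titles.csv")


def process_payload(row):
    """Build the request payload for one row of input data."""
    lang, title = row[0], row[1]
    return {"lang": lang, "title": title, "num_beams": 3}


class ArticleDescriptionsUser(HttpUser):
    host = "https://inference-staging.svc.codfw.wmnet:30443"

    @task
    def predict(self):
        # Each simulated user request consumes the next row of input data.
        payload = process_payload(next(article_reader))
        self.client.post(
            "/v1/models/article-descriptions:predict",
            json=payload,
            headers={
                "Host": "article-descriptions.experimental.wikimedia.org",
                "Content-Type": "application/json",
            },
        )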