
Document load test results
Open, Medium, Public, 3 Estimated Story Points

Description

As an engineer,
I would like to have all load test results for each model server in one place so that I have a reference for comparing results whenever I make changes and need to check whether latencies have increased.

A good place could be a Markdown file committed to the inference-services repository along with the corresponding input data and commands. That way it would also act as a runbook one can follow before/after deploying a model server if needed.

Event Timeline

calbon triaged this task as Medium priority. Nov 28 2023, 3:52 PM

I think we can close this task as it can be covered by the work done in https://phabricator.wikimedia.org/T348850

@isarantopoulos, I've been looking into locust as we discussed and was going to share my findings here, but noticed you were suggesting to close it.

This is how I was thinking about it:

  1. T348850 is used to choose the load test framework
  2. T351939 works on using the chosen framework to document the load test results and can be made a sub-task of T348850

Should I proceed to make T351939 a sub-task of T348850 and share my results here?

Sure! I just suggested closing it because it was going to be covered by the other task. But making it a subtask is even better!

Great! My exploration of locust mainly focused on how this tool would enable us to store and compare the results of our previous load tests. Here is the procedure I used to store the results of a load test:

1. Environment Setup:

Unlike wrk, which exists on the deploy2002 machine and has allowed us all to use the same test environment, locust is not installed on deploy2002. To keep our test results consistent, the team needs to settle on a single test machine (either stat1008 or ml-sandbox) and install locust in a Python venv. This uniformity will help us avoid variance caused by differences in environment setup.

I set up my testing environment on stat1008 as its response times were comparable to those of deploy2002:

$ python3 -m venv load_test_venv
$ source load_test_venv/bin/activate
$ pip install locust==2.17.0

2. Create Load Test Scripts:

I created article_descriptions.py, which uses locust to simulate the curl request below:

$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-descriptions:predict" -X POST -d '{"lang": "en", "title": "Clandonald", "num_beams": 3}' -H  "Host: article-descriptions.experimental.wikimedia.org" -H "Content-Type: application/json" --http1.1
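
For reference, a minimal locustfile along these lines reproduces that request; this is a sketch rather than the exact contents of article_descriptions.py (the user/task structure is an assumption):

from locust import HttpUser, task


class ArticleDescriptionsUser(HttpUser):
    # Base host of the LiftWing staging cluster, taken from the curl command above.
    host = "https://inference-staging.svc.codfw.wmnet:30443"

    @task
    def predict(self):
        # POST the same payload and headers as the curl command.
        self.client.post(
            "/v1/models/article-descriptions:predict",
            json={"lang": "en", "title": "Clandonald", "num_beams": 3},
            headers={
                "Host": "article-descriptions.experimental.wikimedia.org",
                "Content-Type": "application/json",
            },
        )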

3. Run Load Tests & Save Results:

locust comes with a convenient web UI that would let us run tests and download the results easily. Since we work primarily in the terminal and have no browser access on the ml-sandbox, we will mainly use the locust CLI in headless mode.

There are locust plugins available that facilitate result storage in various formats, some of which could even enable us to use these results in our Grafana dashboards.

locust also has a nifty --csv flag that makes it easy to save the results in CSV format. Here is how I used it:

  • created a locust.conf file containing all the flags, so they don't have to be specified on the command line (see the sketch below this list)
  • ran the load test by executing $ locust from the same directory as locust.conf
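
For illustration, a locust.conf along the following lines produces the behaviour described above; the specific values (users, spawn rate, run time) are illustrative assumptions, not necessarily the exact flags I used:

# locust.conf -- read automatically when `locust` is run from this directory
locustfile = article_descriptions.py
headless = true
users = 1
spawn-rate = 1
run-time = 10s
# Prefix for the generated CSV files (article_descriptions_stats.csv, etc.)
csv = article_descriptions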

The files produced were: article_descriptions_exceptions.csv, article_descriptions_stats_history.csv, article_descriptions_failures.csv, and article_descriptions_stats.csv

4. Next Steps:

The next step will be to work on comparing results from previous load tests.

I have been studying the files produced in T351939#9466020 and it looks like article_descriptions_stats.csv contains the fields that we'll want to compare:

Type,Name,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
POST,/v1/models/article-descriptions:predict,2,0,3505,3570.5,3505,3636,167.0,0.2448751660661948,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
,Aggregated,2,0,3505,3570.5,3505,3636,167.0,0.2448751660661948,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600

However, this file is regenerated every time we run a test, so we lose the data from the previous test.

To avoid losing this data, we might have to extract it and save it in a separate file along with a timestamp. This way, we can maintain a record of historical data and perform comparative analysis on it later.

locust has a test_stop event listener that can be used to read article_descriptions_stats.csv once it has been generated at the end of a load test run. I have updated article_descriptions.py to use this event hook to extract the "Aggregated" row from article_descriptions_stats.csv (as shown in T351939#9468383) and append it to lw_stats_history.csv; a sketch of the hook follows the table below. Here is what the contents of lw_stats_history.csv look like after running 3 load tests:

Timestamp,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
20240118151948,2,0,3528,3618.5,3528,3709,167.0,0.22888733537397923,0.0,3700,3700,3700,3700,3700,3700,3700,3700,3700,3700,3700
20240118152039,2,0,3500,3567.0,3490,3644,167.0,0.2291656581522369,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
20240118152356,2,0,3439,3772.0,3439,4105,167.0,0.2307884342326772,0.0,4100,4100,4100,4100,4100,4100,4100,4100,4100,4100,4100
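
For reference, the event hook follows roughly this pattern; this is a simplified sketch (using the csv module, as in the version described above) rather than the exact code committed to article_descriptions.py:

import csv
from datetime import datetime

from locust import events


@events.test_stop.add_listener
def lw_stats_history(environment, **kwargs):
    """Append the 'Aggregated' row of the per-run stats CSV to a history file."""
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    with open("article_descriptions_stats.csv", newline="") as stats_file:
        rows = list(csv.reader(stats_file))
    header, data = rows[0], rows[1:]
    aggregated = next(row for row in data if "Aggregated" in row)
    with open("lw_stats_history.csv", "a", newline="") as history_file:
        writer = csv.writer(history_file)
        if history_file.tell() == 0:
            # First run: write the header, replacing Type/Name with a Timestamp column.
            writer.writerow(["Timestamp"] + header[2:])
        writer.writerow([timestamp] + aggregated[2:])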

The next step will be to run a comparative analysis on this data.

In order to compare the historical data from T351939#9469592, I updated article_descriptions.py with lw_stats_analysis() and changed lw_stats_history() to use pandas instead of the csv module (a sketch of the analysis function appears after the output below). Below is what the comparison report looks like for a given lw_stats_history.csv.

1. Comparison Report: generated by lw_stats_analysis() to add an easily digestible, LiftWing-specific analysis when a locust load test run has completed.

(load_test_venv) kevinbazira@stat1008:~/liftwing_load_tests/article_descriptions$ locust
[2024-01-19 11:41:49,562] stat1008/INFO/locust.main: Python 3.7 support is deprecated and will be removed soon
[2024-01-19 11:41:49,563] stat1008/INFO/locust.main: Run time limit set to 10 seconds
[2024-01-19 11:41:49,563] stat1008/INFO/locust.main: Starting Locust 2.17.0
[2024-01-19 11:41:49,564] stat1008/INFO/locust.runners: Ramping to 1 users at a rate of 1.00 per second
[2024-01-19 11:41:49,565] stat1008/INFO/locust.runners: All users spawned: {"ArticleDescriptionsUser": 1} (1 total users)
[2024-01-19 11:41:59,085] stat1008/INFO/locust.main: --run-time limit reached, shutting down
    
    LIFTWING LOAD TEST COMPLETED
    The most recent load test for this isvc has been analyzed and presents improvement in performance:
    Average Response Time   -33.5
    Failure Count             0.0
    Compared with the prior load test, the performance is better.
    Detailed results of the current and past load tests are located in: lw_stats_history.csv.

[2024-01-19 11:41:59,109] stat1008/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                        # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
POST     /v1/models/article-descriptions:predict                          2     0(0.00%) |   3472    3376    3568   3400 |    0.23        0.00
--------|----------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                       2     0(0.00%) |   3472    3376    3568   3400 |    0.23        0.00

Response time percentiles (approximated)
Type     Name                                                                50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs
--------|--------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
POST     /v1/models/article-descriptions:predict                            3600   3600   3600   3600   3600   3600   3600   3600   3600   3600   3600      2
--------|--------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------
         Aggregated                                                         3600   3600   3600   3600   3600   3600   3600   3600   3600   3600   3600      2

2. lw_stats_history.csv: generated by lw_stats_history()

Timestamp,Request Count,Failure Count,Median Response Time,Average Response Time,Min Response Time,Max Response Time,Average Content Size,Requests/s,Failures/s,50%,66%,75%,80%,90%,95%,98%,99%,99.9%,99.99%,100%
20240119114127,2,0,3437,3505.5,3437,3574,167.0,0.2337489125667437,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
20240119114159,2,0,3400,3472.0,3376,3568,167.0,0.2324304732612528,0.0,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600,3600
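
For illustration, lw_stats_analysis() can be approximated with pandas as below; the metrics compared and the report wording are inferred from the sample output above, so treat this as a sketch rather than the committed implementation:

import pandas as pd


def lw_stats_analysis(history_csv="lw_stats_history.csv"):
    """Compare the two most recent runs recorded in the history file and print a report."""
    history = pd.read_csv(history_csv)
    if len(history) < 2:
        print("Not enough load test runs in the history file to compare.")
        return
    current, previous = history.iloc[-1], history.iloc[-2]
    diffs = pd.Series({
        "Average Response Time": current["Average Response Time"] - previous["Average Response Time"],
        "Failure Count": float(current["Failure Count"] - previous["Failure Count"]),
    })
    improved = (diffs <= 0).all()
    print("\nLIFTWING LOAD TEST COMPLETED")
    print("The most recent load test for this isvc has been analyzed and presents "
          f"{'improvement' if improved else 'a regression'} in performance:")
    print(diffs.to_string())
    print(f"Compared with the prior load test, the performance is {'better' if improved else 'worse'}.")
    print(f"Detailed results of the current and past load tests are located in: {history_csv}.")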

I am going to discuss this prototype with the team so we can agree on what modifications are needed to fit our load testing workflow and analysis requirements.

Until now, this locust prototype has been using the same payload to run a load test against the article-descriptions model server. Since with wrk we were using multiple payloads read from an input file, I have updated article_descriptions.py to replicate this functionality using process_payload().

The prototype now uses the CSV reader locust plugin to iterate through the input data and pass each row to process_payload(), which generates a payload for every user request.
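
For reference, the updated script follows roughly this shape; the input file name (article_titles.csv) and its column layout are assumptions for the sketch, and it presumes the CSVReader from the locust-plugins package is installed in the venv (pip install locust-plugins):

from locust import HttpUser, task
from locust_plugins.csvreader import CSVReader

# Hypothetical input file where each row holds one language/title pair, e.g. "en,Clandonald".
article_reader = CSVReader("article_titles.csv")


def process_payload(row):
    """Build the request payload for one row of input data."""
    lang, title = row[0], row[1]
    return {"lang": lang, "title": title, "num_beams": 3}


class ArticleDescriptionsUser(HttpUser):
    host = "https://inference-staging.svc.codfw.wmnet:30443"

    @task
    def predict(self):
        # Each simulated user request consumes the next row of input data.
        payload = process_payload(next(article_reader))
        self.client.post(
            "/v1/models/article-descriptions:predict",
            json=payload,
            headers={
                "Host": "article-descriptions.experimental.wikimedia.org",
                "Content-Type": "application/json",
            },
        )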