
Reduce performance impact of a large number of concurrent users
Closed, Resolved · Public · Spike

Description

When we deploy our eligible-user notification (T132084), thousands of users will receive a notification directing them to the library. This could be as many as a thousand users per day. We need to evaluate the potential performance impact of this, and whether there are any high priority performance improvements we should make in advance.

Here, high priority refers to issues that might impede users' ability to access or navigate the tool. We don't need a list of every potential performance improvement - just focus on the high priority workflows (e.g. logging in, accessing and using My Library, navigating to resources through EZProxy).

Questions to explore

  • Will the Library Card be able to support dozens or even hundreds of concurrent users?
  • Are there obvious performance bottlenecks which might cause us to run into problems?

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Apr 1 2021, 10:41 AM
Samwalton9 renamed this task from Evaluated expected performance impact of a large amount of concurrent users [4hr] to Evaluate expected performance impact of a large amount of concurrent users [4hr]. Jun 9 2021, 1:01 PM

WIP PR: https://github.com/WikipediaLibrary/TWLight/pull/789

I believe I got myself tightly rate-limited while working on this, due to the traffic I generated against meta during login. Oopsie.

jsn.sherman renamed this task from Evaluate expected performance impact of a large amount of concurrent users [4hr] to test performance impact of a large amount of concurrent users. Aug 17 2021, 6:25 PM
jsn.sherman removed a project: Spike.

Update:
I've moved this away from a spike since I'm now into the "writing code to test this" phase.
Today I've been working on managing the session state between logging in on meta and having a session on the platform.
There is still plumbing and debugging to do there, as we're not staying logged in consistently, but I'm making progress.

today's update

I have:

  • the test code successfully using the front page login URL, logging in via meta, and then applying to all partners (a rough sketch of this flow is included after the questions below).
  • a few other less interesting tasks in place

I'm working on:

  • adding guard clauses to make sure it stops in various places if it finds itself redirected to meta unexpectedly
  • withdrawing applications

I'm planning on:

  • allowing for an arbitrary set of user accounts
  • having the user use filters in my_library
  • posting comments on applications

questions:

  • are there any other specific editor workflows the team would like to see tested?
  • For ezproxy access, we'd need to make some eligible test accounts, maybe by adding a boolean for forcing eligibility for an account. Other thoughts about how to tackle that?
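For reference, the rough shape of that login-and-guard flow looks something like the sketch below, assuming Locust as the load-testing tool. The paths, helper names, and flow details are assumptions for illustration, not the actual code in the PR (the real login also posts credentials to meta's login form, which is elided here).

```
# Minimal sketch only -- paths and helper names are assumptions, not the
# actual test code in the PR.
from locust import HttpUser, task, between


class EditorUser(HttpUser):
    wait_time = between(1, 5)

    def on_start(self):
        # Start from the front-page login URL; the client follows the
        # redirects out to meta and back. (The real test also posts
        # credentials to meta's login form, elided here.)
        self.client.get("/oauth/login/", allow_redirects=True)

    def check_not_meta(self, response, step):
        # Guard clause: once logged in, landing back on meta means the
        # session was lost, so record a failure instead of quietly
        # generating misleading load.
        if "meta.wikimedia.org" in response.url:
            response.failure(f"unexpectedly redirected to meta during {step}")
            return False
        return True

    @task
    def apply_to_partners(self):
        with self.client.get("/partners/", catch_response=True) as response:
            if self.check_not_meta(response, "partners list"):
                response.success()
```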

Okay, I now have:

  • support for an arbitrary set of user accounts
  • failures raised when the test finds itself redirected to meta unexpectedly
  • withdrawing applications

Withdrawing applications and following EZProxy URLs both require HTML parsing, so I'm streaming and parsing those pages line by line to keep client memory usage in check, since we want to spin up many of these clients simultaneously.
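To illustrate the streaming approach, here's a minimal sketch; the regex target (Django's csrfmiddlewaretoken field) and helper name are assumptions, and the real test parses for whatever form fields it actually needs.

```
# Illustrative sketch of per-line streaming; the regex target and helper
# name are assumptions, not the exact parsing the test does.
import re

CSRF_RE = re.compile(r'name="csrfmiddlewaretoken" value="([^"]+)"')


def find_csrf_token(client, path):
    # Stream the page and scan it line by line so the full body never sits
    # in memory; with many simulated clients this keeps the load
    # generator's footprint small.
    response = client.get(path, stream=True)
    try:
        for line in response.iter_lines(decode_unicode=True):
            match = CSRF_RE.search(line or "")
            if match:
                return match.group(1)
    finally:
        response.close()
    return None
```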

I think I'll skip the commenting-on-applications workflow for now so that I can move on to the EZProxy bits I can work on while waiting to hear back from OCLC.

Okay, the test now follows EZProxy access URLs and downloads the page after all the redirects finish. It doesn't download any resources from that page or crawl links, but I figure that's a good place to stop until after an initial test.
I heard back from OCLC and they verified that the preprod environment uses separate (but identically specced) resources from production, which makes it a good target for the staging load test.

I was pulling my hair out trying to figure out why some of my tests weren't working properly this morning.
It turns out I had crossed some thresholds for requiring captcha input for login, which of course my test doesn't even try to handle.
Manually logging in on the impacted account and filling out the captcha resolves the issue for the moment.
The real fix would include improving my failed login detection, and probably session storage so that I'm not logging back in each time I start a test.
For now, we'll just try to skate past so we can run the test today.
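As a sketch of that session-storage idea (not implemented in the branch; the file name and structure are assumptions), it could be as simple as persisting the client's cookies between runs:

```
# Sketch of cookie persistence between test runs -- not implemented in the
# branch; the file name and structure are assumptions.
import json
from pathlib import Path

COOKIE_FILE = Path("twlight_test_cookies.json")


def save_cookies(session):
    # Dump the requests session's cookies so the next run can reuse them.
    COOKIE_FILE.write_text(json.dumps(session.cookies.get_dict()))


def load_cookies(session):
    # Reload saved cookies; returns True if we might already be logged in.
    if not COOKIE_FILE.exists():
        return False
    for name, value in json.loads(COOKIE_FILE.read_text()).items():
        session.cookies.set(name, value)
    return True
```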

Okay, after numerous little issues, we were able to run a series of tests against staging today.
It's clear that staging needs some configuration changes to fully utilize the system resources: we're only able to serve about 5 requests per second, and we're only loading up 1 gunicorn worker. That's not the behavior on production (just looking at the process list), so I'll go through, identify the differences, and make whatever configuration changes are needed to bring staging to parity.

A few updates:

  • I have resolved numerous configuration differences, documentation problems, and even bugs in our staging environment. Chalk it up to technical debt left over from doing a breakfix redeployment of production last year without following up with a measured documentation pass and a staging redeployment to verify things.
  • I identified the source of some of the errors that were being caught yesterday: all of the accounts I'm using for testing are now being flagged for captcha completion, which results in a soft failure. I'm at least catching those now, so we're properly recording what's happening, and I currently allow the tests to continue so we can still generate some load. We probably want to find out whether we can get some accounts or IP addresses whitelisted to avoid the captcha during the test.

Another option for the captcha would be to present the images in the terminal, pause for user input, and include that input in the Locust login form POST. It would probably take me a day to sort out properly.
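Roughly, that option might look like the sketch below; it's not implemented, the form field name is hypothetical, and rendering the image directly in the terminal would need an extra library, so this just opens it locally.

```
# Rough sketch only -- not implemented. The form field name is hypothetical,
# and true in-terminal image rendering would need an extra library.
import webbrowser


def solve_captcha_interactively(captcha_image_url, login_form):
    # Show the captcha image locally, pause the test for a human answer,
    # and fold that answer into the Locust login form POST.
    webbrowser.open(captcha_image_url)
    login_form["captcha_answer"] = input("Captcha answer: ")  # hypothetical field
    return login_form
```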

I just watched a new deployment roll across staging while I was running a load test.

The good news is that those deployment-related processes (such as backups) are running at a higher priority than the web workers, so having a fully loaded system will not break backups.

The bad news is that something has to give in that situation, so 502 errors were returned.

Samwalton9 renamed this task from test performance impact of a large amount of concurrent users to Test performance impact of a large amount of concurrent users. Sep 21 2021, 9:23 AM
Samwalton9 updated the task description.

Just noting here that per T290580 we now have an increased quota to do further testing.

Samwalton9 renamed this task from Test performance impact of a large amount of concurrent users to Test performance impact of a large number of concurrent users. Sep 23 2021, 11:11 AM

Per T292100 we can now try resizing staging with the available instance flavors.

  • resized the staging instance to g3.cores16.ram16.disk20 and verified that it's happy
  • rebased the branch and am rebuilding the staging Docker image; will resume testing once the updated image is deployed.

Updates:
The consistently slowest routes under our control that we've been testing are /users/my_library/ and /partners/, so those are what I'll mostly talk about here.

You can see the gunicorn worker values I landed on at the end of last week here:
https://github.com/WikipediaLibrary/TWLight/pull/789/commits/dc63b59f278daee87e9f9eaf44c099df87c18a42
I basically stopped using multiple threads per process, because threading isn't particularly performant here due to the GIL. The tradeoff is more memory usage, which the current instance size supports.
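For reference, the general shape of that configuration is below; the values are illustrative only, not the ones in the linked commit.

```
# gunicorn.conf.py -- illustrative values only; the settings actually landed
# are in the commit linked above.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # rule-of-thumb worker count
threads = 1             # one thread per process; sidesteps GIL contention
worker_class = "sync"   # plain sync workers, scaled out by process count
timeout = 60            # seconds before a stuck worker is recycled
```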

With 100 simultaneous users, this got us the following average load times (ms):

  • /partners/: 2695
  • /users/my_library/: 3021

To put this into perspective, I'm getting an average 2500 ms response time for logging into meta in these 100-user tests.

We were only sustaining about 70% CPU usage with this configuration, and I couldn't get it to go any higher, so today I've been looking at the database side of things.
I initially tried increasing the thread concurrency for the innodb process, but found that the extra workers weren't getting fed, so there was no reason to set values there yet.

innodb_thread_concurrency = 0
innodb_read_io_threads = 8
innodb_write_io_threads = 8

So I then set up the debug toolbar, using Scardenasmolinar's work in wikilink as a guide:
https://github.com/WikipediaLibrary/TWLight/pull/789/commits/5568ed9ebce8ca515fa5d777125379fce5fcdb61

I found that /users/my_library/ and /partners/ were firing hundreds of queries, many of them dependent (meaning query 2 might depend on the response to query 1), so I did a first pass to slim those down with additional foreign-key selects and many-to-many prefetching.
In practice, this results in fewer, more complex queries, without any wait time for dependent data.
As an example, my_library previously ran over 300 queries and is currently down to 91.
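The general shape of that change looks like the sketch below; the model, field, and import names are stand-ins for illustration, not the exact querysets in the linked commit.

```
# Illustrative only -- model, field, and import names are stand-ins, not the
# exact TWLight querysets changed in the linked commit.
from applications.models import Application  # hypothetical import path

# Before: each time the template touches application.partner or
# application.editor, Django fires another query (the classic N+1 pattern).
applications = Application.objects.filter(editor=editor)

# After: join the foreign keys into the same query and prefetch the
# many-to-many data up front, so rendering never waits on dependent queries.
applications = (
    Application.objects.filter(editor=editor)
    .select_related("partner", "editor")
    .prefetch_related("partner__languages")
)
```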

https://github.com/WikipediaLibrary/TWLight/pull/789/commits/6d02295e86eb01f33286f0f40d8e166a6d81cefe
This brought us down to:

  • /partners/: 2039
  • /users/my_library/: 2219

I'm not too worried about partners right now, but will still include it in results. I am going to drop the proxied URLs from the tests, since including them has the side effect of reducing the load I'm actually putting on the platform and slowing down each test iteration.

My next steps will be:

  • One more pass on /users/my_library/ queries since that's a hot path for authenticated users
  • A first pass on authenticated caching, most likely within django
  • A followup look at the system load to see if we can either use more CPU or trim back the instance size while maintaining performance

Thanks for the update, this is all super helpful for understanding what's happening :)

As you note, /partners/ isn't a critical path for us. Users coming to the site from the notification have no clear links to navigate there. It's primarily a legacy path for some inbound links for users who aren't eligible. my_library is definitely the priority view.

I completed a subsequent DB performance pass on /users/my_library/ and now have it down to 22 queries, rendering the page in about a tenth of the time it initially took on my local machine.
I'm doing a db restore on staging and will be doing some manual testing to verify everything still looks right. Then I'll do another load test to see where we're at before starting on caching.

... that is, I will be testing on staging after I figure out how I've broken Travis builds on this branch. I'm guessing it's related to the quick-and-dirty way I added the Django debug toolbar.

@jsn.sherman I might have an idea why Travis is broken. Working on a PR now

We verified that @Scardenasmolinar was having a separate issue, and I now have staging images building correctly again. It was due to the way I was setting up the debug_toolbar, which I have now changed.
I just got through another round of load testing with 100 users, and the numbers are encouraging. Note that I did add a simple but high-impact improvement to the partners view, as it was a very obvious (after working on my_library) one-line change:

| Method | Name | Average (ms) |
| GET | /partners/ | 1524 |
| GET | /users/my_library/ | 1771 |
| POST | https://meta.wikimedia.org/w/index.php | 2521 |

On to caching!
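A minimal sketch of what per-user caching with a timeout can look like is below; the key scheme and helper are assumptions, not necessarily how the merged code implements it.

```
# Sketch of per-user, low-level caching with a timeout -- names and approach
# are assumptions, not necessarily what the merged PR does.
from django.core.cache import cache

CACHE_TIMEOUT = 300  # seconds; expired entries produce the "cold cache" loads


def get_my_library_context(editor):
    key = f"my_library_context_{editor.pk}"
    context = cache.get(key)
    if context is None:
        context = build_my_library_context(editor)  # hypothetical helper
        cache.set(key, context, CACHE_TIMEOUT)
    return context
```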

Merged https://github.com/WikipediaLibrary/TWLight/pull/789, but keeping this in the review column because more work needs to be done (T294023)

Results of testing latest caching code:

| Method | Name | Average (ms) |
| GET | /partners/ | 1162 |
| GET | /users/my_library/ | 1008 |
| POST | https://meta.wikimedia.org/w/index.php | 3292 |

I ran this test long enough to make sure there were plenty of cache timeouts along the way, meaning there is a good mix of cold and warm cache loads represented.

jsn.sherman renamed this task from Test performance impact of a large number of concurrent users to Reduce performance impact of a large number of concurrent users. Oct 26 2021, 7:34 PM

I'm moving this to review since there seems to be no more work to do; we just need to check that production is responding well to all the performance changes and is scaling properly.