Establish performance of the session storage service
Closed, ResolvedPublic
Actions

Description

Prior to moving to production, we should benchmark (and document) the performance of the session storage service under mixed workloads. This information should prove invaluable in validating the implementation, providing a baseline for any subsequent optimizations, and in future capacity planning.

NOTE: Current estimations put production request rates (per-datcenter) at ~30k reads/sec, ~100 writes/sec, and total data set size at ~3G (uncompressed).

Environment

Cluster
- sessionstore1001
- sessionstore1002
- sessionstore1003
- sessionstore2001
- sessionstore2002
- sessionstore2003

Each machine is a dual Intel Xeon Silver 4110 2.1G (8C, 16T) w/ 64G RAM, 2 @ 128G SSDs, and 1 gbit NIC.

NOTE: While this is the production cluster, it is otherwise not being utilized at the time of testing.

Every node runs a single instance of Cassandra 3.11.2 (6 node cluster). All Cassandra data (commitlog, sstables, etc) shares a RAID-1.

Kask is run from screen session on sessionstore1001, port 8080.

wrk is executed from sessionstore1002. A Lua script is used to create a randomized mixed workload from a pregenerated JSON-formatted file:

$ wrk --latency -t8 -c2048 -d10m -s multi-request-json.lua http://sessionstore1001.eqiad.wmnet:8080
...

Results

Threads	Concurrency	Size (k/v)	Ratio (r/w)	Throughput	50p latency	99p latency	Errors
8	1024	8/16	100:1	52610/s	20.76ms	39.01ms	0
8	2048	8/16	100:1	71899/s	37.94	245.18ms	620 (0.001%)
8	1024	32/128	100:1	52343/s	21.75ms	40.50ms	0
8	2048	32/128	100:1	67877/s	38.33ms	228.67ms	2160 (0.005%)
8	1024	32/2048	100:1	34641/s	38.28ms	140.04ms	0
8	2048	32/2048	100:1	33892/s	48.97ms	376.76ms	0
8	3072	32/2048	100:1	34832/s	72.68ms	572.14ms	773987 (3.7%)

Caveats, Observations & Comments

Errors in all cases above were the result of Cassandra timeouts
No effort has (yet) been made to tune Cassandra for this workload
Kask and wrk are co-located on two of the Cassandra nodes; Resource contention between the database, application, and benchmark-er influences the results
As noted in the comments, latency distribution would suggest that something adds ~40ms to ~1/2 of requests (see Figure 1)


Figure 1: Latency distributions

Related Objects
Search...

Status	Assigned	Task
Resolved	aaron	T88445 MediaWiki active/active datacenter investigation and work (tracking)
Resolved	Eevans	T206016 Create a service for session storage
Resolved	Eevans	T221292 Establish performance of the session storage service
Resolved	Eevans	T229697 Investigate Kask request latency
Resolved	Dzahn	T230178 Install wrk, siege and lua-cjson packages on deploy1001
Resolved	Eevans	T230515 Grafana dashboards for sessionstore, k8s staging, are not working

Event Timeline

Eevans created this task.Apr 17 2019, 8:54 PM

Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptApr 17 2019, 8:54 PM

Eevans triaged this task as Medium priority.Apr 17 2019, 8:54 PM

Eevans mentioned this in T209099: Establish baseline performance of Python/WSGI frameworks.Apr 17 2019, 8:58 PM

Eevans updated the task description. (Show Details)

Details on test methodology, and some initial results are forthcoming, but to elaborate on comments made elsewhere:

wrk produces latency information that includes 50, 75, 90, and 99p quantiles, for example:

Running 10m test @ http://sessionstore1001.eqiad.wmnet:8080
  8 threads and 1028 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    23.26ms   24.29ms 472.78ms   96.41%
    Req/Sec     6.00k   611.79     7.69k    78.24%
  Latency Distribution
     50%   30.72ms
     75%   38.03ms
     90%   40.84ms
     99%   98.65ms
  28669438 requests in 10.00m, 3.23GB read
Requests/sec:  47774.78
Transfer/sec:      5.51MB

But that doesn't paint a complete picture. Looking at the distribution of request latency according to the Prometheus metrics (which are consistent with the above), you can see that ~50% of the requests are at or below 10ms (~30% are at or below 2.5ms). Most the remaining are in the 50ms range. Clearly there is something occurring in Kask that adds (an oddly consistent) 40-50ms on roughly half the requests. That will need to be figured out, but this is worth thinking about in terms of what/how we measure latencies.

Screenshot_2019-04-18 Untitled spreadsheet.png (641×1 px, 96 KB)

Eevans updated the task description. (Show Details)Apr 18 2019, 7:53 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 3:47 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 3:51 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 4:47 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 8:27 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 8:35 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 8:52 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 9:15 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 9:59 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 10:10 PM

Eevans updated the task description. (Show Details)Apr 24 2019, 10:14 PM

Eevans moved this task from Backlog to In-Progress on the User-Eevans board.Apr 25 2019, 3:48 PM

@Eevans should we break out a separate task for identifying the cause of the extra 40-50ms delay?

Also, have you broken out GET/SET/DELETE requests?

In T221292#5144480, @EvanProdromou wrote:

@Eevans should we break out a separate task for identifying the cause of the extra 40-50ms delay?

We can; If it would be helpful to you

In T221292#5144561, @EvanProdromou wrote:

Also, have you broken out GET/SET/DELETE requests?

I have not.

It seems like where your performance focus is right now, so let's call it out.

Also, is this task done when we're within the boundaries of T211721 or is the task just to have a tool to do measurements?

@Eevans maybe the mystery of the longer requests is just that they're writes instead of reads?

In T221292#5144700, @EvanProdromou wrote:

Also, is this task done when we're within the boundaries of T211721 or is the task just to have a tool to do measurements?

I honestly don't know.

T211721 ought to have been done as a prerequisite to building the service (to inform the implementation). That's why it was opened. It should have been based on our expectations for page latency, perhaps with some cost-benefit element, weighing the hit in latency we were willing to accept in exchange for the feature. Instead, it sort of devolved into an exercise of trying to guess where we'd end up, working backward to plug those numbers into the SLA. If that's the still the plan then yes, I guess we finish this issue and plug the resulting numbers into T211721 for our SLA.

In T221292#5145019, @EvanProdromou wrote:

@Eevans maybe the mystery of the longer requests is just that they're writes instead of reads?

That's not it.

In T221292#5148530, @Eevans wrote:

In T221292#5145019, @EvanProdromou wrote:

@Eevans maybe the mystery of the longer requests is just that they're writes instead of reads?

That's not it.

Cool. Looks like you've got some middleware for prometheus instrumentation in there, so I guess a next step would be figuring out if the difference shows up there, too.

In T221292#5148528, @Eevans wrote:

In T221292#5144700, @EvanProdromou wrote:

Also, is this task done when we're within the boundaries of T211721 or is the task just to have a tool to do measurements?

I honestly don't know.

OK. Let's say that it's done when you've separated out GET/POST/DELETE.

T211721 ought to have been done as a prerequisite to building the service (to inform the implementation). That's why it was opened. It should have been based on our expectations for page latency, perhaps with some cost-benefit element, weighing the hit in latency we were willing to accept in exchange for the feature. Instead, it sort of devolved into an exercise of trying to guess where we'd end up, working backward to plug those numbers into the SLA. If that's the still the plan then yes, I guess we finish this issue and plug the resulting numbers into T211721 for our SLA.

I don't think it makes sense to go backwards from what we have to what we want.

There are some numbers here. They're pretty high, but not as high as the ones you started off with. I think we're closing in on what's acceptable.

And I agree that getting those numbers sooner and clearer would have been great. That's my fault. For future services, I'm going to make sure we have performance expectations ahead of time, rather than after the implementation starts.

akosiaris mentioned this in T220401: Introduce kask session storage service to kubernetes.Jun 1 2019, 10:07 AM

• WDoranWMF removed a project: Platform Team Legacy (Next).Jul 26 2019, 5:47 PM

• WDoranWMF moved this task from Multi-DC (TEC1) to mop on the Platform Engineering board.Jul 26 2019, 6:47 PM

• WDoranWMF edited projects, added Core Platform Team Initiatives (Multi-DC (TEC1)); removed Platform Engineering (Multi-DC (TEC1)).

EvanProdromou added a project: Platform Team Workboards (Green).Jul 30 2019, 4:59 PM

Krinkle added a project: Performance-Team (Radar).Aug 2 2019, 4:44 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

• WDoranWMF moved this task from Backlog to Next Sprint on the Platform Team Workboards (Green) board.Aug 2 2019, 4:48 PM

• WDoranWMF moved this task from Next Sprint to Ready to Deploy on the Platform Team Workboards (Green) board.Aug 2 2019, 4:52 PM

Eevans mentioned this in T229697: Investigate Kask request latency.Aug 2 2019, 8:36 PM

• WDoranWMF moved this task from Ready to Deploy to Done on the Platform Team Workboards (Green) board.Sep 3 2019, 8:54 PM

CCicalese_WMF moved this task from Multi-DC (TEC1) to Session Management Service (CDP2) on the Core Platform Team Initiatives board.Mar 24 2020, 10:22 PM