Conduct basic load-test experiments for RESTRouter in k8s
Closed, Resolved. Public. 0 Estimated Story Points

Assigned To

Authored By

	• mobrovac
	Jun 25 2019, 5:29 PM

Description

Before we can start the deployment of RESTRouter, we need to determine the CPU and memory constraints to impose on each instance pod. This can be done by running the image that gets built as part of the pipeline job (cf. T226536) with the initial Helm chart locally in minikube. For a proper set-up to conduct the experiments, see the benchmark wiki page as well as the P8425 script.

Because RESTRouter contacts a considerable number of back-end services, the challenge here is to have a realistic experimental set-up. To do that, I propose to configure the local RESTRouter instance to issue requests to back-end services located in Beta. This will give us a pathological worst-case scenario when it comes to memory pressure. However, an open question is whether the back-end RESTBase service in Beta should also be used. If so, then it would need to be modified (locally, in-place) to allow external requests to reach the /{domain}/v1/key_value/ hierarchy for the duration of the experiments. Alternatively, a local RESTBase back-end instance can be used for this purpose.

Once the experiments have given us some data, we should incorporate the findings into the Helm chart (https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/512923/) by adjusting the resource requests and the respective pod limits.

Related Objects

Status	Assigned	Task
Resolved	• WDoranWMF	T220449 Split RESTBase in two services: storage service and API router/proxy
Resolved	akosiaris	T198901 Migrate production services to kubernetes using the pipeline
Resolved	akosiaris	T228676 Self-service Deployment Pipeline
Declined	akosiaris	T223953 Deploy the RESTBase front-end service (RESTRouter) to Kubernetes
Resolved	• Pchelolo	T226538 Conduct basic load-test experiments for RESTRouter in k8s
Resolved	• Pchelolo	T226536 Trigger RESTRouter image builds on push/tag

Event Timeline

• mobrovac triaged this task as High priority. Jun 25 2019, 5:29 PM

• mobrovac created this task.

Restricted Application added a subscriber: Aklapper. Jun 25 2019, 5:29 PM

• mobrovac added a parent task: T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes. Jun 25 2019, 5:29 PM

• mobrovac added a subtask: T226536: Trigger RESTRouter image builds on push/tag.

I think that for RESTBase simply going through all the endpoints is the wrong approach. We need to test different behaviours/codepaths of RESTRouter, not the exact same codepath with different data. So, instead, I'm going to go through the following behaviours:

Fetching HTML and Summary from storage - the most common codepath
Fetching HTML and Summary from storage with no-cache - most common update codepath
Fetching math formulae - very common read path
Fetching PDF - extra long backend response time
Fetching HTML and Summary with non-standard language variant - multi-step codepath with many backend services contacted
Fetching some PCS content - simple proxy with no storage and reasonably quick backend response time
Transform endpoint

I believe that these tests should be more than enough to figure out the initial limits. We can adjust as we go.
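
A minimal sketch of how one such behaviour could be exercised at a given concurrency (C) and request count (n). The run_scenario and summarize helpers and the injected fetch callable are hypothetical and only illustrate the idea; the actual experiments relied on the set-up from the benchmark wiki page and the P8425 script.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_scenario(fetch, concurrency, total_requests):
    """Call fetch() total_requests times using `concurrency` threads.

    `fetch` is any zero-argument callable that performs one request.
    It is injected so the driver makes no assumption about URLs.
    Returns the per-request latencies in seconds.
    """
    def timed_call(_):
        start = time.perf_counter()
        fetch()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total_requests)))


def summarize(latencies):
    """Reduce a list of latencies to a few headline numbers."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "mean_s": mean(ordered),
        "max_s": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in for a real HTTP request (hypothetical placeholder).
    def fake_fetch():
        time.sleep(0.001)

    for concurrency in (1, 30):
        stats = summarize(run_scenario(fake_fetch, concurrency, 100))
        print(concurrency, stats)
```

Passing the request as a callable keeps the driver identical across scenarios, so only the codepath under test changes between runs.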

Fetching HTML from storage, C1-30, n1000

Screen Shot 2019-07-08 at 2.38.03 PM.png (1×1 px, 265 KB)

Fetching summary from storage, C1-30, n1000

Screen Shot 2019-07-08 at 2.59.40 PM.png (1×2 px, 269 KB)

Fetching HTML with no-cache, C1-5-30, n100

Screen Shot 2019-07-08 at 3.20.47 PM.png (1×1 px, 162 KB)

Fetching summary with no-cache, C1-5-30, n500

Screen Shot 2019-07-08 at 3.28.16 PM.png (972×298 px, 34 KB)

These suggest quite a strong and obvious pattern: the more we wait for backends to generate the content, the less memory/CPU we require.

These also agree quite well with what we're seeing in production with the real traffic. I will continue the experiments, but judging from what production sees, I think that 1 CPU per worker and 750M per worker should be reasonable limits to begin with.

• Pchelolo closed subtask T226536: Trigger RESTRouter image builds on push/tag as Resolved. Jul 9 2019, 3:45 PM

After more load testing, here are the numbers I propose, with explanations:

num_workers: 2

Starting up RESTRouter takes time. Quite a long time. So, we want to lower the probability of having a dangling master, thus 2 workers.

requests.cpu=1600m

According to load testing, we have 2 fundamentally different kinds of requests - those that hit storage and those that just proxy to backend services - and they have completely different CPU requirements. When we serve requests from storage, we can max out the CPU (1 CPU per worker) given appropriate load, while requests that are just a proxy use hardly any CPU since we're mostly waiting for IO. In production 80% of requests are served from storage, thus if we had the minimum viable number of pods (with zero spare capacity) we would run at about 80% CPU. The hard limit on possible CPU per pod is 2000m since Node.js is single-threaded and we have 2 workers per pod. Thus, 2000m * 0.8 = 1600m.
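
The arithmetic above can be sketched as a small function. The function name and parameters are hypothetical; the sketch assumes, as stated above, that each single-threaded worker can use at most 1000m and that only storage-backed requests consume significant CPU.

```python
def cpu_request_millicores(num_workers, storage_percent, per_worker_max_m=1000):
    """Estimate requests.cpu for one pod, in millicores.

    num_workers      -- workers per pod (each is single-threaded)
    storage_percent  -- share of requests served from storage, 0-100
    per_worker_max_m -- CPU ceiling of one worker in millicores
    """
    pod_ceiling_m = num_workers * per_worker_max_m
    # Proxy-only requests are assumed to need negligible CPU.
    return pod_ceiling_m * storage_percent // 100


print(cpu_request_millicores(2, 80))  # 1600
```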

requests.memory=800Mi

Currently in production the mean memory consumption is 400Mi per worker. Thus, 2 workers per pod * 400Mi = 800Mi.

limits.cpu=2

The hard limit for CPU consumption of a node worker is 1, thus 2 workers = 2.

limits.memory=1500Mi

service-runner kills a RESTBase worker when it reaches 700Mi. We're running 2 workers, plus a little room for the master.
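
As a sketch, the proposed numbers can be assembled into the usual Kubernetes requests/limits structure. The pod_resources function and its parameter names are hypothetical and do not reflect the actual Helm chart values layout; the 100Mi of master headroom is an assumption inferred from 1500Mi - 2 * 700Mi.

```python
def pod_resources(num_workers=2,
                  mean_worker_mem_mi=400,   # mean per-worker usage in production
                  worker_kill_mem_mi=700,   # service-runner per-worker kill threshold
                  master_headroom_mi=100,   # assumed: 1500 - 2 * 700
                  cpu_request_m=1600):
    """Build a requests/limits dict from the per-worker figures."""
    return {
        "requests": {
            "cpu": f"{cpu_request_m}m",
            "memory": f"{num_workers * mean_worker_mem_mi}Mi",
        },
        "limits": {
            # One full CPU per single-threaded worker.
            "cpu": str(num_workers),
            "memory": f"{num_workers * worker_kill_mem_mi + master_headroom_mi}Mi",
        },
    }


print(pod_resources())
```

Keeping the per-worker figures as parameters makes it easy to recompute the pod values if num_workers changes.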

I did a bunch more experiments with various endpoints from T226538#5314652 and, as I expected, the results are pretty much the same. I think there's no longer any value in testing even more endpoints. Let's try going with the numbers from T226538#5318448.

	F29710610: Screen Shot 2019-07-08 at 2.38.03 PM.png
	Jul 8 2019, 7:30 PM

	F29711106: Screen Shot 2019-07-08 at 3.20.47 PM.png
	Jul 8 2019, 7:30 PM

	F29711237: Screen Shot 2019-07-08 at 3.28.16 PM.png
	Jul 8 2019, 7:30 PM

	F29710797: Screen Shot 2019-07-08 at 2.59.40 PM.png
	Jul 8 2019, 7:30 PM
