
Perform load-testing in Beta
Closed, ResolvedPublic

Authored By: sdkim, Aug 19 2020, 3:33 PM
Referenced Files

F32350817: Screenshot_2020-09-14 Test Report(1).png (Sep 14 2020, 6:39 PM)
F32350815: Screenshot_2020-09-14 Test Report.png (Sep 14 2020, 6:39 PM)
F32242311: image.png (Sep 3 2020, 10:02 AM)
F32243231: image.png (Sep 3 2020, 10:02 AM)
F32243253: image.png (Sep 3 2020, 10:02 AM)
F32242307: image.png (Sep 3 2020, 10:02 AM)
F32243237: image.png (Sep 3 2020, 10:02 AM)
F32243246: image.png (Sep 3 2020, 10:02 AM)

Description

Open questions

Acceptance criteria

  • Retrieve benchmarks to compare the performance test against
  • Re-do the Kubernetes benchmarking LOCALLY
  • Re-do the Kubernetes benchmarking against BETA clusters
  • Include the APNS URL

Out of Scope

  • Not including Echo

Notes

  • We can potentially use the script that Mateus used in a previous beta test.

Conclusion

See reports T260807#6432860 and T260807#6459980

A follow-up patch to deployment-charts will be pushed with the changes needed to run the service with Prometheus metrics collection and the new configuration.

No follow-up issues were found during tests.

Event Timeline

Regarding Prometheus on beta:

  • There is a Prometheus node running in the beta cluster
  • The push-notifications instance is running the node exporter for system metrics
  • There is network flow from Prometheus to push-notifications for the node metrics
  • There is no scrape_config for the Node.js service; this means some extra configuration is needed before Prometheus can pull metrics from our push-notifications service (see the sketch after this list)
  • If we add a new exporter for the service-specific metrics, we will need to change the security group rules to allow traffic to the exporter
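
For illustration, here is a minimal sketch of the exporter side, assuming a hypothetical port (9102) and hypothetical metric names. The real service is Node.js and would use a library such as prom-client; this Python equivalent built on prometheus_client just shows the kind of /metrics endpoint the missing scrape_config would need to target:

```
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

# Hypothetical metric names, mirroring what the load test below reports.
QUEUE_TIME = Histogram(
    "push_notifications_queue_time_seconds",
    "Time a notification spends queued before a flush",
)
QUEUE_SIZE = Gauge(
    "push_notifications_queue_size",
    "Number of notifications currently queued",
)

if __name__ == "__main__":
    start_http_server(9102)  # serves /metrics for Prometheus to scrape
    while True:
        # Fake samples standing in for real service activity.
        QUEUE_SIZE.set(random.randint(3500, 4500))
        QUEUE_TIME.observe(random.uniform(0.3, 0.5))
        time.sleep(1)
```

A matching scrape_config entry would then point Prometheus at that port, and, as noted above, the security group rules would need to allow the traffic.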

The local load test is finished; here is the data collected:

  • Queue time averaged 400 ms and was very consistent

image.png (284×428 px, 15 KB)

  • Queue size on flush averaged 4009 items (see the queue sketch after these results)

image.png (280×432 px, 23 KB)

  • Transaction time with APNS and FCM was almost constant at 200 ms for both providers

image.png (284×876 px, 19 KB)

  • The test included sending a small number of fake tokens to force send failures, but the FCM metrics showed odd behavior: duplicated counts for both success and failure

image.png (284×883 px, 30 KB)

  • The service performed well up to 400 req/s, where it saturated; no HTTP errors were registered at this rate

image.png (673×880 px, 63 KB)

  • The tests showed a latency spike at the beginning and end of the script, but latency averaged 300 ms overall, with p99 at 1 s and p90 at 700 ms

image.png (758×864 px, 72 KB)

image.png (288×876 px, 64 KB)

  • CPU and memory usage stabilized beyond the established limits: the memory needed increased by 100Mi and the CPU needed decreased by 1000m

image.png (663×870 px, 129 KB)
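
To make the queue numbers above concrete, here is a minimal sketch of a size/time-bounded flush queue of the kind those metrics describe. The class name, thresholds, and flush behavior are assumptions for illustration, not taken from the push-notifications code:

```
import time

class BatchQueue:
    """Hypothetical batch queue; a real implementation would also flush
    on a timer instead of only checking age when an item is pushed."""

    def __init__(self, max_size=4000, max_wait_s=0.4):
        self.max_size = max_size      # flush once this many items are queued
        self.max_wait_s = max_wait_s  # ...or once the oldest item is this old
        self.items = []
        self.oldest = None

    def push(self, item):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.items.append(item)
        if (len(self.items) >= self.max_size
                or time.monotonic() - self.oldest >= self.max_wait_s):
            self.flush()

    def flush(self):
        queue_time_ms = (time.monotonic() - self.oldest) * 1000  # ~400 ms above
        queue_size = len(self.items)                             # ~4009 above
        print(f"flushing {queue_size} items after {queue_time_ms:.0f} ms")
        # A real service would hand the batch off to APNS/FCM here.
        self.items, self.oldest = [], None
```

With thresholds like these, the observed averages (~400 ms queue time, ~4009 items per flush) suggest both the size and time bounds were being reached at roughly the same moment under load.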

A follow-up patch to deployment-charts will be pushed with the changes needed to run the service with Prometheus metrics collection and the new configuration.

Change 624012 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/deployment-charts@master] push-notif: drop support to statsd-exporter

https://gerrit.wikimedia.org/r/624012

Change 624012 merged by jenkins-bot:
[operations/deployment-charts@master] push-notif: drop support to statsd-exporter

https://gerrit.wikimedia.org/r/624012

So, I tried to configure Prometheus and Grafana for the beta cluster, but without success. I was still able to perform the load test, but could only collect the metrics available in Locust.
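
For reference, a minimal Locust script for this kind of test could look like the sketch below; the endpoint path, payload, and pacing are assumptions for illustration, not Mateus's actual script:

```
from locust import HttpUser, task, between

class PushNotificationsUser(HttpUser):
    # Tune the user count at runtime to hit a target rate such as ~140
    # or ~320 req/s; wait_time paces each simulated user.
    wait_time = between(0.1, 0.5)

    @task
    def send_message(self):
        # Hypothetical endpoint and payload for the push-notifications service.
        self.client.post("/v1/message", json={
            "deviceToken": "fake-token-for-load-testing",
            "messageType": "test",
        })
```

Running it with `locust -f locustfile.py --host <beta service URL>` produces Locust's built-in report with req/s, error rates, and latency percentiles, like the two reports below.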

The latency results do not reflect production, where requests will be made internally within each cluster. Still, the tests were consistent, with similar error rates and requests per second. Here are two reports for different loads:

138.7 req/s:

Screenshot_2020-09-14 Test Report.png (3×2 px, 365 KB)

322.7 req/s:

Screenshot_2020-09-14 Test Report(1).png (3×2 px, 319 KB)