Profile wikifeeds memory usage for Helm chart
Closed, Resolved (Public)

Assigned To: MSantos
Authored By: Mholloway, Jul 29 2019, 10:14 PM

Description

We need to profile the new wikifeeds service's memory usage in order to get appropriate values to specify in the relevant fields of the Helm chart.

See the example here: T220401#5128786

Details

	Subject	Repo	Branch	Lines +/-
	Add ca-certificates package to build	mediawiki/services/wikifeeds	master	+1 -0
	First version of the wikifeeds chart	operations/deployment-charts	master	+572 -78

Related Objects

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	• Mholloway	T229286 Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration
Resolved	• Mholloway	T170455 Extract the feed endpoints from PCS into a new wikifeeds service
Resolved	MSantos	T229287 Profile wikifeeds memory usage for Helm chart

Event Timeline

• Mholloway renamed this task from Profile wikifeeds to Profile wikifeeds memory usage for Helm chart. Jul 29 2019, 10:14 PM

• Mholloway triaged this task as High priority.

• Mholloway created this task.

• Mholloway assigned this task to MSantos. Jul 30 2019, 4:34 PM

Change 526679 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/deployment-charts@master] WIP: First version of the wikifeeds chart

https://gerrit.wikimedia.org/r/526679

gerritbot added a project: Patch-For-Review. Jul 31 2019, 2:14 PM

• Mholloway mentioned this in T229286: Resolve service instability due to excessive event loop blockage since starting PCS response pregeneration. Jul 31 2019, 6:30 PM

MSantos moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board. Jul 31 2019, 6:57 PM

@akosiaris I'm pasting here some of the questions I sent to you by email regarding the similar work you did with Kask:

Did you profile Kask locally or in the prod server? I see that you used Grafana and that made me wonder.
If you did locally, did you use helm-charts development environment?
If you did it on the server, could you explain to me (or point me to the proper documentation) how to do it?

For posterity's and transparency's sake, pasting the answer I already gave to @MSantos via email

I've written the following as a guide on how to do that:

https://wikitech.wikimedia.org/wiki/User:Alexandros_Kosiaris/Benchmarking_kubernetes_apps

It should be pretty straightforward to follow it, unless you have a
ton of endpoints to benchmark against. In that case, I suggest having
a look at http://locust.io as it can make benchmarking way easier than ab,
assuming some reading and some basic python knowledge (in fact the
graphs in T229287 are based on
that).

But to answer questions specifically as well

Did you profile Kask locally or in the prod server? I see that you used Grafana and that made me wonder.

Locally in minikube, which has grafana built-in (it's the heapster
addon you see there mentioned)

If you did locally, did you use helm-charts development environment?

nope, just minikube with a local checkout of the deployments charts
repo in order to speed up debugging (i.e. cd charts/wikifeeds ; helm
install --set key1=value .)

If you did it on the server, could you explain to me (or point me to the proper documentation) how to do it?

I sure hope the link above helps. It's meant to be moved into a more
proper place this quarter

Thanks for handling this @MSantos

Thanks @akosiaris! The tutorial really helped me with the setup.

I am now facing an odd issue that seems related to the k8s instance I'm running. When hitting some endpoints I got the following error:

{
  status: 504,
  type: "internal_http_error",
  detail: "Error: unable to get local issuer certificate",
  method: "post",
  uri: "https://en.wikipedia.org/w/api.php"
}

It looks like a missing service-runner requirement. @akosiaris, do you have any idea why this could be happening? cc @Pchelolo

In T229287#5384518, @MSantos wrote:
I am now facing an odd issue that seems related to the k8s instance I'm running. When hitting some endpoints I got the following error:
{
  status: 504,
  type: "internal_http_error",
  detail: "Error: unable to get local issuer certificate",
  method: "post",
  uri: "https://en.wikipedia.org/w/api.php"
}
It looks like a missing service-runner requirement. @akosiaris, do you have any idea why this could be happening? cc @Pchelolo

It can't verify the issuer of our certificate because there aren't any CA certificates at all in the image.
Add ca-certificates to the apt packages for the production variant (at the very least) in the Blubber file, as in 1e61ca7d665.
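The diagnosis above can be checked from inside a container by asking the TLS library where it looks for CA certificates and whether anything is there. The sketch below does this with Python's standard `ssl` module. It is only an analogy: wikifeeds itself is a Node.js service, and the function name `ca_bundle_status` is made up for illustration. It assumes the image relies on OpenSSL's default certificate locations.

```python
import os
import ssl


def ca_bundle_status():
    """Report OpenSSL's default CA locations and whether any certificates exist there."""
    paths = ssl.get_default_verify_paths()
    # cafile and capath are None when the default location is absent on disk.
    cafile_present = paths.cafile is not None and os.path.isfile(paths.cafile)
    capath_present = (
        paths.capath is not None
        and os.path.isdir(paths.capath)
        and len(os.listdir(paths.capath)) > 0
    )
    return {
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        "has_ca_certificates": cafile_present or capath_present,
    }


if __name__ == "__main__":
    print(ca_bundle_status())
```

In an image without the ca-certificates package, `has_ca_certificates` would be expected to come back `False`, which matches the "unable to get local issuer certificate" error on outgoing HTTPS requests.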

Change 527160 had a related patch set uploaded (by MSantos; owner: MSantos):
[mediawiki/services/wikifeeds@master] Add ca-certificates package to build

https://gerrit.wikimedia.org/r/527160

Change 527160 abandoned by MSantos:
Add ca-certificates package to build

Reason:
the production image shouldn't install ca-certificates

https://gerrit.wikimedia.org/r/527160

I finished the setup and started some preliminary tests with the /v1/feed/onthisday endpoint; you can see the result in the following image.

I'm using Apache Bench and it's still running; once it has finished I will post the results here. Meanwhile, I'll write a locust.io script to perform a full test with all endpoints.

In T229287#5386764, @MSantos wrote:

I finished the setup and started some preliminary tests with the /v1/feed/onthisday endpoint; you can see the result in the following image.

\o/

I'm using Apache Bench and it's still running; once it has finished I will post the results here. Meanwhile, I'll write a locust.io script to perform a full test with all endpoints.

Nice. Let me know if you need any help

After testing a lot of scenarios here is my profiling report:

First I looked into prod statistics to set the proper load for each endpoint in the locust script P8863:

I performed tests with 750, 1500, 3000 and 6000 users. The 6000-user run came closest to the production request rate, averaging 60 req/sec:

+--------+----------------------------------------------+------------+------------+------------+
| Method |                     Name                     | # requests | # failures | Requests/s |
+--------+----------------------------------------------+------------+------------+------------+
| GET    | /en.wikipedia.org/v1/feed/announcements      |       3614 |        691 |       1.48 |
| GET    | /en.wikipedia.org/v1/feed/onthisday/selected |      27494 |       6417 |      11.22 |
| GET    | /en.wikipedia.org/v1/media/image/featured    |      10217 |       3653 |       4.17 |
| GET    | /en.wikipedia.org/v1/page/featured           |      10544 |       2120 |        4.3 |
| GET    | /en.wikipedia.org/v1/page/most-read          |      10512 |       3967 |       4.29 |
| GET    | /en.wikipedia.org/v1/page/news               |       3488 |        763 |       1.42 |
| GET    | /en.wikipedia.org/v1/page/random/title       |      81186 |      16281 |      33.14 |
| None   | Total                                        |     147055 |      33892 |      60.04 |
+--------+----------------------------------------------+------------+------------+------------+
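The locust script P8863 is not reproduced in this task, so the following is only a sketch of how per-endpoint load weights could be derived from request rates. It uses the Requests/s column of the table above as a stand-in for the production proportions. The helper names `relative_weights` and `pick_endpoint` are hypothetical and come from neither locust nor the original script.

```python
import random

# Requests/s per endpoint, copied from the table above.
REQUEST_RATES = {
    "/en.wikipedia.org/v1/feed/announcements": 1.48,
    "/en.wikipedia.org/v1/feed/onthisday/selected": 11.22,
    "/en.wikipedia.org/v1/media/image/featured": 4.17,
    "/en.wikipedia.org/v1/page/featured": 4.3,
    "/en.wikipedia.org/v1/page/most-read": 4.29,
    "/en.wikipedia.org/v1/page/news": 1.42,
    "/en.wikipedia.org/v1/page/random/title": 33.14,
}


def relative_weights(rates):
    """Scale rates so the least-requested endpoint gets weight 1, then round."""
    floor = min(rates.values())
    return {path: max(1, round(rate / floor)) for path, rate in rates.items()}


def pick_endpoint(rates, rng=random):
    """Choose one endpoint at random, proportionally to its request rate."""
    paths = list(rates)
    return rng.choices(paths, weights=[rates[p] for p in paths], k=1)[0]


if __name__ == "__main__":
    for path, weight in relative_weights(REQUEST_RATES).items():
        print(f"{weight:>3}  {path}")
```

Integer weights of this kind are the sort of value a load-testing tool typically accepts per task. Rounding loses some precision, so the proportional `pick_endpoint` variant keeps the exact ratios.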

About 19% of requests failed, all of them with timeout errors, which is probably a problem with my internet connection.
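The 19% figure can be reproduced from the Total row of the table, but only under one reading of the columns. The sketch below computes the failure share both ways. That "# requests" counts successful requests only is an inference from the arithmetic, not something stated in the thread.

```python
TOTAL_REQUESTS = 147055  # "# requests" in the Total row
TOTAL_FAILURES = 33892   # "# failures" in the Total row


def failure_share(requests, failures, requests_include_failures):
    """Return failures as a fraction of all attempts.

    If `requests` already includes the failed attempts, it is the denominator;
    otherwise the failed attempts are added to it first.
    """
    attempts = requests if requests_include_failures else requests + failures
    return failures / attempts


if __name__ == "__main__":
    separate = failure_share(TOTAL_REQUESTS, TOTAL_FAILURES, False)
    included = failure_share(TOTAL_REQUESTS, TOTAL_FAILURES, True)
    print(f"failures counted separately: {separate:.1%}")
    print(f"failures included in requests: {included:.1%}")
```

Counting failures separately gives roughly 19%, which matches the report. Treating them as part of "# requests" would give roughly 23%.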

During the test, the maximum CPU used was 997m

During the test, the maximum memory used was 1.237GiB
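To turn the two observed peaks into values for a chart, they first have to be expressed as Kubernetes quantity strings. The sketch below shows one possible conversion. Using the peaks as requests, the `headroom` factor, and the key names are all assumptions made for illustration. The thread does not say which values ended up in the chart.

```python
import math

PEAK_MEMORY_GIB = 1.237    # maximum memory reported above
PEAK_CPU_MILLICORES = 997  # maximum CPU reported above ("997m")


def gib_to_mi(gib):
    """Convert GiB to whole MiB, rounding up so the observed peak still fits."""
    return math.ceil(gib * 1024)


def resources_from_peaks(memory_gib, cpu_millicores, headroom):
    """Build a requests/limits mapping from observed peaks.

    `headroom` is a hypothetical safety factor chosen by the operator.
    """
    return {
        "requests": {
            "cpu": f"{cpu_millicores}m",
            "memory": f"{gib_to_mi(memory_gib)}Mi",
        },
        "limits": {
            "cpu": f"{math.ceil(cpu_millicores * headroom)}m",
            "memory": f"{gib_to_mi(memory_gib * headroom)}Mi",
        },
    }


if __name__ == "__main__":
    print(resources_from_peaks(PEAK_MEMORY_GIB, PEAK_CPU_MILLICORES, headroom=2))
```

Rounding up means a limit derived this way never falls below the measured peak.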

Thanks for running this.

My only point is that the max CPU looks suspiciously close to 1, which is the default value in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/_scaffold/values.yaml#25. This could be artificially limiting the app and could explain the errors. If you have already set it to a higher value during your benchmarking, disregard the next sentence. Otherwise, you might want to bump it (considerably, say to 10) and rerun the benchmark.

Great work, thanks a lot!

In T229287#5395623, @akosiaris wrote:

Thanks for running this.

My only point is that the max CPU looks suspiciously close to 1, which is the default value in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/_scaffold/values.yaml#25. This could be artificially limiting the app and could explain the errors. If you have already set it to a higher value during your benchmarking, disregard the next sentence. Otherwise, you might want to bump it (considerably, say to 10) and rerun the benchmark.

Great work, thanks a lot!

I ran into that problem in the beginning, then I set the CPU limit to 2 and memory to 6GB. Do you think CPU should be even higher?

In T229287#5395752, @MSantos wrote:

I ran into that problem in the beginning, then I set the CPU limit to 2 and memory to 6GB. Do you think CPU should be even higher?

If you already did, then it's fine. With ~1cpu usage while the limit is 2 we are good to go.
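The reasoning in this exchange amounts to comparing peak CPU usage against the configured limit: a peak right at the limit suggests throttling, while a peak well below it does not. A minimal sketch of that comparison follows. The helper names and the 0.95 cut-off are assumptions for illustration, not a Kubernetes rule.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity such as '997m' or '2' to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def cpu_utilisation(peak, limit):
    """Return peak usage as a fraction of the configured limit."""
    return parse_cpu(peak) / parse_cpu(limit)


def looks_throttled(peak, limit, threshold=0.95):
    """Flag a peak that sits at or near the limit (threshold is hypothetical)."""
    return cpu_utilisation(peak, limit) >= threshold


if __name__ == "__main__":
    for limit in ("1", "2"):
        share = cpu_utilisation("997m", limit)
        print(f"limit={limit}: {share:.1%} used, throttled? {looks_throttled('997m', limit)}")
```

With the scaffold default limit of 1, a 997m peak would sit at the ceiling. With the limit of 2 used in the benchmark it is about half of the limit, which is why the result was accepted.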

@Pchelolo and @mobrovac, once https://gerrit.wikimedia.org/r/526679 has landed, what are the next steps?

From Audiences platform sync:

Ask Alex
Need to first get into production and beta
Then make RESTBase changes to provide a hostname for it

@akosiaris I believe the related patch is ready to land. What do you think?

@MSantos I had a look and found some issues. I've also tested it with helm install --set main_app.version=2019-08-01-194023-production . after applying my suggested changes. It seems to be working fine after that.

Change 526679 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] First version of the wikifeeds chart

https://gerrit.wikimedia.org/r/526679

akosiaris mentioned this in rDEPLOYCHARTS1a5316b6ec1d: First version of the wikifeeds chart. Sep 4 2019, 12:07 PM

Maintenance_bot removed a project: Patch-For-Review. Sep 4 2019, 12:10 PM

Moving to sign-off, since this task is about profiling. The deployment of wikifeeds should be tracked in T170455: Extract the feed endpoints from PCS into a new wikifeeds service.

MSantos closed this task as Resolved. Sep 18 2019, 4:12 PM

Change 527160 restored by Mholloway:
Add ca-certificates package to build

Reason:
per the comment from akosiaris

https://gerrit.wikimedia.org/r/527160

Change 527160 merged by jenkins-bot:
[mediawiki/services/wikifeeds@master] Add ca-certificates package to build

https://gerrit.wikimedia.org/r/527160

MSantos mentioned this in rWSWF5b90177f9c3e: Add ca-certificates package to build. Feb 3 2020, 10:22 PM

	F29967433: image.png
	Aug 5 2019, 8:25 PM

	F29967449: image.png
	Aug 5 2019, 8:25 PM

	F29967439: image.png
	Aug 5 2019, 8:25 PM

	F29935448: image.png
	Aug 2 2019, 12:13 AM
