
Profile proton memory usage for Helm chart
Open, High, Public

Description

Background information

T225680: Migrate Proton to k8s

What

We need to profile the proton service's memory usage in order to get appropriate values to specify in the relevant fields of the Helm chart.
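
For context, the relevant fields are the Kubernetes resource requests and limits that the chart sets for the Proton pods. A minimal sketch of the shape of those fields, with placeholder values that the profiling is meant to replace (the exact key layout in the Proton chart may differ):

    # Placeholder numbers only; this task exists to replace them with
    # values derived from profiling under production-like load.
    resources:
      requests:
        cpu: 1
        memory: 1Gi
      limits:
        cpu: 1
        memory: 1Gi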

How

See the example here: T220401#5128786

Acceptance criteria

  • Profiling takes production load into account
  • Proton's deployment chart is reviewed and accepted

Details

Related Gerrit Patches:
operations/deployment-charts (master): WIP: Proton charts first draft

Event Timeline

MSantos created this task. Nov 21 2019, 2:18 PM
Restricted Application added a subscriber: Aklapper. Nov 21 2019, 2:18 PM

Change 557090 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/deployment-charts@master] WIP: Proton charts first draft

https://gerrit.wikimedia.org/r/557090

MSantos added subscribers: pmiazga, phuedx. Edited Thu, Jan 9, 8:04 PM

The first tests indicate that the deployment-chart draft doesn't have the proper configuration to run the service.

My test script is hitting a very high failure rate at a low RPS: 259 errors out of 279 requests at an average of 0.2 RPS.

The most common error is

HTTPError('500 Server Error: Internal Server Error for url: http://192.168.99.100:32401/en.wikipedia.org/v1/pdf/${TITLE}/a4',)

The odd thing is that the available resources are not being consumed entirely.



Additional data regarding the test script performance:


I still can't understand the underlying issue, so I would appreciate thoughts from people with more experience with the deployment-charts and the proton service. cc/ @akosiaris, @pmiazga, and @phuedx (thanks in advance)

@MSantos where were the tests run? And how was the draft chart deployed? My reason for asking was that if this is tested locally and talks to wikipedia.org APIs, the latency added over the internet could adversely affect the benchmark.

@MSantos where were the tests run? And how was the draft chart deployed? My reason for asking was that if this is tested locally and talks to wikipedia.org APIs, the latency added over the internet could adversely affect the benchmark.

That's exactly the case. I don't know of another possible path (beta cluster?), which is why I've been benchmarking it locally.

phuedx removed a subscriber: phuedx. Thu, Jan 16, 6:54 PM
akosiaris added a comment. Edited Fri, Jan 17, 10:50 AM

@MSantos where were the tests run? And how was the draft chart deployed? My reason for asking was that if this is tested locally and talks to wikipedia.org APIs, the latency added over the internet could adversely affect the benchmark.

That's exactly the case. I don't know of another possible path (beta cluster?), which is why I've been benchmarking it locally.

We don't currently have a good path for that, which is probably why.

I did some benchmarking as well from a test host inside the cluster. That rules out the network latency issue.

I also failed to get anywhere above 0.3 RPS. I've used your locustfile with 10 users and a spawn rate of 1. The application started returning errors very early on. I have the output of kubectl logs -l app=chromium-render | jq .msg | sort | uniq -c | sort -rn at P10203. I am not sure what to make of it.

akosiaris added a subscriber: Joe. Fri, Jan 17, 12:02 PM

@Joe pointed out to me that in my paste there is also "Unexpected error: Error: spawn ENOMEM" (which I should have seen). Bumping the memory limit allows increasing the locust "users" to 50 with a ~1% failure rate and a peak of ~1.3 RPS. Memory usage seems to peak at ~4GB, and then the workload becomes CPU bound, maxing out CPU usage. The RPS is not exactly great for the amount of resources it is consuming, but at least it's not erroring out all the time. Pics follow.

CPU and memory usage of the specific pod. Note how the memory peaks at ~4GB. So a sane value for our limit here seems to be 4.5GB. Given the very low RPS the pod is able to respond to, I would say the requests (which overall should represent the average case) should be similar.

CPU-wise, the workload is currently CPU bound. It tops out at 9 CPUs (which is all I have to give currently). Every additional CPU seems to provide a very mild increase in RPS (maybe 0.2?). I am a bit ambivalent about this. For now I'd say let's keep both requests and limits at 5 CPUs, and we'll just add more capacity pod-wise to react to traffic patterns.
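
For illustration, a minimal sketch of how those numbers might be expressed in the chart's values file. The key layout is an assumption based on common Kubernetes conventions, not the actual Proton chart; only the figures come from the benchmark above.

    # Hypothetical values excerpt; only the numbers are taken from the benchmark.
    resources:
      requests:
        cpu: 5          # requests kept equal to limits for now
        memory: 4.5Gi   # observed peak ~4GB, plus some headroom
      limits:
        cpu: 5
        memory: 4.5Gi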

These are the locust charts and stats throughout my tests. The point at ~1:32PM where the red failures drop is where I bumped the memory significantly.

FWIW, increasing the locust "users" to 100 bumped the failures to 15%; response times increased and RPS dropped below 1.

Interesting to me is something that I overlooked before: the response times. During the "good" period (the one without failures), the median response time never went below 30000ms, which is a lot.

@akosiaris interesting! Thanks for testing it. Maybe a better path for the benchmark would be to deploy proton alongside a mediawiki pod and change the configuration to reflect the new URL for fetching wiki pages. What do you think? Is that worth doing?

@akosiaris interesting! Thanks for testing it. Maybe a better path for the benchmark would be to deploy proton alongside a mediawiki pod and change the configuration to reflect the new URL for fetching wiki pages.

Overall? Yes, that would at least remove the effects of the latency over the internet.

What do you think? Is that worth doing?

Depends. If you plan to benchmark more while developing the chart, yes. Otherwise I more or less already got you the numbers.

I just noticed the num_workers: ncpu setting in the chart. Sigh, this probably makes all CPU calculations wrong, as it is impossible to size a pod whose CPU usage depends on the underlying hardware. I'll have to rerun those tests with values of 1, 2, and 3.

I just noticed the num_workers: ncpu setting in the chart. Sigh, this probably makes all CPU calculations wrong, as it is impossible to size a pod whose CPU usage depends on the underlying hardware. I'll have to rerun those tests with values of 1, 2, and 3.

I assumed the pod size would be determined by the chart's values and that "ncpu" would fill all the CPU available to that pod. Is that right?
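
To make the num_workers discussion concrete, here is a sketch of the kind of change being considered: pinning the service-runner worker count to a fixed number instead of ncpu, so the pod's CPU footprint no longer depends on the node it lands on. The surrounding key name is illustrative, not the actual Proton chart layout.

    # Hypothetical config excerpt; only the num_workers change reflects
    # the discussion above.
    config:
      num_workers: 2   # fixed value to benchmark (1, 2, 3) instead of "ncpu",
                       # which scales with the underlying node's CPU count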