
Profile proton memory usage for Helm chart
Closed, ResolvedPublic

Description

Background information

T225680: Migrate Proton to k8s

What

We need to profile the proton service's memory usage in order to get appropriate values to specify in the relevant fields of the Helm chart.
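
For orientation, the kind of fields this profiling needs to inform would look roughly like the sketch below. The layout loosely follows the deployment-charts values.yaml conventions, but the exact keys may differ and the numbers are placeholders rather than measured values.

```
# Hypothetical excerpt of the chart's values.yaml. Key names and numbers
# are placeholders; the profiling in this task is meant to produce the
# real figures.
main_app:
  requests:        # what the scheduler reserves for each pod
    cpu: 1
    memory: 500Mi
  limits:          # hard ceiling enforced by the kernel (throttle / OOM kill)
    cpu: 1
    memory: 500Mi
```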

How

See the example here: T220401#5128786

Acceptance criteria

  • Profiling considered production load
  • Proton's deployment chart is reviewed and accepted

Event Timeline

MSantos created this task.Nov 21 2019, 2:18 PM
Restricted Application added a subscriber: Aklapper.Nov 21 2019, 2:18 PM

Change 557090 had a related patch set uploaded (by MSantos; owner: MSantos):
[operations/deployment-charts@master] WIP: Proton charts first draft

https://gerrit.wikimedia.org/r/557090

MSantos added subscribers: pmiazga, phuedx.EditedJan 9 2020, 8:04 PM

The first tests indicate that the deployment-chart draft doesn't have the proper configuration to run the service.

My test script is hitting too many failed requests even at a low RPS: 259 errors out of 279 requests at an average of 0.2 RPS.

The most common error is:

HTTPError('500 Server Error: Internal Server Error for url: http://192.168.99.100:32401/en.wikipedia.org/v1/pdf/${TITLE}/a4',)

The odd thing is that the available resources are not being consumed entirely.



Additional data regarding the test script performance:


I still can't understand the underlying issue, so I would appreciate some thoughts from people more experienced with the deployment-charts and the proton service. cc/ @akosiaris @pmiazga and @phuedx (thanks in advance)

@MSantos where were the tests run? And how was the draft chart deployed? My reason for asking is that if this is tested locally and talks to the wikipedia.org APIs, the latency added over the internet could adversely affect the benchmark.

> @MSantos where were the tests run? And how was the draft chart deployed? My reason for asking is that if this is tested locally and talks to the wikipedia.org APIs, the latency added over the internet could adversely affect the benchmark.

That's exactly the case. I don't know of another possible path (beta cluster?), which is why I've been benchmarking it locally.

phuedx removed a subscriber: phuedx.Jan 16 2020, 6:54 PM
akosiaris added a comment.EditedJan 17 2020, 10:50 AM

>> @MSantos where were the tests run? And how was the draft chart deployed? My reason for asking is that if this is tested locally and talks to the wikipedia.org APIs, the latency added over the internet could adversely affect the benchmark.
>
> That's exactly the case. I don't know of another possible path (beta cluster?), which is why I've been benchmarking it locally.

We don't currently have a good path for that, which is probably why.

I did some benchmarking as well from a test host inside the cluster. That rules out the network latency issue.

I failed to get anywhere above 0.3 RPS successfully as well. I've used your locustfile with 10 users and a spawn rate of 1. The application started returning errors very early on. I have the output of kubectl logs -l app=chromium-render | jq .msg | sort | uniq -c | sort -rn at P10203. I am not sure what to make of it.

akosiaris added a subscriber: Joe.Jan 17 2020, 12:02 PM

@Joe pointed out to me that in my paste there is also "Unexpected error: Error: spawn ENOMEM" (which I should have seen). Bumping the memory limit allows increasing the locust "users" to 50 with a ~1% failure rate and a peak of ~1.3 RPS. Memory usage seems to peak at ~4GB and then the workload becomes CPU bound, maxing out CPU usage. The RPS is not exactly great for the amount of resources it is consuming, but at least it's not erroring out all the time. Pics follow.

CPU and memory usage of the specific pod. Note how the memory peaks at ~4GB, so a sane value for our limit here seems to be 4.5GB. Given the very low RPS the pod is able to respond to, I would say the requests (which overall should represent the average case) should be similar.

CPU-wise, this workload is currently CPU bound. It tops out at 9 CPUs (which is all I have to give currently). Every new CPU seems to provide a very mild increase in RPS (maybe 0.2?). I am a bit ambivalent about this. For now I'd say let's keep both requests and limits at 5 and we'll just add more capacity pod-wise to react to traffic patterns.
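
In chart terms, that interim sizing would translate into something like the sketch below (assuming the usual deployment-charts values.yaml layout; the numbers come from the benchmark above and get revisited later in this task).

```
# Interim sizing sketch: memory peaked at ~4GB, so ~4.5GB leaves some
# headroom; CPU requests are kept equal to limits at 5, per the reasoning
# above. Key names are illustrative.
main_app:
  requests:
    cpu: 5
    memory: 4608Mi   # ~4.5GB
  limits:
    cpu: 5
    memory: 4608Mi   # ~4.5GB
```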

These are the locust charts and stats throughout my tests. The point at ~1:32PM where the red failures drop is where I bumped the memory significantly.

FWIW, increasing the locust "users" to 100 bumped the failures to 15%, response times increased, and RPS dropped below 1.

Interesting to me is a fact I overlooked before: the response times. During the "good" period (the one without failures), the median response time never went below 30000ms, which is a lot.

@akosiaris interesting! Thanks for testing it. Maybe a better path for the benchmark would be to deploy proton alongside a mediawiki pod and change the configuration to reflect the new URL for fetching wiki pages. What do you think? Is it worth doing?

> @akosiaris interesting! Thanks for testing it. Maybe a better path for the benchmark would be to deploy proton alongside a mediawiki pod and change the configuration to reflect the new URL for fetching wiki pages.

Overall? Yes, that would at least remove the effects of the latency over the internet.

> What do you think? Is it worth doing?

Depends. If you plan to benchmark more while developing the chart, yes. Otherwise, I've more or less already got you the numbers.

I just noticed the num_workers: ncpu setting in the chart. Sigh, this probably makes all CPU calculations wrong, as it is impossible to size a pod whose CPU usage depends on the underlying hardware. I'll have to rerun those tests with values of 1, 2, and 3.

> I just noticed the num_workers: ncpu setting in the chart. Sigh, this probably makes all CPU calculations wrong, as it is impossible to size a pod whose CPU usage depends on the underlying hardware. I'll have to rerun those tests with values of 1, 2, and 3.

I assumed pod size would be determined by the chart values and "ncpu" would fill all the CPU available to that pod, is that right?

>> I just noticed the num_workers: ncpu setting in the chart. Sigh, this probably makes all CPU calculations wrong, as it is impossible to size a pod whose CPU usage depends on the underlying hardware. I'll have to rerun those tests with values of 1, 2, and 3.
>
> I assumed pod size would be determined by the chart values and "ncpu" would fill all the CPU available to that pod, is that right?

Unfortunately no :(. The limits values in the chart are the maximum levels the kernel will allow a container to consume (it will kill/throttle the pod if they are exceeded). However, the pod can still see the hardware it is running on, e.g. it knows there are 10 CPUs in the system even if the kernel's scheduler will only allow it to consume 1. The same goes for memory (though some runtimes have better support for detecting that, e.g. Java: https://blogs.oracle.com/java-platform-group/java-se-support-for-docker-cpu-and-memory-limits).
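
To make that concrete, the sketch below illustrates the mismatch (assumed values.yaml layout; key names are illustrative): the limit caps what the container may consume, but does not change what it sees, so a worker count tied to ncpu has to be pinned explicitly.

```
# The kernel enforces limits via cgroups, but the container still sees
# every CPU on the host: with a cpu limit of 1 on a 10-CPU node, an
# "ncpu"-based worker setting would still spawn 10 workers.
main_app:
  limits:
    cpu: 1
    memory: 1Gi
  # Pin the worker count explicitly so pod sizing does not depend on the
  # underlying hardware (exact config path may differ in the real chart).
  num_workers: 2
```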

akosiaris added a comment.EditedJan 20 2020, 11:16 AM

I've rerun the benchmark with values of 1, 2, and 3 for num_workers.

num_workers: 1 diagrams

num_workers: 2 diagrams

num_workers: 3 diagrams

What do we get out of it?

  • In all cases, after a certain number of "users"/requests per second has been reached, the service crumbles. Response times all skyrocket to around 60s (which is the configured timeout anyway) and a lot of responses fail. That is expected; finding that threshold was the point of this process. That being said, we want to avoid the pods reaching that state in production.
  • The number of "users" (or more concretely, requests per second) that can be reliably serviced does not rise proportionally with the number of available CPUs (it does rise somewhat, however). I think this is due to the limit on concurrent render processes (3 in the configuration), which acts as a chokepoint (I might be wrong).
  • The number of requests per second that can be reliably serviced per pod before everything crumbles fluctuates considerably. If we account for some error in the benchmarks and round up/down a bit, we seem to hover around 0.5 regardless of the number of CPUs.
  • https://grafana.wikimedia.org/d/000000563/proton?orgId=1 puts the peak request rate (averaged over 90 days, though) at ~2. At ~0.5 RPS per pod, that means that with 4 pods we should be able to handle the current load. Let's target at least twice that (so 8 pods, to be on the safe side).
  • CPU and memory usage is a tad weird. Having 2 num_workers indeed uses roughly twice the CPU and memory that having 1 num_worker does, as expected; num_workers=3 doesn't, however. I'll re-run those to verify, but 2 is probably the better choice anyway, mostly to keep the ability to schedule this pod even on really small clusters, e.g. minikube/Docker Desktop.
  • With num_workers set to 2, it looks like we want 8 CPUs and 3GB as limits; see the values sketch after this list. The rule about requests matching limits still seems to hold true; at such a low RPS per pod it does make sense.
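
Putting the conclusions above into chart terms, the target configuration would look roughly like the sketch below (assuming the usual deployment-charts values.yaml layout; the key names, including the replicas field, are illustrative rather than verified against the actual chart).

```
# Concluded sizing sketch: num_workers pinned to 2, requests equal to
# limits, and roughly twice the capacity needed for the observed peak of
# ~2 RPS at ~0.5 RPS per pod.
resources:
  replicas: 8      # 4 pods cover the current peak; 8 gives ~2x headroom
main_app:
  num_workers: 2
  requests:
    cpu: 8
    memory: 3Gi
  limits:
    cpu: 8
    memory: 3Gi
```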

It looks like num_workers = 2 performed better if we consider failure rate vs RPS.

I did rerun the num_workers=3 test twice. No big difference. 100 "locust users", spawned at a rate of 0.1/s. After peaking at about 0.5 RPS, errors start happening and latency skyrockets to ~60s. CPU is still around 3K. Memory-wise it has gone up to 4GB after about 1.5h of benchmarking. The funny thing is that memory usage is not plateauing at all, but rather keeps on increasing. I guess this is expected given we use chromium, which is known for being a memory hog. Kubernetes will take care of the memory leaks anyway, by restarting the pod if it goes over its limit.

Nemo_bis added a subscriber: Nemo_bis.

Change 557090 merged by jenkins-bot:
[operations/deployment-charts@master] Add chart for chromium-render

https://gerrit.wikimedia.org/r/557090

Change 577546 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] chromium-render: Package and release 0.0.1

https://gerrit.wikimedia.org/r/577546

Change 577546 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] chromium-render: Package and release 0.0.1

https://gerrit.wikimedia.org/r/577546

akosiaris closed this task as Resolved.Mar 6 2020, 12:22 PM

https://releases.wikimedia.org/charts/ \o/

Resolving. I'll track the creation of namespaces in the parent task.

akosiaris updated the task description. (Show Details)Mar 6 2020, 12:23 PM