
Introduce kask session storage service to kubernetes
Open, Normal, Public

Description

Subtask of the TEC3:O3:O3.1:Q4 goal of introducing kask via the deployment pipeline

Event Timeline

akosiaris created this task. Apr 8 2019, 2:42 PM
akosiaris triaged this task as Normal priority.
Dzahn added a subscriber: Dzahn. Apr 16 2019, 11:48 PM

Change 505263 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] First version of the kask chart

https://gerrit.wikimedia.org/r/505263

I did some benchmarking and here are some first (rather impressive) numbers for kask.

This is with 750 simultaneous simulated users via the locust.io Python framework. P8425 holds the code.

It's an artificial benchmark using ASCII strings of 32 characters for both key and value, with a mix of 94% GET, 5% POST and 1% DELETE requests. The ratios come out of my own RNG; I am guessing they are rather close to reality, but we can of course alter them. There is also a very small number of monitoring requests (a 1/50 ratio of monitoring to normal requests).
The service is a single instance of kask and a single, untuned, straight-out-of-the-box Cassandra instance; the only action taken was to create the schema (keyspace + table).
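For reference, a locustfile along these lines reproduces the request mix described above. This is only a sketch, not the actual P8425 paste: the health-check path, wait times and the weight of the monitoring task are assumptions; the /sessions/v1/ path is the one the service exposes.

```python
import random
import string

from locust import HttpUser, between, task


def rand_str(n=32):
    """Random 32-character ASCII string, used for both keys and values."""
    return "".join(random.choices(string.ascii_lowercase, k=n))


class KaskUser(HttpUser):
    # Pacing is an assumption; the actual paste may use different wait times.
    wait_time = between(0.1, 0.5)

    @task(94)
    def read_session(self):
        # Random keys will mostly miss; the real benchmark presumably read back
        # previously written keys, so treat this as an illustration of the mix only.
        self.client.get(f"/sessions/v1/{rand_str()}", name="GET /sessions/v1/{key}")

    @task(5)
    def write_session(self):
        self.client.post(
            f"/sessions/v1/{rand_str()}",
            data=rand_str(),
            headers={"Content-Type": "application/octet-stream"},
            name="POST /sessions/v1/{key}",
        )

    @task(1)
    def delete_session(self):
        self.client.delete(f"/sessions/v1/{rand_str()}", name="DELETE /sessions/v1/{key}")

    @task(2)
    def monitoring(self):
        # Health-check path is an assumption; a weight of 2 against ~100 "normal"
        # tasks approximates the 1/50 monitoring-to-normal ratio.
        self.client.get("/healthz", name="GET /healthz")
```

Pointed at the service with locust -f locustfile.py --host http://<service>:<port> and ramped up to 750 users, this yields the request distribution used here.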

With those 750 simultaneous simulated users the service hasn't yet returned a failure for any kind of request. I'll keep pushing it to see when it does, but by the looks of it the limit won't be due to the service but rather due to Cassandra. Remains to be seen.

CPU-wise we have the following graph: (CPU usage graph not reproduced here)

Memory-wise the following: (memory usage graph not reproduced here)

And locust provides us with the following: (locust statistics not reproduced here)

Overall, well before we reached the artificial 750-user target (at around 450 users, actually), the service reached a stable mean of ~300 successful req/s. There are spikes of up to 350 req/s, but these are followed by lows of 250 req/s, creating a saw-tooth pattern. Latency follows a similar pattern, which is consistent with the above behavior.

Memory usage barely reached 45 MB; CPU usage barely reached 130 millicores.

I think we have the numbers we need for the helm chart.

After that, response latency starts increasing gradually with the number of users added, though without any failures.

Just for posterity's sake: at ~1500 simulated users the service started to crumble, returning

HTTPError('500 Server Error: Internal Server Error for url: http://192.168.99.100:57958/sessions/v1/xwdmubncprzxtndfcyffnttsqzpsiulb',)

with the service logging

{"msg":"Error reading from storage (gocql: no response received from cassandra within timeout period)","appname":"sessions","time":"2019-04-22T19:19:29Z","level":"ERROR","request_id":"00000000-0000-0000-0000-000000000000"}

This is fine; that's why we ran this benchmark. The numbers are already pretty impressive. Memory and CPU usage (which is what this entire benchmark was about) did not really change (barely 60 MB of memory; CPU remained largely the same). The number of req/s successfully served peaked at ~400.

For production, we should keep in mind that each pod should not serve more than 250 req/s if we want to stay in an optimal state, with low latency and some room to spare.
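As a back-of-the-envelope illustration of that guideline (the 250 req/s ceiling comes from the numbers above; the headroom factor and the helper itself are made up for the example):

```python
import math

# Comfortable per-pod ceiling observed in the benchmark (req/s).
PER_POD_REQ_S = 250


def replicas_for(peak_req_s: float, headroom: float = 1.25) -> int:
    """Pods needed to stay under the per-pod ceiling at peak, with some slack."""
    return math.ceil(peak_req_s * headroom / PER_POD_REQ_S)


# An expected peak of 1000 req/s would call for 5 replicas.
print(replicas_for(1000))
```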

Change 505263 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] First version of the kask chart

https://gerrit.wikimedia.org/r/505263

Change 506104 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Publish the kask chart in the repo

https://gerrit.wikimedia.org/r/506104

Change 506104 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Publish the kask chart in the repo

https://gerrit.wikimedia.org/r/506104

Change 506110 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/kask@master] Add helm.yaml file for use by the pipeline

https://gerrit.wikimedia.org/r/506110

@Eevans @Clarakosi: the chart has been merged and is published. The only thing missing before we can move on to the deployment is the swagger/OpenAPI spec, so that service-checker[1] can run and monitor this service.

[1] https://github.com/wikimedia/operations-software-service-checker
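For context, service-checker drives its checks from that spec. In the meantime, a hand-rolled smoke test along these lines exercises the same write/read/delete round trip a check would; the base URL, port and Content-Type are assumptions, only the /sessions/v1/ path comes from the service itself.

```python
import uuid

import requests

BASE = "http://localhost:8081"  # host and port are assumptions for a local instance

key = uuid.uuid4().hex
value = uuid.uuid4().hex

# Write a value, read it back, then delete it.
resp = requests.post(
    f"{BASE}/sessions/v1/{key}",
    data=value,
    headers={"Content-Type": "application/octet-stream"},
    timeout=5,
)
resp.raise_for_status()

resp = requests.get(f"{BASE}/sessions/v1/{key}", timeout=5)
resp.raise_for_status()  # the exact response body format is not checked here

resp = requests.delete(f"{BASE}/sessions/v1/{key}", timeout=5)
resp.raise_for_status()

print("round trip OK")
```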

Eevans moved this task from Backlog to In-Progress on the User-Eevans board. Thu, Apr 25, 3:48 PM

Thanks @akosiaris; I'll take care of that presently!

fsero added a subscriber: fsero. Mon, Apr 29, 4:07 PM

Change 507531 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/software/service-checker@master] Handle application/octet-stream requests properly; release v0.1.5

https://gerrit.wikimedia.org/r/507531

Change 509102 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] kask: Add incubator/cassandra subchart

https://gerrit.wikimedia.org/r/509102

Change 509102 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] kask: Add incubator/cassandra subchart

https://gerrit.wikimedia.org/r/509102

@Clarakosi @Eevans: I've updated the chart to also conditionally install a minimal Cassandra for use in minikube. In my tests I was able to use it and even run some benchmarks against it. On our side we are ready to proceed with deployment to staging and then production as soon as https://gerrit.wikimedia.org/r/#/c/mediawiki/services/kask/+/507397/ is merged.

I do have one question: we have a sessionstore Cassandra ready for production, but what about staging? As far as I know, we don't have one. So what do we do?

A couple of ideas:

  • Reuse the current sessionstore Cassandra instances, albeit with a different keyspace.
  • Set up new Cassandra instances (in the -b, -c manner we have for other things) on the sessionstore hosts.
  • Some variation of the above, with some other Cassandra hosts rather than the sessionstore ones.

Staging is currently not exposed to the public (nor are there any immediate plans to do so) and sees zero traffic. It exists just so we can upgrade the software on it right before doing so in production, hopefully increasing our confidence that an upgrade is not going to melt the world. So, load-wise I am not particularly worried.

It also mimics production very closely (unlike, say, beta/deployment-prep) and has the same security guarantees and assumptions, so security should not be a consideration either.

Opinions?

Change 507531 merged by Giuseppe Lavagetto:
[operations/software/service-checker@master] Handle application/octet-stream requests properly; release v0.1.5

https://gerrit.wikimedia.org/r/507531

> [...] On our side we are ready to proceed with deployment to staging and then production as soon as https://gerrit.wikimedia.org/r/#/c/mediawiki/services/kask/+/507397/ is merged.

... aaaand when service-checker v0.1.5 is packaged and in our APT :)

> I do have one question: we have a sessionstore Cassandra ready for production, but what about staging? As far as I know, we don't have one. So what do we do?

@Joe suggested using the restbase-dev Cassandra cluster, which I'd +1.

Change 510743 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/docker-images/production-images@master] Bump service-checker docker container image

https://gerrit.wikimedia.org/r/510743

Mentioned in SAL (#wikimedia-operations) [2019-05-16T15:59:13Z] <akosiaris> build service-checker OCI container 0.0.2 with 0.1.5 service-checker version T220401

Change 510743 merged by Alexandros Kosiaris:
[operations/docker-images/production-images@master] Bump service-checker docker container image

https://gerrit.wikimedia.org/r/510743