
Investigate Kask request latency
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To
Authored By
Eevans
Aug 2 2019, 8:36 PM
Referenced Files
F30391106: image.png
Sep 18 2019, 8:43 PM
F30210590: image.png
Sep 3 2019, 6:14 PM
F30042793: requests_posts-only.json.gz
Aug 16 2019, 6:13 PM
F30042791: requests_gets-only.json.gz
Aug 16 2019, 6:13 PM
F30042790: requests.json.gz
Aug 16 2019, 6:13 PM
F30031302: pprof002.svg
Aug 14 2019, 11:40 PM
F30031299: pprof001.svg
Aug 14 2019, 11:40 PM
F30004379: pprof002.svg
Aug 9 2019, 4:19 PM

Description

T221292: Establish performance of the session storage service demonstrated that ~50% of requests complete in the low single-digit millisecond range, while the remaining 50% take several tens of milliseconds. It seems reasonable to speculate that something aberrant occurs in the latter case, and that latencies in the 2.5-5ms range are possible.


Performance testing/Load generation:

Name                     | File  | Description
wrk.sh                   | P8926 | Wrapper script for wrk
multi-request-json.lua   | P8927 | Lua script to read requests from a requests.json file
requests.json            |       | Mixed workload (100:1): reads: 100000, writes: 1000
requests_gets-only.json  |       | Read requests: 100000
requests_posts-only.json |       | Write requests: 1000
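
The actual format of requests.json is not shown in this task, so the JSON schema (method/path/body) and the /sessions/v1/... key paths below are assumptions; as a rough sketch, a file with a 100:1 read:write mix could be generated with something like:

// genworkload.go: hypothetical generator for a mixed 100:1 requests.json file.
// The field names and paths are assumptions, not the schema multi-request-json.lua actually consumes.
package main

import (
	"encoding/json"
	"fmt"
	"math/rand"
	"os"
)

type request struct {
	Method string `json:"method"`
	Path   string `json:"path"`
	Body   string `json:"body,omitempty"`
}

func main() {
	var reqs []request
	// 100000 reads and 1000 writes, i.e. the 100:1 mix described above.
	for i := 0; i < 100000; i++ {
		reqs = append(reqs, request{Method: "GET", Path: fmt.Sprintf("/sessions/v1/key-%d", i)})
	}
	for i := 0; i < 1000; i++ {
		reqs = append(reqs, request{Method: "POST", Path: fmt.Sprintf("/sessions/v1/key-%d", i), Body: `{"value":"x"}`})
	}
	// Interleave reads and writes rather than batching them.
	rand.Shuffle(len(reqs), func(i, j int) { reqs[i], reqs[j] = reqs[j], reqs[i] })

	f, err := os.Create("requests.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := json.NewEncoder(f).Encode(reqs); err != nil {
		panic(err)
	}
}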

Completion criteria:

Details

Subject | Repo                          | Branch | Lines +/-
        | operations/deployment-charts  | master | +3 -3
        | operations/deployment-charts  | master | +2 -0
        | mediawiki/services/kask       | master | +21 -7
        | operations/deployment-charts  | master | +4 -4
        | operations/deployment-charts  | master | +2 -2
        | operations/deployment-charts  | master | +10 -10
        | operations/deployment-charts  | master | +1 -1
        | operations/deployment-charts  | master | +1 -1
        | mediawiki/services/kask       | master | +7 -10
        | mediawiki/services/kask       | master | +0 -1
        | operations/deployment-charts  | master | +1 -1
        | operations/deployment-charts  | master | +1 -1
        | mediawiki/services/kask       | master | +10 -7
        | operations/deployment-charts  | master | +0 -5
        | operations/deployment-charts  | master | +5 -0
        | operations/deployment-charts  | master | +1 -1
        | mediawiki/services/kask       | master | +1 -0
        | mediawiki/services/kask       | master | +121 -1

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 530220 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] Deploy 2019-08-14-210839-production Docker image

https://gerrit.wikimedia.org/r/530220

Change 530220 merged by Eevans:
[operations/deployment-charts@master] Deploy 2019-08-14-210839-production Docker image

https://gerrit.wikimedia.org/r/530220

Change 530223 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] sessionstore: (Temporarily )use HTTP for liveness

https://gerrit.wikimedia.org/r/530223

Change 530223 merged by Eevans:
[operations/deployment-charts@master] sessionstore: (Temporarily )use HTTP for liveness

https://gerrit.wikimedia.org/r/530223

Here is information from a run in staging where the unusual latency distribution DOES manifest.

$ KASK_URL=http://127.0.0.1:33411 WRK_ARGS="--latency -d20m" ~eevans/wrk.sh 
+ WRK_ARGS=--latency -d20m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=http://127.0.0.1:33411
+ dirname /home/eevans/wrk.sh
+ cd /home/eevans
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d20m -s multi-request-json.lua http://127.0.0.1:33411
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
Running 20m test @ http://127.0.0.1:33411
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    65.69ms   63.31ms 403.77ms   84.69%
    Req/Sec    90.07     52.03   320.00     68.47%
  Latency Distribution
     50%   83.64ms
     75%   98.82ms
     90%  144.10ms
     99%  210.08ms
  212649 requests in 20.00m, 42.93MB read
  Non-2xx or 3xx responses: 84413
Requests/sec:    177.20
Transfer/sec:     36.63KB
------------------------------
10%,3000
20%,3332
30%,3721
40%,4653
50%,83645
60%,91501
70%,96028
80%,102524
90%,144105
99%,210082
duration (micros),1200061302
requests,212649
bytes,45019107
connect errors,0
read errors,0
write errors,0
status errors,84413
timeout errors,0
------------------------------
Graph
Graph w/ context-sensitive call tree
(pprof) top20
Showing nodes accounting for 23.29s, 60.90% of 38.24s total
Dropped 420 nodes (cum <= 0.19s)
Showing top 20 nodes out of 168
      flat  flat%   sum%        cum   cum%
    12.04s 31.49% 31.49%     12.72s 33.26%  syscall.Syscall
     2.72s  7.11% 38.60%      2.72s  7.11%  runtime.futex
     2.32s  6.07% 44.67%      2.32s  6.07%  runtime.epollwait
     1.07s  2.80% 47.46%      2.08s  5.44%  runtime.mallocgc
     0.89s  2.33% 49.79%      0.89s  2.33%  runtime.usleep
     0.46s  1.20% 50.99%      0.46s  1.20%  runtime.heapBitsSetType
     0.46s  1.20% 52.20%      0.89s  2.33%  runtime.mapassign1
     0.40s  1.05% 53.24%      0.40s  1.05%  runtime._ExternalCode
     0.34s  0.89% 54.13%      1.11s  2.90%  runtime.selectgoImpl
     0.31s  0.81% 54.94%      0.32s  0.84%  runtime.mapiternext
     0.31s  0.81% 55.75%      0.31s  0.81%  runtime.memmove
     0.29s  0.76% 56.51%      0.29s  0.76%  runtime/internal/atomic.Xchg
     0.26s  0.68% 57.19%      0.52s  1.36%  runtime.deferreturn
     0.23s   0.6% 57.79%      1.52s  3.97%  runtime.newobject
     0.22s  0.58% 58.37%      0.22s  0.58%  runtime.procyield
     0.21s  0.55% 58.92%      0.66s  1.73%  runtime.lock
     0.21s  0.55% 59.47%      0.21s  0.55%  runtime.memeqbody
     0.19s   0.5% 59.96%      0.20s  0.52%  runtime.freedefer
     0.19s   0.5% 60.46%      0.21s  0.55%  runtime.mapaccess2_faststr
     0.17s  0.44% 60.90%      0.20s  0.52%  runtime.getitab
(pprof)
(pprof) top20
1460ms of 2290ms total (63.76%)
Dropped 113 nodes (cum <= 11.45ms)
Showing top 20 nodes out of 178 (cum >= 330ms)
      flat  flat%   sum%        cum   cum%
     500ms 21.83% 21.83%      530ms 23.14%  syscall.Syscall
     220ms  9.61% 31.44%      220ms  9.61%  runtime.futex
     120ms  5.24% 36.68%      120ms  5.24%  runtime.usleep
     110ms  4.80% 41.48%      110ms  4.80%  runtime.epollwait
      70ms  3.06% 44.54%      440ms 19.21%  runtime.findrunnable
      50ms  2.18% 46.72%       50ms  2.18%  sync.(*Mutex).Lock
      40ms  1.75% 48.47%       80ms  3.49%  runtime.runqgrab
      30ms  1.31% 49.78%       30ms  1.31%  memeqbody
      30ms  1.31% 51.09%       30ms  1.31%  runtime.casgstatus
      30ms  1.31% 52.40%       90ms  3.93%  runtime.mallocgc
      30ms  1.31% 53.71%       30ms  1.31%  runtime.memmove
      30ms  1.31% 55.02%       90ms  3.93%  runtime.newobject
      30ms  1.31% 56.33%       30ms  1.31%  runtime.runqempty
      30ms  1.31% 57.64%      110ms  4.80%  runtime.runqsteal
      30ms  1.31% 58.95%       80ms  3.49%  runtime.selectgo
      30ms  1.31% 60.26%       40ms  1.75%  time.now
      20ms  0.87% 61.14%       20ms  0.87%  crypto/aes.encryptBlockAsm
      20ms  0.87% 62.01%       20ms  0.87%  encoding/json.stateInString
      20ms  0.87% 62.88%      320ms 13.97%  github.com/gocql/gocql.(*Conn).executeQuery
      20ms  0.87% 63.76%      330ms 14.41%  internal/poll.(*FD).Read
(pprof)
(Profile tabs: local results / k8s staging)
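
These profiles were presumably exposed via the temporary "http.go: enable profiling" change that is reverted further down. The usual way to expose such profiles in a Go service is the net/http/pprof package; a minimal sketch (not the actual Kask patch):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a separate, non-public port, so a CPU
	// profile can be captured with e.g.:
	//   go tool pprof http://127.0.0.1:6060/debug/pprof/profile
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}

The SVG graphs linked above are presumably pprof's svg/web rendering of the same captures, and the top20 listings its top output.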

Frustrated with how difficult it is to iterate on debugging in k8s, I attempted to replicate the latency issue using an instance running directly on restbase-dev1006: outside of a container, but connected to the same Cassandra cluster that staging (k8s) uses. It (unfortunately) does not manifest the aberrant latency distribution. I've included the results below for posterity's sake.

eevans@restbase-dev1006:~$ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -t 4 -c 100 -d3m -s multi-request-json.lua http://127.0.0.1:8080
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
Running 3m test @ http://127.0.0.1:8080
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.83ms    1.63ms  66.93ms   88.10%
    Req/Sec     8.95k   511.49    12.87k    80.88%
  Latency Distribution
     50%    2.50ms
     75%    3.35ms
     90%    4.39ms
     99%    7.52ms
  6415933 requests in 3.00m, 807.00MB read
Requests/sec:  35634.22
Transfer/sec:      4.48MB
------------------------------
10%,1546
20%,1815
30%,2040
40%,2261
50%,2502
60%,2789
70%,3131
80%,3607
90%,4388
99%,7519
duration (micros),180049744
requests,6415933
bytes,846204843
connect errors,0
read errors,0
write errors,0
status errors,0
timeout errors,0
------------------------------
eevans@restbase-dev1006:~$
Eevans updated the task description.

Since most of the wrk runs in this ticket used a mixed workload, here it is isolated by request type, for posterity's sake (hint: it changes nothing):

eevans@deploy1001:~$ KASK_URL=http://localhost:34311 ./wrk.sh
+ WRK_ARGS=--latency -d10m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=http://localhost:34311
+ dirname ./wrk.sh
+ cd .
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d10m -s multi-request-json.lua http://localhost:34311
multiplerequests: Found 1000 requests
multiplerequests: Found 1000 requests
multiplerequests: Found 1000 requests
Running 10m test @ http://localhost:34311
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.96ms   55.76ms 497.10ms   90.02%
    Req/Sec    111.47    62.13   460.00     64.93%
  Latency Distribution
     50%   48.93ms
     75%   93.01ms
     90%  110.66ms
     99%  205.83ms
  131241 requests in 10.00m, 10.01MB read
Requests/sec:    218.71
Transfer/sec:     17.09KB
------------------------------
10%,2819
20%,3143
30%,3536
40%,4713
50%,48932
60%,60668
70%,90905
80%,95905
90%,110662
99%,205833
duration (micros),600071166
requests,131241
bytes,10499280
connect errors,0
read errors,0
write errors,0
status errors,0
timeout errors,0
------------------------------
eevans@deploy1001:~$

eevans@deploy1001:~$ KASK_URL=http://localhost:34311 ./wrk.sh
+ WRK_ARGS=--latency -d10m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=http://localhost:34311
+ dirname ./wrk.sh
+ cd .
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d10m -s multi-request-json.lua http://localhost:34311
multiplerequests: Found 100000 requests
multiplerequests: Found 100000 requests
multiplerequests: Found 100000 requests
Running 10m test @ http://localhost:34311
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    50.22ms   51.49ms 404.52ms   89.63%
    Req/Sec    126.74    62.07   373.00     60.20%
  Latency Distribution
     50%   49.97ms
     75%   90.35ms
     90%  102.83ms
     99%  203.35ms
  148969 requests in 10.00m, 18.75MB read
Requests/sec:    248.24
Transfer/sec:     32.00KB
------------------------------
10%,2889
20%,3209
30%,3584
40%,4487
50%,49970
60%,54377
70%,84301
80%,92389
90%,102833
99%,203354
duration (micros),600095264
requests,148969
bytes,19663908
connect errors,0
read errors,0
write errors,0
status errors,0
timeout errors,0
------------------------------
eevans@deploy1001:~$

Change 530621 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Revert "(Temporarily) disable endpoint TLS"

https://gerrit.wikimedia.org/r/530621

Change 530622 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] Revert "http.go: enable profiling"

https://gerrit.wikimedia.org/r/530622

Change 530621 merged by jenkins-bot:
[mediawiki/services/kask@master] Revert "(Temporarily) disable endpoint TLS"

https://gerrit.wikimedia.org/r/530621

Change 530622 merged by jenkins-bot:
[mediawiki/services/kask@master] Revert "http.go: enable profiling"

https://gerrit.wikimedia.org/r/530622

Change 530627 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] Revert "Deploy 2019-08-14-210839-production Docker image"

https://gerrit.wikimedia.org/r/530627

Change 530628 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] Revert "sessionstore: (Temporarily )use HTTP for liveness"

https://gerrit.wikimedia.org/r/530628

Change 530627 merged by Eevans:
[operations/deployment-charts@master] Revert "Deploy 2019-08-14-210839-production Docker image"

https://gerrit.wikimedia.org/r/530627

Change 530628 merged by Eevans:
[operations/deployment-charts@master] Revert "sessionstore: (Temporarily )use HTTP for liveness"

https://gerrit.wikimedia.org/r/530628

WDoranWMF added a subscriber: WDoranWMF.

Ok, thanks for digging into this @Eevans. I'll raise this at the next SRE sync meeting and explain that we need to hand this over to them to schedule further investigation.

@WDoranWMF we will get to this as soon as our resources allow it.

@Eevans I would ask you if you tried to do the following:

  • measure the latency *at the source*, that is, as seen by sessionstore itself
  • measure the latency at the same time *at the destination*, that is, as seen by your benchmarking tool

If there is some significant difference, that would probably give us a hint as to where the problem lies.
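
For the source-side measurement, the usual pattern is a request-duration histogram exported at the service's /metrics endpoint (which Kask has). A minimal sketch using the Prometheus Go client, as an illustration rather than Kask's actual instrumentation:

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of server-observed request durations, labeled by method.
var reqDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency as measured by the service itself.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 12), // 1ms .. ~4s
	},
	[]string{"method"},
)

// instrument wraps a handler and records how long each request takes.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		reqDuration.WithLabelValues(r.Method).Observe(time.Since(start).Seconds())
	})
}

func main() {
	prometheus.MustRegister(reqDuration)
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	mux.Handle("/", instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // placeholder handler
	})))
	log.Fatal(http.ListenAndServe(":8080", mux))
}

Comparing the quantiles of such a histogram with what wrk reports client-side would show whether the extra tens of milliseconds are spent inside the service or somewhere on the way to it.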

@WDoranWMF we will get to this as soon as our resources allow it.

@Eevans I would ask you if you tried to do the following:

  • measure the latency *at the source*, that is, as seen by sessionstore itself
  • measure the latency at the same time *at the destination*, that is, as seen by your benchmarking tool

All of the testing done in this ticket (w/ metrics of the latter form) has been focused on staging, where the dashboards seem to be broken, but if you scrape the /metrics endpoint the results are consistent. The same is true of production, where the dashboards do work.

I think I've fixed it (at least temporarily, in staging). Have a look at https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=sessionstore&from=1567432126930&to=1567435225421 and let me know how those numbers look (spoiler: 99.9% of requests below 5ms). This was related to the Linux scheduler limiting the CPU for the pods; that is, we had limits that were way too low. We must revisit the process with which we come up with those numbers; it clearly failed us.

Change 533922 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] sessionstore: Bump limits and requests

https://gerrit.wikimedia.org/r/533922

As an FYI, the 3 different stanzas of data in https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=sessionstore&from=1567432126930&to=1567435225421 are 3 distinct benchmark runs, with CPU limits of 150m, 1500m, and 2500m respectively. Some observations off the top of my head, in no particular order:

  • 5k requests is not a number I'd expect the staging cluster to have to serve. We are clearly lacking a load testing component in our entire infrastructure (pretty much everywhere), but I can't say the staging cluster was designed with that in mind. That being said, tests were reportedly also run against production, so it's clearly not a capacity issue.
  • The 2nd benchmark run has a total of user+system+throttled that is higher than the total of user+system in the third case (the only case where no throttling existed). It also has a limit that is higher than the total of the third case. Numbers-wise, there should not be any throttling in case #2 either, but that does not seem to be true.
  • The first case (where there's heavy throttling) has a total CPU stat of almost 2x that of the best-behaving case. Not sure why yet.
  • There is a service endpoint for the staging cluster now. It's staging.svc.eqiad.wmnet, so exporting KASK_URL=https://staging.svc.eqiad.wmnet:8081 uses that cluster.
  • Our method of deriving requests/limits for the pod seems to have failed us. We need to figure out why and amend it.

As an FYI, the 3 different stanzas of data in https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=sessionstore&from=1567432126930&to=1567435225421 are 3 distinct benchmark runs, with CPU limits of 150m, 1500m, and 2500m respectively. Some observations off the top of my head, in no particular order:

Wow.

image.png (356×451 px, 23 KB)
Nuff said

Regarding those limits, is this documentation relevant?

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container

  • 5k requests is not a number I'd expect the staging cluster to have to serve. We are clearly lacking a load testing component in our entire infrastructure (pretty much everywhere), but I can't say the staging cluster was designed with that in mind. That being said, tests were reportedly also run against production, so it's clearly not a capacity issue.
  • The 2nd benchmark run has a total of user+system+throttled that is higher than the total of user+system in the third case (the only case where no throttling existed). It also has a limit that is higher than the total of the third case. Numbers-wise, there should not be any throttling in case #2 either, but that does not seem to be true.
  • The first case (where there's heavy throttling) has a total CPU stat of almost 2x that of the best-behaving case. Not sure why yet.
  • There is a service endpoint for the staging cluster now. It's staging.svc.eqiad.wmnet, so exporting KASK_URL=https://staging.svc.eqiad.wmnet:8081 uses that cluster.

Oh hell, this right here made my week. Thank you! :)

  • Our method of deriving requests/limits for the pod seems to have failed us. We need to figure out why and amend it.

Change 533922 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] sessionstore: Bump limits and requests

https://gerrit.wikimedia.org/r/533922

@akosiaris are we OK to merge this? Happy to do so and deploy, just making sure it's there.

As an FYI, the 3 different stanzas of data in https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&var-dc=eqiad%20prometheus%2Fk8s-staging&var-service=sessionstore&from=1567432126930&to=1567435225421 are 3 distinct benchmark runs, with CPU limits of 150m, 1500m, and 2500m respectively. Some observations off the top of my head, in no particular order:

Wow.

image.png (356×451 px, 23 KB)
Nuff said

Regarding those limits, is this documentation relevant?

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container

Yes, exactly that. For reference the numbers we had are documented in T220401#5128786. Need to figure out what I did wrong there...

  • 5k requests is not a number I'd expect the staging cluster to have to serve. We are clearly lacking a load testing component in our entire infrastructure (pretty much everywhere), but I can't say the staging cluster was designed with that in mind. That being said, tests were reportedly also run against production, so it's clearly not a capacity issue.
  • The 2nd benchmark run has a total of user+system+throttled that is higher than the total of user+system in the third case (the only case where no throttling existed). It also has a limit that is higher than the total of the third case. Numbers-wise, there should not be any throttling in case #2 either, but that does not seem to be true.
  • The first case (where there's heavy throttling) has a total CPU stat of almost 2x that of the best-behaving case. Not sure why yet.
  • There is a service endpoint for the staging cluster now. It's staging.svc.eqiad.wmnet, so exporting KASK_URL=https://staging.svc.eqiad.wmnet:8081 uses that cluster.

Oh hell, this right here made my week. Thank you! :)

  • Our method of deriving requests/limits for the pod seems to have failed us. We need to figure out why and amend it.

Change 533922 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] sessionstore: Bump limits and requests

https://gerrit.wikimedia.org/r/533922

@akosiaris are we OK to merge this? Happy to do so and deploy, just making sure it's there.

Yes, feel free. I can deploy it as well/be around when you do.

Change 533922 merged by Eevans:
[operations/deployment-charts@master] sessionstore: Bump limits and requests

https://gerrit.wikimedia.org/r/533922

Change 534515 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] staging/sessionstore: bump memory back to 100Mi in response to errors

https://gerrit.wikimedia.org/r/534515

Change 534515 merged by Eevans:
[operations/deployment-charts@master] staging/sessionstore: bump memory back to 100Mi in response to errors

https://gerrit.wikimedia.org/r/534515

[ ... ]

@akosiaris are we OK to merge this? Happy to do so and deploy, just making sure it's there.

Yes, feel free. I can deploy it as well/be around when you do.

I rolled the dice this afternoon, and lost.

When the first deploy to staging failed, the output of kubectl get events seemed to indicate that memory had to be >= 100Mi:

48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-j6vmp" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-ql5fk" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-s69v4" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-gqc7d" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-jt8fj" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-qgjpk" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-jdpsr" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-87z2c" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
48m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-sn9t9" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
37m Warning FailedCreate ReplicaSet (combined from similar events): Error creating: pods "kask-staging-c796559cd-spd5v" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
26m Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-55t68" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]
9m52s Warning FailedCreate ReplicaSet Error creating: pods "kask-staging-c796559cd-ktnh5" is forbidden: [minimum memory usage per Container is 100Mi, but request is 50Mi., minimum memory usage per Pod is 100Mi, but request is 52428800.]

So I tried bumping that back up to 100Mi, but I still wasn't able to get a successful run of helmfile apply, and it's not clear to me why.

When the first deploy to staging failed, the output of kubectl get events seemed to indicate that memory had to be >= 100Mi:

Yup, that's a limit range we have there to avoid too overpowered/underpowered pods. I messed up https://gerrit.wikimedia.org/r/#/c/operations/deployment-charts/+/533922/ it seems.

So I tried bumping that back up to 100Mi, but I still wasn't able to get a successful run of helmfile apply, and it's not clear to me why.

I wasn't either. I ended up doing a helmfile destroy && helmfile apply to fix it, but we will need to have a better look at why this happened, since even manually fixing the underlying cause did not make the release move from FAILED to DEPLOYED

Change 534613 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] sessionstore: Bump again memory limits/requests

https://gerrit.wikimedia.org/r/534613

Change 534613 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] sessionstore: Bump again memory limits/requests

https://gerrit.wikimedia.org/r/534613

https://gerrit.wikimedia.org/r/534613 merged and deployed fine.

I wasn't either. I ended up doing a helmfile destroy && helmfile apply to fix it, but we will need to have a better look at why this happened, since even manually fixing the underlying cause did not make the release move from FAILED to DEPLOYED

helmfile sync fixed it though after all.

https://gerrit.wikimedia.org/r/534613 merged and deployed fine.

I wasn't either. I ended up doing a helmfile destroy && helmfile apply to fix it, but we will need to have a better look at why this happened, since even manually fixing the underlying cause did not make the release move from FAILED to DEPLOYED

helmfile sync fixed it though after all.

@akosiaris was this applied to production as well? The output of kubectl get ... would suggest it was, but the performance says otherwise

eevans@deploy1001:~$ KASK_URL=https://sessionstore.svc.eqiad.wmnet:8081 ./wrk.sh
+ WRK_ARGS=--latency -d10m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=https://sessionstore.svc.eqiad.wmnet:8081
+ dirname ./wrk.sh
+ cd .
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d10m -s multi-request-json.lua https://sessionstore.svc.eqiad.wmnet:8081
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
Running 10m test @ https://sessionstore.svc.eqiad.wmnet:8081
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.96ms   18.05ms 270.12ms   64.20%
    Req/Sec    251.88    37.75   360.00     59.98%
  Latency Distribution
     50%   22.14ms
     75%   37.76ms
     90%   37.95ms
     99%   38.88ms
  301034 requests in 10.00m, 50.53MB read
  Non-2xx or 3xx responses: 66355
Requests/sec:    501.67
Transfer/sec:     86.23KB
------------------------------
10%,1703
20%,1815
30%,1920
40%,2057
50%,22144
60%,37591
70%,37706
80%,37811
90%,37946
99%,38882
duration (micros),600062009
requests,301034
bytes,52984874
connect errors,0
read errors,0
write errors,0
status errors,66355
timeout errors,0
------------------------------
eevans@deploy1001:~$ KASK_URL=https://sessionstore.svc.codfw.wmnet:8081 ./wrk.sh
+ WRK_ARGS=--latency -d10m
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so
+ KASK_URL=https://sessionstore.svc.codfw.wmnet:8081
+ dirname ./wrk.sh
+ cd .
+ [ ! -f multi-request-json.lua ]
+ [ ! -f requests.json ]
+ LUA_CPATH=/usr/lib/x86_64-linux-gnu/lua/5.1/?.so wrk --latency -d10m -s multi-request-json.lua https://sessionstore.svc.codfw.wmnet:8081
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
multiplerequests: Found 101000 requests
Running 10m test @ https://sessionstore.svc.codfw.wmnet:8081
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    56.33ms   18.03ms 301.04ms   55.91%
    Req/Sec     88.94    12.77   131.00     79.11%
  Latency Distribution
     50%   73.43ms
     75%   74.26ms
     90%   74.50ms
     99%   74.99ms
  106474 requests in 10.00m, 13.35MB read
Requests/sec:    177.44
Transfer/sec:     22.78KB
------------------------------
10%,38066
20%,38210
30%,38338
40%,38497
50%,73429
60%,74046
70%,74193
80%,74327
90%,74495
99%,74989
duration (micros),600072527
requests,106474
bytes,14000072
connect errors,0
read errors,0
write errors,0
status errors,0
timeout errors,0
------------------------------
eevans@deploy1001:~$

https://gerrit.wikimedia.org/r/534613 merged and deployed fine.

I wasn't either. I ended up doing a helmfile destroy && helmfile apply to fix it, but we will need to have a better look at why this happened, since even manually fixing the underlying cause did not make the release move from FAILED to DEPLOYED

helmfile sync fixed it though after all.

@akosiaris was this applied to production as well? The output of kubectl get ... would suggest it was, but the performance says otherwise

Yes it was. And it is also evident in https://grafana.wikimedia.org/d/000001590/sessionstore?panelId=28&fullscreen&orgId=1&from=1567768437726&to=1567769419849&var-dc=eqiad%20prometheus%2Fk8s&var-service=sessionstore . Note how the pods are not really throttled (this is a sum over 4 pods, so per-pod numbers are ~1ms, which is probably an artifact of CFS) and how the cumulative CPU usage (~0.25 CPUs) is well below the 10 CPUs that are allocated to this. Throttling isn't to blame for this.


The p99 looks suspiciously close to the cross-DC latency; is there any way it is related?


The p99 looks suspiciously close to the cross-DC latency; is there any way it is related?

I'm at a conference this week and slow to respond, apologies.

These tests only do reads and writes, which are strictly LOCAL_QUORUM; there should be no (synchronous) cross-DC comms. This would be pretty easy to suss out with certainty from Grafana though.

Note also how low the throughput is in both cases (500 req/s!).

I'll retest and provide some more complete data (time-boxed dashboards, etc) when I get the chance.

Change 537539 had a related patch set uploaded (by Eevans; owner: Eevans):
[mediawiki/services/kask@master] (Actually )be data-center aware.

https://gerrit.wikimedia.org/r/537539

In T229697#5471288, @akosiaris wrote:
[ ... ]

The p99 looks suspiciously close to the cross-DC latency; is there any way it is related?

Yes, there is.

TL;DR see https://gerrit.wikimedia.org/r/537539

This was really dumb, but the setup of the CQL driver didn't include a DC-aware host selector. What is messed up about all this is that we saw this sort of bi-modal latency pattern in local testing (against a single, local Cassandra node) at one point, before something (unknown) changed. Afterward, the problem was confined to staging and production, and was most definitely the result of resource limits (fixed by @akosiaris in r/533922). And while r/533922 definitively resolved the issue for staging, it exposed the lack of DC awareness that r/537539 should fix.

I have no idea what was causing the latency when it would manifest locally. A part of me is dying to know more than ever, and another part of me is afraid my head would explode. We can say though that it was not the resource limitations (at least, not the same limitations), and it was not the lack of a DC-aware selector.

The resource limits were most definitely to blame for the bi-modal latency once it landed in k8s, as the staging environment connects to only a single (local) DC.

I've been bitten by extreme coincidence before when troubleshooting things, but man, was it easy to assume that I was looking at the same problem. In hindsight, the dashboards should have made it obvious what was happening. In addition to the (absence of) throttling, the latency distribution became too perfectly bi-modal, with one mode too closely matching inter-DC latency (as @akosiaris pointed out), and the Cassandra dashboards showed traffic split equally between the two data-centers (the driver defaults to round-robin).
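
The gocql pattern for pinning the driver to the local data-center looks roughly like the following. This is a sketch of the general approach, not the exact r/537539 patch; the host names, keyspace, and "eqiad" local_dc value are placeholders:

package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("cassandra-a.example.org", "cassandra-b.example.org") // placeholder contact points
	cluster.Keyspace = "sessions" // hypothetical keyspace name

	// LOCAL_QUORUM keeps reads/writes within the coordinator's DC...
	cluster.Consistency = gocql.LocalQuorum

	// ...but without a DC-aware host selection policy the driver round-robins
	// coordinators across *all* DCs, which is what produced the bi-modal latency here.
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
		gocql.DCAwareRoundRobinPolicy("eqiad"), // local_dc from service config
	)

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}

With LOCAL_QUORUM alone the queries still succeed, but roughly half of them pick a coordinator in the remote DC and pay a cross-DC round trip, which lines up with the bi-modal distributions above.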

Change 537552 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] sessionstore: configure cassandra for local_dc

https://gerrit.wikimedia.org/r/537552

Change 537539 merged by jenkins-bot:
[mediawiki/services/kask@master] (Actually )be data-center aware.

https://gerrit.wikimedia.org/r/537539

Change 537552 merged by Eevans:
[operations/deployment-charts@master] sessionstore: configure cassandra for local_dc

https://gerrit.wikimedia.org/r/537552

Change 537747 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/deployment-charts@master] sessionstore: Upgrade image to 2019-09-18-090156-production

https://gerrit.wikimedia.org/r/537747

Change 537747 merged by Eevans:
[operations/deployment-charts@master] sessionstore: Upgrade image to 2019-09-18-090156-production

https://gerrit.wikimedia.org/r/537747

And here is what latency looks like after r/537539 has been deployed (see also: direct link to Grafana dashboard).

image.png (853×1 px, 68 KB)
before/after latency distribution

Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.