
EventGate service runner worker occasionally killed, usually during higher load
Closed, Resolved · Public

Description

It seems that the service-runner worker sometimes misses its heartbeat to the master, which causes the master to kill it. I can reproduce this on staging fairly consistently if I send it a lot of events.

We tried to roll out mediawiki.api-request to group1 wikis today, but the frequent worker restarts caused errors and also increased operational latency in k8s.

Event Timeline

Restricted Application added a subscriber: Aklapper.

There is a 5 second delay between these two syscalls:

21:05:00.233006 sendmsg(12, {msg_name={sa_family=AF_INET, sin_port=htons(9125), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=16, msg_iov=[{iov_base="eventgate-analytics.rdkafka.prod"..., iov_len=1427}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1427
21:05:05.757994 futex(0x7fb942539164, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fb942539160, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1

These 2 heartbeats are more than 10s apart:

21:04:48.628548 write(3, "{\"type\":\"heartbeat\"}\n", 21) = 21
21:04:59.302101 write(3, "{\"type\":\"heartbeat\"}\n", 21) = 21

I think those are actually the reason for the kill. Even when we do not receive heartbeats, we still attempt a graceful shutdown, which can take some time, so the actual KILL only comes seconds later.

Hm, the sendmsg that happened right before the 5-second delay was to port 9125, which is statsd_exporter.

Hm, if those two heartbeats are indeed the reason for the kill, then perhaps increasing the worker timeout would be ok. The worker was pretty busy working during that time, and not blocked at all.
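
For illustration, a minimal sketch of what raising that timeout could look like in the service-runner config, assuming the worker_heartbeat_timeout option from service-runner's documentation (the value below is hypothetical):

# service-runner config.yaml (sketch; exact key name and placement may differ per version)
worker_heartbeat_timeout: 15000  # ms; raised from the ~7500 ms default so a busy (but not blocked) worker is not ordered to shut down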

I also might experiment with different batch settings. Perhaps the produced batches for the hasty producer are too large.
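
For reference, a hedged sketch of the kind of producer batch tuning meant here. The property names are standard librdkafka producer settings; exactly where they live in the EventGate config is an assumption, and the values are illustrative only:

kafka:
  conf:
    # Flush accumulated messages sooner so individual produce batches stay small.
    queue.buffering.max.ms: 10
    # Cap how many messages are packed into a single produce batch.
    batch.num.messages: 1000
    # Bound the in-memory producer queue so backpressure shows up earlier.
    queue.buffering.max.messages: 100000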

FYI, note the pod restarts below. Seems like the worker isn't ready within the window of the kubelet liveness probes (3x 10s => 30s), so eventually the pod is killed. Does it really take that long to initialize a new worker? We can definitely tune those numbers, but 30s for initializing a worker sounds like a lot.

kubectl -n eventgate-analytics get pods  
NAME                                              READY     STATUS    RESTARTS   AGE
eventgate-analytics-production-664b79b5b8-5qjv9   2/2       Running   4          2d
eventgate-analytics-production-664b79b5b8-5zshd   2/2       Running   9          2d
eventgate-analytics-production-664b79b5b8-68zs2   2/2       Running   7          2d
eventgate-analytics-production-664b79b5b8-957w5   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-95hp4   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-9db9x   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-9zqd8   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-b5ftb   2/2       Running   6          2d
eventgate-analytics-production-664b79b5b8-dz2ch   2/2       Running   3          2d
eventgate-analytics-production-664b79b5b8-fj46j   2/2       Running   6          2d
eventgate-analytics-production-664b79b5b8-g2g2n   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-gm5m5   2/2       Running   9          2d
eventgate-analytics-production-664b79b5b8-lbvbd   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-m642m   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-mcqq4   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-n2f8m   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-r7fvk   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-wjpww   2/2       Running   12         2d
eventgate-analytics-production-664b79b5b8-wqc2l   2/2       Running   5          2d
eventgate-analytics-production-664b79b5b8-z7zdq   2/2       Running   5          2d
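
For reference, the kubelet probe parameters described above (3 failed probes at a 10s period => ~30s before the container is restarted) would look roughly like this in the pod spec; the path and port here are placeholders, not the chart's actual values:

livenessProbe:
  httpGet:
    path: /_info        # placeholder health endpoint
    port: 8192          # placeholder service port
  periodSeconds: 10     # probe every 10s
  failureThreshold: 3   # 3 consecutive failures => restart (~30s)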
akosiaris triaged this task as Medium priority. Apr 12 2019, 7:08 AM

@akosiaris no, spinning up a new worker takes no time; the problem here is actually killing the old worker.

The heartbeat limit currently is 7.5 seconds, after which we do not just SIGKILL the worker: we optimistically try to shut it down properly, and if that does not happen we SIGKILL it after 1 minute. For 'normal' operation with many workers this is the right approach - the master stops dispatching requests to the worker immediately after ordering it to shut down gracefully, and with less incoming pressure the worker might recover a bit and get a chance to finish up requests and do its pre-shutdown housekeeping. With 1 worker the approach seems to be incorrect - we are leaving a dangling master there.

Having said that the shutdown approach is incorrect for the 1-worker situation, I still think that the proper fix for this would be increasing the number of workers to at least 2.
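
For illustration, a minimal sketch of the proposed change, assuming service-runner's num_workers config option (the surrounding config is omitted):

# service-runner config.yaml (sketch)
num_workers: 2  # with 2 workers the master can keep serving from one while the other is recycled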

> Having said that the shutdown approach is incorrect for the 1-worker situation, I still think that the proper fix for this would be increasing the number of workers to at least 2.

+1 to this, with 2 workers the master can coordinate more easily.

2 workers might be an ok number, but I'd still like to figure out why it is dying in the first place. I can reproduce on staging under load pretty consistently, but that doesn't explain why even the staging worker would sometimes be killed when there is no load at all.

> @akosiaris no, spinning up a new worker takes no time; the problem here is actually killing the old worker.

> The heartbeat limit currently is 7.5 seconds, after which we do not just SIGKILL the worker: we optimistically try to shut it down properly, and if that does not happen we SIGKILL it after 1 minute. For 'normal' operation with many workers this is the right approach - the master stops dispatching requests to the worker immediately after ordering it to shut down gracefully, and with less incoming pressure the worker might recover a bit and get a chance to finish up requests and do its pre-shutdown housekeeping.

So, why would the worker not send the heartbeat? My guess is a series of CPU-bound tasks that end up occupying the event queue and take > 7.5 seconds to serve, leaving no time for the worker to send the heartbeat. Which begs the question of what exactly these CPU-bound tasks are and how many of them there are. SIGKILL after 1m sounds excessive btw for 6k req/s; perhaps we should lower it in this specific case.

> With 1 worker the approach seems to be incorrect - we are leaving a dangling master there.

From what I gather, given enough of the aforementioned tasks, any number of workers could end up in this exact state, leaving the master dangling.

> Having said that the shutdown approach is incorrect for the 1-worker situation, I still think that the proper fix for this would be increasing the number of workers to at least 2.

Given the above, and unless I am misunderstanding something, this sounds just like a capacity problem. The only solution we have ever had for that is to throw more capacity (in the form of more workers and more machines) at the problem. We can do that here as well without changing the number of workers per pod, by just increasing the number of pods, which has an extra bonus right now: we won't have to go through the process of sizing the pod again (which we would have to if we bumped the number of workers). The only caveat is if, in my hypothesis above, queue_size == 1, i.e. a single CPU-bound request taking 7.5s to serve. That is a pathological case, I'd say, and should be addressed in a different way.
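
Illustratively, scaling at the pod level rather than the worker level just means bumping the Deployment's replica count; the number below is hypothetical, and how the chart actually exposes it is not shown here:

# Kubernetes Deployment (sketch)
spec:
  replicas: 20  # hypothetical pod count, each pod keeping its single service-runner worker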

Change 504016 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - Tweak kafka producer poll and batch settings

https://gerrit.wikimedia.org/r/504016

> unless I am misunderstanding something, this sounds just like a capacity problem.

I can reproduce under load, but we also see missed heartbeats under little to no load. This even happens to the staging pod when no one is messing with it:

https://logstash.wikimedia.org/goto/19cdf6fea0ad77d51ed9e19343d35ef2
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.15/eventgate?id=AWofweJqsODVk2E3c65f&_g=h@1251ff0

Change 504016 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - Tweak kafka producer poll and batch settings

https://gerrit.wikimedia.org/r/504016

More data:
https://gist.github.com/ottomata/608d19a7cdb4385170564521e23c4316

This time I am also stracing with -f to show child thread syscalls. I am doing an ab load test during the time of these logs, and you can see that things get blocked up starting at around 19:08:14. There's a heartbeat at 19:08:13, but then not another one until 19:08:26. There is a lot of stuff going on in between here.

One thing I've noticed is that node-rdkafka starts out responding well and sends produce requests in largeish (~100-200) message batches, but quickly degrades and only sends one or two at a time. I don't see other rdkafka queues filling up, at least not according to our rdkafka statsd collector. I could be wrong, but this indicates to me that the slowdown is somewhere in the node app itself, not in rdkafka. I can see this in the worker logs as well: during the ab run the app slows and starts accepting new incoming connections only once every second or two, instead of many per second. It does periodically break free from this behavior and start accepting more requests per second, but then slows again. Eventually the ab run dies because the service times out.

[@deploy1001:/home/otto] $ cat eg-staging-logs-eventgate-analytics-staging-5ff6445bc5-l7mcn | grep -E 'incoming request' | jq .time | tr '."' ' ' | awk '{print $1}' | uniq -c
      1 2019-04-15T19:05:47
      1 2019-04-15T19:05:55
      1 2019-04-15T19:06:05
      1 2019-04-15T19:06:15
      1 2019-04-15T19:06:25
      1 2019-04-15T19:06:35
      1 2019-04-15T19:06:45
      1 2019-04-15T19:06:55
      1 2019-04-15T19:07:05
      1 2019-04-15T19:07:14
    155 2019-04-15T19:07:23
     85 2019-04-15T19:07:24
     94 2019-04-15T19:07:25
    113 2019-04-15T19:07:26
    185 2019-04-15T19:07:27
     39 2019-04-15T19:07:28
      2 2019-04-15T19:07:29
      8 2019-04-15T19:07:30
      5 2019-04-15T19:07:31
      6 2019-04-15T19:07:32
      9 2019-04-15T19:07:33
     14 2019-04-15T19:07:34
      5 2019-04-15T19:07:35
      6 2019-04-15T19:07:36
      7 2019-04-15T19:07:37
     93 2019-04-15T19:07:38
     69 2019-04-15T19:07:39
      1 2019-04-15T19:07:45
      2 2019-04-15T19:07:56
      1 2019-04-15T19:07:58
      1 2019-04-15T19:07:59
    108 2019-04-15T19:08:04
     95 2019-04-15T19:08:05
    170 2019-04-15T19:08:06
    108 2019-04-15T19:08:07
      1 2019-04-15T19:08:09
      1 2019-04-15T19:08:10
      1 2019-04-15T19:08:11
      1 2019-04-15T19:08:13
      1 2019-04-15T19:08:14
      1 2019-04-15T19:08:38

I have more data but am not yet much closer to understanding what is happening here. I also cannot reproduce this locally at all; the worker never misses heartbeats under load or otherwise in minikube.

Change 504346 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics Add wmfdebug and profile options for debugging

https://gerrit.wikimedia.org/r/504346

Change 504346 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics Add wmfdebug and profile options for debugging

https://gerrit.wikimedia.org/r/504346

Change 504418 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - debug_mode_enabled turns on Node inspector, profiler and wmfdebug sidecar

https://gerrit.wikimedia.org/r/504418

Change 504418 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - debug_mode_enabled turns on Node inspector, profiler and wmfdebug sidecar

https://gerrit.wikimedia.org/r/504418

Change 504419 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - also expose service-runner worker inspector port

https://gerrit.wikimedia.org/r/504419

Change 504419 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - also expose service-runner worker inspector port

https://gerrit.wikimedia.org/r/504419

I've been doing a lot to get more data, including enabling node profiling and connecting to the node inspector. I didn't learn much, except it confirmed that I don't think Kafka is the CPU culprit. I did see a good number of cycles being spent generating the x-request-id header using cassandra-uuid.TimeUuid. This won't be needed if we can ever get the x-request-id header set by varnish (T201409).

But I didn't see anything particularly abnormal, just a lot of time being spent by express parsing the JSON request bodies, which I guess makes sense.

Just to try, I increased k8s resource requests/limits to:

requests:
  cpu: 1000m
  memory: 100Mi
limits:
  cpu: 2000m
  memory: 200Mi

(from https://github.com/wikimedia/operations-deployment-charts/blob/5a678d3aaf428697ffe09634fee8b6be9d61a194/charts/eventgate-analytics/values.yaml#L32-L37 )

This allowed me to push ~1800 events/second with no missed heartbeats and no worker deaths! It seems clear that the CPU limits were too strict. Is it possible that the heartbeat gets missed while k8s attempts to increase CPU resources from the initial requested amount up to the limit?

@akosiaris are the new limits that I listed here ok?

> Just to try, I increased k8s resource requests/limits to: [...] This allowed me to push ~1800 events/second with no missed heartbeats and no worker deaths! It seems clear that the CPU limits were too strict.
>
> @akosiaris are the new limits that I listed here ok?

Nice find!

Bumping the limits stanza is fine, as it is actually a kubernetes-enforced (well, kernel-enforced, but anyway) limit and was by all accounts what was causing the issue. I am fine with both cpu: 2000m and memory: 200Mi for the limits stanza.

The requests stanza, however, is not a limit. It's a hint to the kubernetes scheduler about the expected size of a pod, so that the scheduler can make informed decisions about pod placement (effectively avoiding overcommitting a host). Increasing it above the actual expected usage will only cause a node to be underutilized. Without that hint (or with a bad hint that does not reflect reality), the scheduler will misbehave.

Now, there's some debate as to whether this should be the minimum numbers or the median/mean numbers (or even the maximum, but that approach decreases utilization in the interest of reliability and we aren't there yet). We've been doing the former up to now, as it was way easier given that we had no real traffic reaching the pods when coming up with those numbers, only artificially generated benchmarks.

But now we have numbers. Looking at grafana with irate(container_cpu_usage_seconds_total{job="k8s-node-cadvisor", namespace="$service", container_name=~"$service.*", pod_name=~"$service.*"}[2m]), it looks like CPU usage averages around 200m, so 1000m for requests is too much, although the original 100m was too low. Memory-wise, using container_memory_usage_bytes{namespace="eventgate-analytics", job="k8s-node-cadvisor", container_name="eventgate-analytics-staging"}, it does look like both the original 50Mi AND the newly proposed 100Mi are low. I would bump it to 150Mi.
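
For reference, the requests-vs-limits split being described, annotated with what each stanza actually does; the numbers follow this comment, while the finally adopted values appear further down in the task:

resources:
  requests:         # scheduler hint only, used for pod placement / avoiding overcommit
    cpu: 200m       # roughly the observed average usage
    memory: 150Mi
  limits:           # enforced at runtime: CPU via CFS throttling, memory via OOM kill
    cpu: 2000m
    memory: 200Mi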

Change 504547 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - bump resources and add comments about debug mode

https://gerrit.wikimedia.org/r/504547

Change 504547 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - bump resources and add comments about debug mode

https://gerrit.wikimedia.org/r/504547

After talking with Alex, we will be using:

requests:
  cpu: 200m
  memory: 150Mi
limits:
  cpu: 2000m
  memory: 300Mi

With these settings and ab -c1000, I can get around 2058.21 requests/second (mean) for hasty and 1705.77 requests/second (mean) for guaranteed from the staging pod :)

Ok prod updated with new limits!
I ran two ab runs from two different machines with -c1000. Both ab processes died due to socket timeouts / too many open sockets, but I was able to push prod to 26K msgs/second with CPU max ~1.2s. The current max expected usage of eventgate-analytics is around 10K msgs/second, so I think we are good to go!

@Pchelolo I'd be ok with trying group1 wikis again this afternoon if you have time!

I believe this can be closed now that we are easily sustaining almost 7k events per second in prod.