
EventGate service runner worker occasionally killed, usually during higher load
Closed, Resolved · Public

Description

It seems that the service-runner worker sometimes misses its heartbeat to the master, which causes the master to kill it. I can reproduce this on staging fairly consistently if I send it a lot of events.

We tried to roll out mediawiki.api-request to group1 wikis today, but the frequent worker restarts caused errors and also increased operational latency in k8s.

Event Timeline

Restricted Application added a subscriber: Aklapper.

There is a 5 second delay between these two syscalls:

21:05:00.233006 sendmsg(12, {msg_name={sa_family=AF_INET, sin_port=htons(9125), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=16, msg_iov=[{iov_base="eventgate-analytics.rdkafka.prod"..., iov_len=1427}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1427
21:05:05.757994 futex(0x7fb942539164, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fb942539160, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1

These 2 heartbeats are more than 10s apart:

21:04:48.628548 write(3, "{\"type\":\"heartbeat\"}\n", 21) = 21
21:04:59.302101 write(3, "{\"type\":\"heartbeat\"}\n", 21) = 21

I think those are actually the reason for the kill. Even when we do not receive heartbeats, we still attempt a graceful shutdown, which can take some time, so the actual KILL only comes seconds later.

Hm, the sendmsg that happened right before the 5-second delay was to port 9125, which is statsd_exporter.

Hm, if those two heartbeats are indeed the reason for the kill, then perhaps increasing the worker timeout would be ok. The worker was pretty busy working during that time, and not blocked at all.
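
For illustration, a minimal sketch of what raising that timeout could look like in the service-runner config, assuming the worker_heartbeat_timeout option from service-runner's documentation (the value below is hypothetical):

# service-runner config.yaml (sketch; exact key name and placement may differ per version)
worker_heartbeat_timeout: 15000  # ms; raised from the ~7500 ms default so a busy (but not blocked) worker is not ordered to shut down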

I also might experiment with different batch settings. Perhaps the produced batches for the hasty producer are too large.
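
For reference, a hedged sketch of the kind of producer batch tuning meant here. The property names are standard librdkafka producer settings; exactly where they live in the EventGate config is an assumption, and the values are illustrative only:

kafka:
  conf:
    # Flush accumulated messages sooner so individual produce batches stay small.
    queue.buffering.max.ms: 10
    # Cap how many messages are packed into a single produce batch.
    batch.num.messages: 1000
    # Bound the in-memory producer queue so backpressure shows up earlier.
    queue.buffering.max.messages: 100000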

FYI, note the pod restarts below. Seems like the worker isn't ready within the window of the kubelet liveness probes (3x 10s => 30s), so eventually the pod is killed. Does it really take that long to initialize a new worker? We can definitely tune those numbers, but 30s for initializing a worker sounds like a lot.

kubectl -n eventgate-analytics get pods  
NAME                                              READY     STATUS    RESTARTS   AGE
eventgate-analytics-production-664b79b5b8-5qjv9   2/2       Running   4          2d
eventgate-analytics-production-664b79b5b8-5zshd   2/2       Running   9          2d
eventgate-analytics-production-664b79b5b8-68zs2   2/2       Running   7          2d
eventgate-analytics-production-664b79b5b8-957w5   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-95hp4   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-9db9x   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-9zqd8   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-b5ftb   2/2       Running   6          2d
eventgate-analytics-production-664b79b5b8-dz2ch   2/2       Running   3          2d
eventgate-analytics-production-664b79b5b8-fj46j   2/2       Running   6          2d
eventgate-analytics-production-664b79b5b8-g2g2n   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-gm5m5   2/2       Running   9          2d
eventgate-analytics-production-664b79b5b8-lbvbd   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-m642m   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-mcqq4   2/2       Running   10         2d
eventgate-analytics-production-664b79b5b8-n2f8m   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-r7fvk   2/2       Running   8          2d
eventgate-analytics-production-664b79b5b8-wjpww   2/2       Running   12         2d
eventgate-analytics-production-664b79b5b8-wqc2l   2/2       Running   5          2d
eventgate-analytics-production-664b79b5b8-z7zdq   2/2       Running   5          2d
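
For reference, the kubelet probe parameters described above (3 failed probes at a 10s period => ~30s before the container is restarted) would look roughly like this in the pod spec; the path and port here are placeholders, not the chart's actual values:

livenessProbe:
  httpGet:
    path: /_info        # placeholder health endpoint
    port: 8192          # placeholder service port
  periodSeconds: 10     # probe every 10s
  failureThreshold: 3   # 3 consecutive failures => restart (~30s)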
akosiaris triaged this task as Medium priority. Apr 12 2019, 7:08 AM

@akosiaris no, spinning up a new worker takes no time; the problem here is actually killing the old worker.

The heartbeat limit currently is 7.5 seconds, after which we do not just SIGKILL the worker: we optimistically try to shut it down properly, and if that does not happen we SIGKILL it after 1 minute. For 'normal' operation with many workers this is the right approach - the master stops dispatching requests to the worker immediately after ordering it to shut down gracefully, and with less incoming pressure the worker might recover a bit and get a chance to finish up requests and do its pre-shutdown housekeeping. With 1 worker the approach seems to be incorrect - we are leaving a dangling master there.

Having said that the shutdown approach is incorrect for the 1-worker situation, I still think that the proper fix for this would be increasing the number of workers to at least 2.
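
For illustration, a minimal sketch of the proposed change, assuming service-runner's num_workers config option (the surrounding config is omitted):

# service-runner config.yaml (sketch)
num_workers: 2  # with 2 workers the master can keep serving from one while the other is recycled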

> Having said that the shutdown approach is incorrect for the 1-worker situation, I still think that the proper fix for this would be increasing the number of workers to at least 2.

+1 to this, with 2 workers the master can coordinate more easily.

2 workers might be an ok number, but I'd still like to figure out why it is dying in the first place. I can reproduce on staging under load pretty consistently, but that doesn't explain why even the staging worker would sometimes be killed when there is no load at all.

> @akosiaris no, spinning up a new worker takes no time; the problem here is actually killing the old worker.

> The heartbeat limit currently is 7.5 seconds, after which we do not just SIGKILL the worker: we optimistically try to shut it down properly, and if that does not happen we SIGKILL it after 1 minute. For 'normal' operation with many workers this is the right approach - the master stops dispatching requests to the worker immediately after ordering it to shut down gracefully, and with less incoming pressure the worker might recover a bit and get a chance to finish up requests and do its pre-shutdown housekeeping.

So, why would the worker not send the heartbeat? My guess is a series of CPU-bound tasks that end up occupying the event queue and take > 7.5 seconds to serve, leaving no time for the worker to send the heartbeat. Which begs the question of what exactly these CPU-bound tasks are and how many of them there are. SIGKILL after 1m sounds excessive btw for 6k req/s; perhaps we should lower it in this specific case.

> With 1 worker the approach seems to be incorrect - we are leaving a dangling master there.

From what I gather, given enough of the aforementioned tasks, any number of workers could end up in this exact state, leaving the master dangling.

> Having said that the shutdown approach is incorrect for the 1-worker situation, I still think that the proper fix for this would be increasing the number of workers to at least 2.

Given the above, and unless I am misunderstanding something, this sounds just like a capacity problem. The only solution we have ever had for that is to throw more capacity (in the form of more workers and more machines) at the problem. We can do that here as well without changing the number of workers per pod, by just increasing the number of pods, which has an extra bonus right now: we won't have to go through the process of sizing the pod again (which we would have to if we bumped the number of workers). The only caveat is if, in my hypothesis above, queue_size == 1, i.e. a single CPU-bound request taking 7.5s to serve. That is a pathological case, I'd say, and should be addressed in a different way.
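
Illustratively, scaling at the pod level rather than the worker level just means bumping the Deployment's replica count; the number below is hypothetical, and how the chart actually exposes it is not shown here:

# Kubernetes Deployment (sketch)
spec:
  replicas: 20  # hypothetical pod count, each pod keeping its single service-runner worker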

Change 504016 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - Tweak kafka producer poll and batch settings

https://gerrit.wikimedia.org/r/504016

> unless I am misunderstanding something, this sounds just like a capacity problem.

I can reproduce under load, but we also see missed heartbeats under little to no load. This even happens to the staging pod when no one is messing with it:

https://logstash.wikimedia.org/goto/19cdf6fea0ad77d51ed9e19343d35ef2
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.15/eventgate?id=AWofweJqsODVk2E3c65f&_g=h@1251ff0

Change 504016 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - Tweak kafka producer poll and batch settings

https://gerrit.wikimedia.org/r/504016

More data:
https://gist.github.com/ottomata/608d19a7cdb4385170564521e23c4316

This time I am also stracing with -f to show child thread syscalls. I am doing an ab load test during the time of these logs, and you can see that things get blocked up starting at around 19:08:14. There's a heartbeat at 19:08:13, but then not another one until 19:08:26. There is a lot of stuff going on in between here.

One thing I've noticed is that node-rdkafka starts out responding well and sends produce requests in largeish (~100-200) message batches, but quickly degrades and only sends one or two at a time. I don't see other rdkafka queues filling up, at least not according to our rdkafka statsd collector. I could be wrong, but this indicates to me that the slowdown is somewhere in the node app itself, not in rdkafka. I can see this in the worker logs as well: during the ab run the app slows and starts accepting new incoming connections only once every second or two, instead of many per second. It does periodically break free from this behavior and start accepting more requests per second, but then slows again. Eventually the ab run dies because the service times out.

[@deploy1001:/home/otto] $ cat eg-staging-logs-eventgate-analytics-staging-5ff6445bc5-l7mcn | grep -E 'incoming request' | jq .time | tr '."' ' ' | awk '{print $1}' | uniq -c
      1 2019-04-15T19:05:47
      1 2019-04-15T19:05:55
      1 2019-04-15T19:06:05
      1 2019-04-15T19:06:15
      1 2019-04-15T19:06:25
      1 2019-04-15T19:06:35
      1 2019-04-15T19:06:45
      1 2019-04-15T19:06:55
      1 2019-04-15T19:07:05
      1 2019-04-15T19:07:14
    155 2019-04-15T19:07:23
     85 2019-04-15T19:07:24
     94 2019-04-15T19:07:25
    113 2019-04-15T19:07:26
    185 2019-04-15T19:07:27
     39 2019-04-15T19:07:28
      2 2019-04-15T19:07:29
      8 2019-04-15T19:07:30
      5 2019-04-15T19:07:31
      6 2019-04-15T19:07:32
      9 2019-04-15T19:07:33
     14 2019-04-15T19:07:34
      5 2019-04-15T19:07:35
      6 2019-04-15T19:07:36
      7 2019-04-15T19:07:37
     93 2019-04-15T19:07:38
     69 2019-04-15T19:07:39
      1 2019-04-15T19:07:45
      2 2019-04-15T19:07:56
      1 2019-04-15T19:07:58
      1 2019-04-15T19:07:59
    108 2019-04-15T19:08:04
     95 2019-04-15T19:08:05
    170 2019-04-15T19:08:06
    108 2019-04-15T19:08:07
      1 2019-04-15T19:08:09
      1 2019-04-15T19:08:10
      1 2019-04-15T19:08:11
      1 2019-04-15T19:08:13
      1 2019-04-15T19:08:14
      1 2019-04-15T19:08:38

I have more data but am not yet much closer to understanding what is happening here. I also cannot reproduce this locally at all; the worker never misses heartbeats under load or otherwise in minikube.

Change 504346 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics Add wmfdebug and profile options for debugging

https://gerrit.wikimedia.org/r/504346

Change 504346 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics Add wmfdebug and profile options for debugging

https://gerrit.wikimedia.org/r/504346

Change 504418 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - debug_mode_enabled turns on Node inspector, profiler and wmfdebug sidecar

https://gerrit.wikimedia.org/r/504418

Change 504418 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - debug_mode_enabled turns on Node inspector, profiler and wmfdebug sidecar

https://gerrit.wikimedia.org/r/504418

Change 504419 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - also expose service-runner worker inspector port

https://gerrit.wikimedia.org/r/504419

Change 504419 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - also expose service-runner worker inspector port

https://gerrit.wikimedia.org/r/504419

I've been doing a lot to get more data, including enabling node profiling and connecting to the node inspector. I didn't learn much, except it confirmed that I don't think Kafka is the CPU culprit. I did see a good number of cycles being spent generating the x-request-id header using cassandra-uuid.TimeUuid. This won't be needed if we can ever get the x-request-id header set by varnish (T201409).

But I didn't see anything particularly abnormal, just a lot of time being spent by express parsing the JSON request bodies, which I guess makes sense.

Just to try, I increased k8s resource requests/limits to:

requests:
  cpu: 1000m
  memory: 100Mi
limits:
  cpu: 2000m
  memory: 200Mi

(from https://github.com/wikimedia/operations-deployment-charts/blob/5a678d3aaf428697ffe09634fee8b6be9d61a194/charts/eventgate-analytics/values.yaml#L32-L37 )

This allowed me to push ~1800 events/second with no missed heartbeats and no worker deaths! It seems clear that the CPU limits were too strict. Is it possible that the heartbeat gets missed while k8s attempts to increase CPU resources from the initial requested amount up to the limit?

@akosiaris are the new limits that I listed here ok?

> Just to try, I increased k8s resource requests/limits to: [...] This allowed me to push ~1800 events/second with no missed heartbeats and no worker deaths! It seems clear that the CPU limits were too strict.
>
> @akosiaris are the new limits that I listed here ok?

Nice find!

Bumping the limits stanza is fine, as it is actually a kubernetes-enforced (well, kernel-enforced, but anyway) limit and was by all accounts what was causing the issue. I am fine with both cpu: 2000m and memory: 200Mi for the limits stanza.

The requests stanza, however, is not a limit. It's a hint to the kubernetes scheduler about the expected size of a pod, so that the scheduler can make informed decisions about pod placement (effectively avoiding overcommitting a host). Increasing it above the actual expected usage will only cause a node to be underutilized. Without that hint (or with a bad hint that does not reflect reality), the scheduler will misbehave.

Now, there's some debate as to whether this should be the minimum numbers or the median/mean numbers (or even the maximum, but that approach decreases utilization in the interest of reliability and we aren't there yet). We've been doing the former up to now, as it was way easier given that we had no real traffic reaching the pods when coming up with those numbers, only artificially generated benchmarks.

But now we have numbers. Looking at grafana with irate(container_cpu_usage_seconds_total{job="k8s-node-cadvisor", namespace="$service", container_name=~"$service.*", pod_name=~"$service.*"}[2m]), it looks like CPU usage averages around 200m, so 1000m for requests is too much, although the original 100m was too low. Memory-wise, using container_memory_usage_bytes{namespace="eventgate-analytics", job="k8s-node-cadvisor", container_name="eventgate-analytics-staging"}, it does look like both the original 50Mi AND the newly proposed 100Mi are low. I would bump it to 150Mi.
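
For reference, the requests-vs-limits split being described, annotated with what each stanza actually does; the numbers follow this comment, while the finally adopted values appear further down in the task:

resources:
  requests:         # scheduler hint only, used for pod placement / avoiding overcommit
    cpu: 200m       # roughly the observed average usage
    memory: 150Mi
  limits:           # enforced at runtime: CPU via CFS throttling, memory via OOM kill
    cpu: 2000m
    memory: 200Mi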

Change 504547 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics - bump resources and add comments about debug mode

https://gerrit.wikimedia.org/r/504547

Change 504547 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics - bump resources and add comments about debug mode

https://gerrit.wikimedia.org/r/504547

After talking with Alex, we will be using:

requests:
  cpu: 200m
  memory: 150Mi
limits:
  cpu: 2000m
  memory: 300Mi

With these settings and ab -c1000, I can get around 2058.21 requests/second (mean) for hasty and 1705.77 requests/second (mean) for guaranteed from the staging pod :)

Ok prod updated with new limits!
I ran two ab runs from two different machines with -c1000. Both ab processes died due to socket timeouts / too many open sockets, but I was able to push prod to 26K msgs/second with CPU max ~1.2s. The current max expected usage of eventgate-analytics is around 10K msgs/second, so I think we are good to go!

@Pchelolo I'd be ok with trying group1 wikis again this afternoon if you have time!

I believe this can be closed now that we are easily sustaining almost 7k events per second in prod.