
Kafka sometimes fails to rebalance topics properly
Closed, ResolvedPublic

Description

On Nov 03 at 1 AM UTC a daily automatic Redis restart happened. Following it, ChangeProp emitted a lot of logs saying it couldn't connect to Redis. After that, change-prop workers stopped sending heartbeats en masse and were killed by their masters. The killed workers then respawned in bulk, so Kafka connection errors started occurring.

This ticket is created to find out why change-prop workers stopped sending heartbeats when they lost connection to Redis.

Event Timeline

One more piece of info: the page_edit rule stopped processing completely after midnight on Nov 14. There was one more instance of all workers dying at midnight: first a huge amount of Redis connection logs was emitted, and then KafkaConsumer not connected errors appeared on scb1002.

After a while the workers managed to start back up, but after that moment the page_edit rule stopped being processed. My current theory is that scb1002 was the node processing the page_edit rule, but after it died Kafka didn't properly rebalance the rule to a different worker.

The immediate solution is to restart change-prop so that all the rules can be properly rebalanced, but this is definitely not how it is supposed to work.

Mentioned in SAL (#wikimedia-operations) [2017-11-15T13:31:37Z] <ppchelko@tin> Started restart [changeprop/deploy@065a06e]: Restart to rebalance all rules T179684

I've conducted an experiment in deployment-prep. I connected ChangeProp directly to deployment-redis06 and generated some extensive load on ChangeProp. Then I killed a Redis instance and observed a bunch of Redis connection errors, but a very limited number compared to production. Also, the exception was coming from the Redis client's on_error handler, not from the code directly using Redis. However, during that time CP completely stopped processing incoming Kafka events. A theory was that the Redis client waits for the connection and stops the world. I changed the retry policy of Redis to never retry, and the issue with processing stopping was gone. So we should try to write a custom retry policy for production.
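A custom retry policy along these lines could keep the client from blocking the event loop indefinitely. This is a minimal sketch assuming a node_redis-style `retry_strategy` callback; the constants are illustrative, not the production configuration:

```javascript
// Sketch of a custom Redis retry policy, assuming a node_redis-style
// `retry_strategy` callback that receives an options object with
// `attempt` and `total_retry_time` (ms). Returning a number schedules the
// next reconnect attempt after that many ms; returning an Error makes the
// client fail fast instead of silently blocking event processing.
function redisRetryStrategy(options) {
    // Give up after roughly one minute of total retrying.
    if (options.total_retry_time > 60 * 1000) {
        return new Error('Redis retry time exhausted');
    }
    // Cap the number of reconnection attempts.
    if (options.attempt > 10) {
        return new Error('Redis retry attempts exhausted');
    }
    // Exponential backoff, capped at 3 seconds between attempts.
    return Math.min(100 * Math.pow(2, options.attempt), 3000);
}
```

With a policy like this, a Redis outage surfaces as an explicit client error after a bounded time instead of a stopped world.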

Also @elukey discovered that MW experiences the same connection issues when redis slaves restart, so that might be another place to look at.

hashar added a subscriber: hashar.

Can you please get Puppet enabled again on deployment-changeprop and deployment-redis06? There are pending Puppet modifications that we need applied on all hosts :]

The last Puppet run was at Thu Nov 16 09:08:12 UTC 2017 (5793 minutes ago). Puppet is disabled. Testing changeprop-redis issue T179684

@hashar Puppet has been re-enabled on both the change-prop and Redis hosts in deployment-prep and a Puppet run was made. I will be working more on this at some point, so I will disable it again for a while, but probably not today/tomorrow.

Change 391798 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::redis::jobqueue: stagger redis slave restarts

https://gerrit.wikimedia.org/r/391798

Change 391798 merged by Elukey:
[operations/puppet@production] profile::redis::jobqueue: stagger redis slave restarts

https://gerrit.wikimedia.org/r/391798

Change 392684 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::redis::jobqueue: fix default selector

https://gerrit.wikimedia.org/r/392684

Change 392684 merged by Elukey:
[operations/puppet@production] profile::redis::jobqueue: fix default selector

https://gerrit.wikimedia.org/r/392684

Change 393032 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/change-propagation/deploy@master] [Config] Increase worker_heartbeat_timeout

https://gerrit.wikimedia.org/r/393032

Change 393032 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] [Config] Increase worker_heartbeat_timeout

https://gerrit.wikimedia.org/r/393032

I believe I've identified the cause of this, and it's not really related to Redis. Whenever the Kafka consumer was disconnected from Kafka, we tried to reconnect in a tight loop, which caused workers to die. I believe the following patch will fix the issue: https://github.com/wikimedia/change-propagation/pull/229
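For reference, the idea behind the fix - replacing a tight reconnect loop with bounded, backed-off retries - can be sketched like this (the function names and constants are illustrative, not the actual patch):

```javascript
// Delay before the Nth reconnect attempt: exponential backoff with a cap,
// so a flapping broker can't make a worker spin in a tight loop.
function reconnectDelayMs(attempt, baseMs = 100, maxMs = 30000) {
    return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}

// Hypothetical reconnect driver: wait before each attempt instead of
// retrying immediately, and give up after a bounded number of attempts.
async function reconnectWithBackoff(connect, maxAttempts = 10) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await connect();
        } catch (err) {
            // Back off before the next attempt rather than looping tightly.
            await new Promise(res => setTimeout(res, reconnectDelayMs(attempt)));
        }
    }
    throw new Error('KafkaConsumer not connected: retries exhausted');
}
```

The key property is that a disconnected consumer now consumes negligible CPU while waiting, instead of starving the worker until it stops heartbeating.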

I won't resolve the ticket just now; let's deploy the patch and monitor whether workers continue dying.

After the deployment of https://github.com/wikimedia/change-propagation/pull/229 the death-spiral bug is no longer observed, so I'm resolving the ticket.

This happened again, so reopening. A couple of notes:

  1. The situation is better after deploying the fix: only one KafkaConsumer not connected error per consumer was emitted, so there was no tight loop this time.
  2. The workers, however, still died en masse.
  3. It's not clear whether it's related to Redis restarts - it should not be, but the timing matches exactly.
  4. The most worrying issue is that after the restart one of the JobQueue rules didn't get rebalanced to any worker, so it wasn't running for almost 24 hours until the workers were restarted again (and again the timing exactly matches the Redis restart timing).

Point number 4 is pretty critical, so I'm raising the bug priority to High.

Right now it's problematic to dig through the logs because it's hard to filter them by worker ID to understand the sequence of events for a particular worker. I will continue the investigation when we deploy better logging in service-runner.

For the record, this is still happening. Here's the last instance: a worker on scb1001 was in charge of the RecordLintJob topic. Then, scb1004 gets rebooted and the topic stops being processed. So the current theory is that when scb1004 got rebooted, the new workers subscribed to Kafka for the topic, which triggered a rebalance round. However, Kafka failed to assign (a correct/existing?) worker to the topic and it stopped being processed. The thing to note is that this usually happens only for one topic.

The incident happened on 2018-03-02T09:01:00UTC as witnessed on this graph (click on RecordLintJob)

mobrovac renamed this task from ChangeProp workers die if they can't connect to redis to Kafka sometimes misses to rebalance topics properly.Mar 5 2018, 6:09 PM

From kafka1002 (which acted as the Kafka group coordinator broker, afaics) I can see the following logs for change-prop-RecordLintJob:

[2018-03-02 08:59:03,965] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 270 (kafka.coordinator.GroupCoordinator)
[2018-03-02 08:59:33,952] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 271 (kafka.coordinator.GroupCoordinator)
[2018-03-02 08:59:34,048] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 271 (kafka.coordinator.GroupCoordinator)

[2018-03-02 09:00:03,969] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 271 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:00:04,966] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 272 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:00:04,970] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 272 (kafka.coordinator.GroupCoordinator)

[2018-03-02 09:02:00,586] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 272 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:01,585] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 273 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:01,589] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 273 (kafka.coordinator.GroupCoordinator)

[2018-03-02 09:02:02,138] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 273 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:03,590] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 274 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:03,595] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 274 (kafka.coordinator.GroupCoordinator)

[2018-03-02 09:02:03,685] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 274 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:05,597] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 275 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:05,601] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 275 (kafka.coordinator.GroupCoordinator)

[2018-03-02 09:02:06,808] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 275 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:07,807] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 276 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:07,816] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 276 (kafka.coordinator.GroupCoordinator)

[..]

[2018-03-02 09:02:44,920] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 293 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:46,476] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 294 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:46,487] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 294 (kafka.coordinator.GroupCoordinator)

[2018-03-02 09:02:47,873] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 294 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:49,399] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 295 (kafka.coordinator.GroupCoordinator)
[2018-03-02 09:02:49,406] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 295 (kafka.coordinator.GroupCoordinator)

The above logs are a bit weird, since we went through generations 275-295 for a consumer group that should read only one partition (so only one worker allowed to consume). How many workers try to join the consumer group in a scenario like the one that Marko described? Do we have any logs from change-prop for that time frame? I tried to check change-prop's logstash dashboard but didn't find much.
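One way to quantify the churn is to extract the generation numbers from the coordinator's "Stabilized group ... generation N" lines per group. A small sketch, with the log format taken from the GroupCoordinator output above:

```javascript
// Extract the generation numbers for a consumer group from Kafka
// GroupCoordinator log lines of the form:
//   ... Stabilized group <group> generation <N> (kafka.coordinator.GroupCoordinator)
// Each entry corresponds to one completed rebalance round.
function rebalanceGenerations(logLines, group) {
    const re = new RegExp(`Stabilized group ${group} generation (\\d+)`);
    const gens = [];
    for (const line of logLines) {
        const m = re.exec(line);
        if (m) {
            gens.push(Number(m[1]));
        }
    }
    return gens;
}
```

Running this over a broker log for a time window gives the number of rebalance rounds directly, which makes comparing an incident window against a normal deploy window straightforward.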

The example that Marko mentioned for the RecordLintJob seems to have started around 2018-03-02 (as outlined below) and completely recovered on the 5th:

https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?panelId=2&fullscreen&orgId=1&from=1519964066648&to=1520353762000

I checked the rebalance logs for the 5th and I can see:

[2018-03-05 13:37:50,781] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 295 (kafka.coordinator.GroupCoordinator)
[2018-03-05 13:38:20,760] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 296 (kafka.coordinator.GroupCoordinator)
[2018-03-05 13:38:20,770] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 296 (kafka.coordinator.GroupCoordinator)
[2018-03-05 13:38:21,142] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 296 (kafka.coordinator.GroupCoordinator)
[2018-03-05 13:38:50,762] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 297 (kafka.coordinator.GroupCoordinator)
[2018-03-05 13:38:50,767] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 297 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:31:38,879] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 297 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:32:08,879] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 298 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:32:08,885] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 298 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:32:09,133] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 298 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:32:38,882] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 299 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:32:38,896] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 299 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:45:34,139] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 299 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:46:04,139] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 300 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:46:04,149] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 300 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:46:04,199] INFO [GroupCoordinator 1002]: Preparing to restabilize group change-prop-RecordLintJob with old generation 300 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:46:34,140] INFO [GroupCoordinator 1002]: Stabilized group change-prop-RecordLintJob generation 301 (kafka.coordinator.GroupCoordinator)
[2018-03-05 15:46:34,154] INFO [GroupCoordinator 1002]: Assignment received from leader for group change-prop-RecordLintJob for generation 301 (kafka.coordinator.GroupCoordinator)

This time the generation went from 295 to 301, spanning over 2h. How did it recover? A ChangeProp restart?

I'd also need an example of a time window in which you did a deployment of ChangeProp, so I'll be able to compare the Kafka broker logs and see how many rebalances happened in that case.

This time the generation went from 295 to 301, spanning over 2h. How did it recover? A ChangeProp restart?

Yes, @mobrovac restarted it at 13:37:

The entries for 15:48 are me deploying it again:

So it seems that only 4 rebalance messages appear for a regular deploy, but the number for when CP died is significantly higher, going from generation 270 to 295, so it might be that too many rebalances make Kafka suffer.

I'd also need an example of a time window in which you did a deployment of ChangeProp, so I'll be able to compare the Kafka broker logs and see how many rebalances happened in that case.

These are start and end of the last deploy:

This happened again today with the on_transclusions_update group - it just stopped being consumed completely, without any visible reason. There are some logs regarding the topic this group was consuming and some messages about it being rebalanced, but no crazy multi-generation reassignment logs.

Also, I can't figure out why the rebalancing started in the first place - the change-prop workers don't seem to have died prior to the event.

Can you report a graph or a time window so we can investigate in the logs? main-eqiad?

Oh, sorry. It actually just happened again at 07:13 UTC:

Screen Shot 2018-05-08 at 6.30.18 AM.png (431×1 px, 130 KB)

Screen Shot 2018-05-08 at 6.32.05 AM.png (300×662 px, 53 KB)

The important thing here is in eqiad; codfw is just a continuation of that. Here's why: the transclusion rerenders work as follows:

  1. On page edit an event is emitted in eqiad; it's picked up by the page_edit rule in eqiad and sent to a special transclusion module.
  2. The module finds the first bulk of transclusions and sends resource-change events for each of them in eqiad to the change-prop.transcludes.resource-change topic - this is why on the graph production into that topic doesn't drop to 0.
  3. HERE IT IS BROKEN: The on_transclusion_update rule picks up transclusion update events, rerenders RESTBase in codfw, and then, on the last event of a bulk, when it finds a continue event, it posts a new bulk of transclusions.
  4. RESTBase in codfw on rerender posts resource-change events for derived content (like mobile stuff etc.) in codfw. But since the previous step is broken, there are no events produced in codfw.

So the core of the problem is that the on_transclusion_update rule for some reason stopped processing events.
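The continue-driven loop described in steps 2-3 can be sketched as follows; `fetchTransclusions`, the event shape, and the handler are hypothetical names for illustration, not the actual change-prop code:

```javascript
// Sketch of the continue-driven transclusion rerender loop: each bulk of
// resource-change events carries a continue token, and processing the last
// event of a bulk triggers fetching the next one.
// fetchTransclusions(token) -> { events: [...], continue: token | undefined }
async function processAllTransclusions(fetchTransclusions, handleEvent) {
    let continueToken = undefined;
    do {
        const bulk = await fetchTransclusions(continueToken);
        for (const event of bulk.events) {
            await handleEvent(event);
        }
        continueToken = bulk.continue;
        // If the consumer of these events stops (as on_transclusion_update
        // did), the chain never advances past the current bulk, which
        // matches the observed outage: production into the topic continues,
        // but nothing downstream happens.
    } while (continueToken !== undefined);
}
```

This makes the failure mode clear: the pipeline is self-propelled, so a single stalled consumer silences the entire downstream chain rather than just one bulk.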

Pchelolo edited projects, added Services (done); removed Services (doing).

This did not happen for a while, resolving. Please reopen if it happens again.