Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo
Closed, Resolved | Public | 13 Estimated Story Points
Authored By

	elukey
	Mar 12 2018, 11:32 AM

Description

We have recently had some issues with Mirror Maker that seem to be related to the switch of Monolog from Kafka Analytics to Jumbo.

Mirror Maker seems to have some issues when producing/replicating topics from kafka main-eqiad to jumbo. This is a snippet of logs from kafka1020:

Mar 12 11:00:49 kafka1020 kafka-mirror-maker[31938]: [2018-03-12 11:00:49,500] ERROR Error when sending message to topic codfw.change-prop.transcludes.resource-change with key: null
Mar 12 11:00:49 kafka1020 kafka-mirror-maker[31938]: [2018-03-12 11:00:49,500] ERROR Error when sending message to topic codfw.change-prop.transcludes.resource-change with key: null
Mar 12 11:00:49 kafka1020 kafka-mirror-maker[31938]: [2018-03-12 11:00:49,501] ERROR Error when sending message to topic codfw.change-prop.transcludes.resource-change with key: null
Mar 12 11:00:49 kafka1020 kafka-mirror-maker[31938]: [2018-03-12 11:00:49,501] ERROR Error when sending message to topic codfw.change-prop.transcludes.resource-change with key: null
Mar 12 11:00:50 kafka1020 kafka-mirror-maker[31938]: where num_of_file > 0
Mar 12 11:00:50 kafka1020 kafka-mirror-maker[31938]: GC log rotation is turned off
Mar 12 11:00:50 kafka1020 systemd[1]: kafka-mirror-main-eqiad_to_jumbo-eqiad.service: main process exited, code=exited, status=255/n/a
Mar 12 11:00:50 kafka1020 systemd[1]: Unit kafka-mirror-main-eqiad_to_jumbo-eqiad.service entered failed state.
Mar 12 11:00:53 kafka1020 systemd[1]: kafka-mirror-main-eqiad_to_jumbo-eqiad.service holdoff time over, scheduling restart.
Mar 12 11:00:53 kafka1020 systemd[1]: Stopping Kafka MirrorMaker Instance of main-eqiad_to_jumbo-eqiad...
Mar 12 11:00:53 kafka1020 systemd[1]: Starting Kafka MirrorMaker Instance of main-eqiad_to_jumbo-eqiad...
Mar 12 11:00:53 kafka1020 systemd[1]: Started Kafka MirrorMaker Instance of main-eqiad_to_jumbo-eqiad.
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: Exception in thread "mirrormaker-thread-0" kafka.common.ConsumerRebalanceFailedException: kafka-mirror-main-eqiad_to_jumbo-eqiad
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:660)
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerCo
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:1001)
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:163)
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: at kafka.tools.MirrorMaker$MirrorMakerOldConsumer.init(MirrorMaker.scala:477)
Mar 12 11:01:25 kafka1020 kafka-mirror-maker[34099]: at kafka.tools.MirrorMaker$MirrorMakerThread.run(MirrorMaker.scala:388)
Mar 12 11:25:34 kafka1020 kafka-mirror-maker[34099]: [2018-03-12 11:25:34,881] WARN No broker partitions consumed by consumer thread kafka-mirror-main-eqiad_to_jumbo-eqiad_kafka1020
Mar 12 11:25:34 kafka1020 kafka-mirror-maker[34099]: [2018-03-12 11:25:34,881] WARN No broker partitions consumed by consumer thread kafka-mirror-main-eqiad_to_jumbo-eqiad_kafka1022
Mar 12 11:25:34 kafka1020 kafka-mirror-maker[34099]: [2018-03-12 11:25:34,881] WARN No broker partitions consumed by consumer thread kafka-mirror-main-eqiad_to_jumbo-eqiad_kafka1023

Yesterday (2018-03-11) at around 20:00 UTC, Mirror Maker stopped producing to Jumbo. I restarted Mirror Maker on kafka1020 and it seemed to work, but we ended up in the same situation. I am restarting it again now to see if things improve (11:31 UTC, 2018-03-12).

Details

Subject	Repo	Branch	Lines +/-
Increase Kafka MirrorMaker max.request.size to 5.5Mb	operations/puppet	production	+2 -2
Increase MirrorMaker max request size to message.max.bytes + 1Mb	operations/puppet	production	+3 -3
Re-enable job topic mirroring main-eqiad -> jumbo	operations/puppet	production	+2 -5
Use mirror_name label for produce rate alert	operations/puppet	production	+1 -2
Remove profile::kafka::mirror from role analytics b	operations/puppet	production	+0 -1
Remove MirrorMaker configs from analytics_b hosts	operations/puppet	production	+0 -44
Enable 1.1.0 MirrorMaker main-eqiad -> jumbo-eqiad	operations/puppet	production	+29 -28
Blacklisting change-prop and job topics from main -> analytics Mirror	operations/puppet	production	+4 -3
Blacklist jobqueue topics for main -> jumbo mirrormaker (again)	operations/puppet	production	+1 -1
Bump request.timeout.ms and batch.size for main -> jumbo MirrorMaker	operations/puppet	production	+5 -0
Replicate job queue topics main -> jumbo	operations/puppet	production	+1 -1
Use $mirror_name in produce rate alert	operations/puppet	production	+1 -1
Can't use ':' in client.id	operations/puppet	production	+3 -3
Capture client_id from prometheus with : in the name	operations/puppet	production	+2 -2
Fix client_id with ${::hostname} MirrorMaker	operations/puppet	production	+1 -1
Use consistent client.id for mirrormaker producer and consumer	operations/puppet	production	+6 -1
Increase the number of main->jumbo MirrorMaker process to 4 per host	operations/puppet	production	+2 -2
Allow '@' characters in client-id prometheus jmx matching for MirrorMaker	operations/puppet	production	+2 -2
Multi process MirrorMaker	operations/puppet	production	+167 -115
Blacklist job topics from main -> jumbo mirrormaker	operations/puppet	production	+1 -1
Increase main -> jumbo MirrorMaker num.streams to 12	operations/puppet	production	+2 -2
Increase MirrorMaker main -> jumbo heap size	operations/puppet	production	+2 -3
Replicate everything except change-prop and internal topics from main to jumbo	operations/puppet	production	+3 -1
Add prometheus::jmx_exporter_config for main -> jumbo MirrorMaker	operations/puppet	production	+10 -0
Use profile::kafka::mirror for main -> jumbo	operations/puppet	production	+44 -90
Use --new.consumer for main -> jumbo mirror maker	operations/puppet	production	+30 -8
Exclude change-prop topics from main -> jumbo MirrorMaker	operations/puppet	production	+1 -1
Whitelist not set, but still getting validation error from puppet, trying false	operations/puppet	production	+2 -2
Blacklist mediawiki.job topics from replication main -> jumbo	operations/puppet	production	+9 -0
Use roundrobin partition.assignment.strategy for Kafka MirrorMaker	operations/puppet	production	+8 -2

Related Objects

Status	Assigned	Task
Declined	elukey	T166833 Produce webrequests from varnishkafka to Kafka with Kafka message timestamp set to configurable content field
Resolved	Ottomata	T152015 Provision new Kafka cluster(s) with security features
Resolved	Lucas_Werkmeister_WMDE	T145712 Use RDF statement counts from entity data, not page props ( wikibase:identifiers, wikibase:statements and wikibase:sitelinks )
Resolved	Ottomata	T161731 Create reliable change stream for specific wiki
Resolved	Ottomata	T183303 Decomission old analytics kafka cluster
Resolved	Ottomata	T175461 Port Kafka clients to new jumbo cluster
Resolved	Gehel	T189458 re-enable wdqs kafka poller
Resolved	Ottomata	T189464 Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo
Resolved	Ottomata	T189611 Alert for Kafka MirrorMaker lag
Resolved	Ottomata	T190049 Spike: Consider alternatives to MirrorMaker: uReplicator, Confluent Replicator

Event Timeline

There are a very large number of changes, so older changes are hidden.

Not sure what is happening still.

I thought that I had reduced the Batch Expired error by setting acks=1. But, I still saw the error. The instances flapped, but since then I've seen no errors and things seem to be fine. Let's keep watching.

Change 418934 merged by Ottomata:
[operations/puppet@production] Use roundrobin partition.assignment.strategy for Kafka MirrorMaker

https://gerrit.wikimedia.org/r/418934
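The thread does not say why roundrobin was chosen here. One plausible reason, offered only as an assumption, is that a range-style strategy assigns partitions topic by topic, so with many single-partition topics the first consumer thread ends up with everything and the others sit idle (compare the "No broker partitions consumed by consumer thread" warnings in the description). The sketch below is a simplified model of the two strategies, not Kafka's actual implementation; the consumer and topic names are hypothetical.

```python
from itertools import cycle


def range_assign(consumers, topics):
    """Simplified range strategy: split each topic's partitions separately.

    topics maps a topic name to its partition count.
    """
    members = sorted(consumers)
    assignment = {c: [] for c in members}
    for topic in sorted(topics):
        per_member, extra = divmod(topics[topic], len(members))
        start = 0
        for i, member in enumerate(members):
            # The first `extra` members get one additional partition.
            count = per_member + (1 if i < extra else 0)
            assignment[member].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment


def roundrobin_assign(consumers, topics):
    """Simplified roundrobin strategy: deal all topic-partitions out in turn."""
    assignment = {c: [] for c in sorted(consumers)}
    members = cycle(sorted(consumers))
    for topic in sorted(topics):
        for partition in range(topics[topic]):
            assignment[next(members)].append((topic, partition))
    return assignment


# Hypothetical example: three threads, three single-partition topics.
threads = ["mm-thread-0", "mm-thread-1", "mm-thread-2"]
single_partition_topics = {"topic-a": 1, "topic-b": 1, "topic-c": 1}
print(range_assign(threads, single_partition_topics))
print(roundrobin_assign(threads, single_partition_topics))
```

In this toy case the range model gives all three partitions to the first thread, while the roundrobin model gives each thread one.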

Ottomata added a subtask: T189611: Alert for Kafka MirrorMaker lag. Mar 13 2018, 6:14 PM

Change 419472 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist mediawiki.job topics from replication main -> jumbo

https://gerrit.wikimedia.org/r/419472

Change 419472 merged by Ottomata:
[operations/puppet@production] Blacklist mediawiki.job topics from replication main -> jumbo

https://gerrit.wikimedia.org/r/419472

Change 419475 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Whitelist not set, but still getting validation error from puppet, trying false

https://gerrit.wikimedia.org/r/419475

Change 419475 merged by Ottomata:
[operations/puppet@production] Whitelist not set, but still getting validation error from puppet, trying false

https://gerrit.wikimedia.org/r/419475

Ottomata mentioned this in T189716: Migrate EventStreams to Kafka Jumbo. Mar 14 2018, 5:56 PM

Ottomata moved this task from Incoming to Kafka Work on the Analytics board. Mar 15 2018, 4:35 PM

Ottomata added a project: Analytics-Kanban.

Ottomata claimed this task. Mar 16 2018, 8:17 PM

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 420117 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Exclude change-prop topics from main -> jumbo MirrorMaker

https://gerrit.wikimedia.org/r/420117

Change 420117 merged by Ottomata:
[operations/puppet@production] Exclude change-prop topics from main -> jumbo MirrorMaker

https://gerrit.wikimedia.org/r/420117

Ottomata mentioned this in T185229: Investigate why data was missing from mediawiki events around January 3rd. Mar 16 2018, 8:30 PM

Ottomata mentioned this in T190049: Spike: Consider alternatives to MirrorMaker: uReplicator, Confluent Replicator. Mar 19 2018, 2:31 PM

Change 421617 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use --new.consumer for main -> jumbo mirror maker

https://gerrit.wikimedia.org/r/421617

Change 421617 merged by Ottomata:
[operations/puppet@production] Use --new.consumer for main -> jumbo mirror maker

https://gerrit.wikimedia.org/r/421617

Change 421896 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use profile::kafka::mirror for main -> jumbo

https://gerrit.wikimedia.org/r/421896

Change 421896 merged by Ottomata:
[operations/puppet@production] Use profile::kafka::mirror for main -> jumbo

https://gerrit.wikimedia.org/r/421896

Change 421911 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add prometheus::jmx_exporter_config for main -> jumbo MirrorMaker

https://gerrit.wikimedia.org/r/421911

Change 421911 merged by Ottomata:
[operations/puppet@production] Add prometheus::jmx_exporter_config for main -> jumbo MirrorMaker

https://gerrit.wikimedia.org/r/421911

--new.consumer seems to behave wayyyy better than the old one. No wonder they stopped relying on ZK for consumer rebalance.

Check out the new dash @elukey https://grafana-admin.wikimedia.org/dashboard/db/kafka-mirrormaker-new-consumer !

Tomorrow I'd like to try re-enabling the higher volume topics.

Change 422408 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Replicate everything except change-prop and internal topics from main to jumbo

https://gerrit.wikimedia.org/r/422408

Change 422408 merged by Ottomata:
[operations/puppet@production] Replicate everything except change-prop and internal topics from main to jumbo

https://gerrit.wikimedia.org/r/422408

Change 422431 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Increase MirrorMaker main -> jumbo heap size

https://gerrit.wikimedia.org/r/422431

Change 422431 merged by Ottomata:
[operations/puppet@production] Increase MirrorMaker main -> jumbo heap size

https://gerrit.wikimedia.org/r/422431

Change 422473 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Increase main -> jumbo MirrorMaker num.streams to 12

https://gerrit.wikimedia.org/r/422473

Change 422473 merged by Ottomata:
[operations/puppet@production] Increase main -> jumbo MirrorMaker num.streams to 12

https://gerrit.wikimedia.org/r/422473

Change 423033 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist job topics from main -> jumbo mirrormaker

https://gerrit.wikimedia.org/r/423033

Change 423033 merged by Ottomata:
[operations/puppet@production] Blacklist job topics from main -> jumbo mirrormaker

https://gerrit.wikimedia.org/r/423033

Mentioned in SAL (#wikimedia-analytics) [2018-03-29T20:12:26Z] <ottomata> blacklisted mediawiki.job topics from main -> jumbo MirrorMaker again, don't want to page over the weekend while this still is not stable. T189464

Change 423529 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Multi process MirrorMaker

https://gerrit.wikimedia.org/r/423529

Change 423529 merged by Ottomata:
[operations/puppet@production] Multi process MirrorMaker

https://gerrit.wikimedia.org/r/423529

Change 423534 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow '@' characters in client-id prometheus jmx matching for MirrorMaker

https://gerrit.wikimedia.org/r/423534

Change 423534 merged by Ottomata:
[operations/puppet@production] Allow '@' characters in client-id prometheus jmx matching for MirrorMaker

https://gerrit.wikimedia.org/r/423534

Change 423538 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Increase the number of main->jumbo MirrorMaker process to 4 per host

https://gerrit.wikimedia.org/r/423538

Change 423538 merged by Ottomata:
[operations/puppet@production] Increase the number of main->jumbo MirrorMaker process to 4 per host

https://gerrit.wikimedia.org/r/423538

Change 423570 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use consistent client.id for mirrormaker producer and consumer

https://gerrit.wikimedia.org/r/423570

Change 423570 merged by Ottomata:
[operations/puppet@production] Use consistent client.id for mirrormaker producer and consumer

https://gerrit.wikimedia.org/r/423570

Change 423573 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix client_id with ${::hostname} MirrorMaker

https://gerrit.wikimedia.org/r/423573

Change 423573 merged by Ottomata:
[operations/puppet@production] Fix client_id with ${::hostname} MirrorMaker

https://gerrit.wikimedia.org/r/423573

Change 423575 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Capture client_id from prometheus with : in the name

https://gerrit.wikimedia.org/r/423575

Change 423575 merged by Ottomata:
[operations/puppet@production] Capture client_id from prometheus with : in the name

https://gerrit.wikimedia.org/r/423575

Change 423578 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Can't use ':' in client.id

https://gerrit.wikimedia.org/r/423578

Change 423578 merged by Ottomata:
[operations/puppet@production] Can't use ':' in client.id

https://gerrit.wikimedia.org/r/423578

Change 423685 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use $mirror_name in produce rate alert

https://gerrit.wikimedia.org/r/423685

Change 423685 merged by Ottomata:
[operations/puppet@production] Use $mirror_name in produce rate alert

https://gerrit.wikimedia.org/r/423685

Change 423695 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Replicate job queue topics main -> jumbo

https://gerrit.wikimedia.org/r/423695

Change 423695 merged by Ottomata:
[operations/puppet@production] Replicate job queue topics main -> jumbo

https://gerrit.wikimedia.org/r/423695

@elukey here are some things I've learned, mostly from https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer

In 0.9, request.timeout.ms is overloaded to 'expire requests in the accumulator as well' (accumulator here is synonymous with the batch buffer). 'The clock starts ticking when the batch is ready.' Also, 'When the batch gets sent out on the wire, we reset the clock for the actual wire timeout request.timeout.ms.'

So request.timeout.ms is actually used for 2 different things! It is used to expire batches that are waiting to be sent (in our case there are likely many of these, since we have max.in.flight.requests=1) AND it is used to expire the in flight batch as it is waiting for ACKs from the brokers. If I understand this correctly, a single batch (once it is ready for sending) can take up to 2*request.timeout.ms to actually be produced. This is an extreme case, as ideally the in flight batch takes less than this to return (the max we've ever seen is around 1.3 seconds, but usually this is more like 70 ms). So, the bit we care about is the 'ready but waiting to be sent'.

Since the timeout DOES apply to waiting batches, I think you are right that having fewer batches to send would actually help here. For high volume topics, we will quickly create batches that are ready to be sent, and each one of those will take a few ms to complete. If some burst (maybe a bunch of very large messages?) causes a batch to take a while to ACK, then this could easily expire many of the batches waiting behind it.
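To make the expiry mechanism described above concrete, here is a minimal sketch of it as a toy model. It assumes a single in-flight request at a time and applies one timeout both to batches waiting in the accumulator and to the batch on the wire, as the comment describes for 0.9. The function name, the batch timings and the 25 second stalls are all hypothetical, and this is not the producer's actual code.

```python
def simulate(batches, request_timeout_ms):
    """Toy model of a producer with one in-flight request at a time.

    batches is a list of (ready_at_ms, ack_ms) tuples in ready order.
    Returns one outcome string per batch.
    """
    outcomes = []
    wire_free_at = 0  # time at which the previous request leaves the wire
    for ready_at, ack_ms in batches:
        send_at = max(ready_at, wire_free_at)
        if send_at - ready_at > request_timeout_ms:
            # The batch sat in the accumulator for longer than the timeout.
            outcomes.append("expired_waiting")
            continue
        if ack_ms > request_timeout_ms:
            # The clock was reset when the batch went out, then ran out again.
            outcomes.append("timed_out_in_flight")
            wire_free_at = send_at + request_timeout_ms
        else:
            outcomes.append("sent")
            wire_free_at = send_at + ack_ms
    return outcomes


# Hypothetical burst: two slow ACKs of 25 s each, then ordinary 70 ms batches.
burst = [(0, 25_000), (100, 25_000), (200, 70), (300, 70), (400, 70)]
print(simulate(burst, request_timeout_ms=30_000))
print(simulate(burst, request_timeout_ms=120_000))
```

With a 30 second timeout, neither slow batch times out on the wire, yet the three ordinary batches queued behind them expire while waiting. With a 2 minute timeout, every batch in this example is sent.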

The article I linked to basically says that the only thing we can do in 0.9 to avoid ready batches expiring is to increase request.timeout.ms, like you also suggested. However:

'Bumping up request timeout does not work well because that is an artificial way of dealing with the lack of an accumulator timeout. Setting it high will increase the time to detect broker failures.'

I've done it anyway, and set the producer's request.timeout.ms to 2 minutes. This actually seems to have worked, and I am no longer seeing expired batches. However, I believe that producer.send() must block at some point, probably when the batch accumulator buffer is full. While it is blocking, the consumer's poll() is not being called. The consumer has a session.timeout.ms setting, which defaults to 30s. Whatever is occasionally blocking seems to block for longer than those 30s, which causes the consumer group coordinator to think that the blocked consumer is no longer alive. This triggers a rebalance. I believe that I could increase the value of session.timeout.ms to close to our new 2 minute value for request.timeout.ms and we'd avoid these rebalances, but 0.9 brokers have a group.max.session.timeout.ms=30000 setting. We could up the max on the main eqiad brokers, but now we are starting to get silly.
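The interaction between the two settings can be sketched as a pair of checks. This is a rough model under stated assumptions: a consumer that stays out of poll() for longer than session.timeout.ms is treated as dead, and the broker refuses session timeouts outside its configured bounds. The 30000 ms maximum is the value quoted above; the 6000 ms minimum is assumed to be the broker default, and the function names and the 45 second stall are hypothetical.

```python
GROUP_MIN_SESSION_TIMEOUT_MS = 6_000   # assumption: broker default, not stated in the thread
GROUP_MAX_SESSION_TIMEOUT_MS = 30_000  # group.max.session.timeout.ms quoted in the thread


def session_timeout_allowed(requested_ms,
                            min_ms=GROUP_MIN_SESSION_TIMEOUT_MS,
                            max_ms=GROUP_MAX_SESSION_TIMEOUT_MS):
    """True if the broker bounds would accept this session.timeout.ms."""
    return min_ms <= requested_ms <= max_ms


def triggers_rebalance(blocked_ms, session_timeout_ms):
    """True if a stall between poll() calls outlasts the session timeout."""
    return blocked_ms > session_timeout_ms


# A hypothetical 45 s stall in producer.send() with the 30 s session timeout.
print(triggers_rebalance(blocked_ms=45_000, session_timeout_ms=30_000))
# Asking for a session timeout close to the new 2 minute request timeout.
print(session_timeout_allowed(120_000))
# The same request if the broker maximum were raised to 2 minutes.
print(session_timeout_allowed(120_000, max_ms=120_000))
```

The second check is why raising session.timeout.ms on the MirrorMaker side alone would not be enough: the broker-side maximum would have to be raised first.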

I believe that https://issues.apache.org/jira/browse/KAFKA-3388 which is fixed in 0.10 will help our situation here, as in most cases, the ready waiting batches won't be expired because of request.timeout.ms anymore.

Everything written in KIP-91 would likely help even more, but it looks as though the patch has been stale for a while, waiting for some love.

So anyway, what to do!? I'm going to puppetize the producer request.timeout.ms increase now, since that seems to at least keep things moving, even if we do have unnecessary periodic rebalances. Even so, I'm reluctant to move forward with the Kafka client migration until we upgrade the Kafka main cluster to 1.x and get a more recent MirrorMaker that has solved at least a few of these problems.

Change 423781 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump request.timeout.ms and batch.size for main -> jumbo MirrorMaker

https://gerrit.wikimedia.org/r/423781

Change 423781 merged by Ottomata:
[operations/puppet@production] Bump request.timeout.ms and batch.size for main -> jumbo MirrorMaker

https://gerrit.wikimedia.org/r/423781

Alright! Since the changes we made a couple of days ago, main -> jumbo has been mostly stable. I had considered sending some more change-prop traffic through it to try and overload the instance, but the change-prop topics (at the moment at least) are not that busy, and I'm pretty sure it is high volume in individual topic-partitions that causes the timeouts.

So, while MirrorMaker should be mostly stable now, I don't see a huge reason to be hasty with the rest of the Kafka analytics to jumbo client migration. We plan to upgrade Kafka main this quarter, so I suggest we focus on that, and then in the process upgrade MirrorMaker as well. Then we can finish the migration of analytics cluster clients to jumbo.

@Smalyshev, this means: use at your own risk! :) The mediawiki topics you want to subscribe to should be stable now. (BTW, we have some much nicer dashboarding and alerting now: https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker-new-consumer?refresh=5m&orgId=1).

We have the legroom to wait until we also upgrade Kafka main clusters and MirrorMaker before we start relying on the mediawiki mirrored topics in Jumbo. But, I leave it up to you if you want to wait for that too.

Ottomata moved this task from In Progress to Done on the Analytics-Kanban board. Apr 5 2018, 3:34 PM

Ottomata set the point value for this task to 13.

Change 425410 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist jobqueue topics for main -> jumbo mirrormaker (again)

https://gerrit.wikimedia.org/r/425410

Ottomata moved this task from Done to In Progress on the Analytics-Kanban board. Apr 10 2018, 8:37 PM

Change 425410 merged by Ottomata:
[operations/puppet@production] Blacklist jobqueue topics for main -> jumbo mirrormaker (again)

https://gerrit.wikimedia.org/r/425410

Nuria closed subtask T190049: Spike: Consider alternatives to MirrorMaker: uReplicator, Confluent Replicator as Resolved. Apr 12 2018, 10:07 PM

Nuria closed subtask T189611: Alert for Kafka MirrorMaker lag as Resolved.

Ottomata moved this task from In Progress to Next Up on the Analytics-Kanban board. Apr 17 2018, 3:14 PM

Ottomata moved this task from Next Up to Paused on the Analytics-Kanban board.

elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Apr 19 2018, 2:46 PM

Change 430006 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklisting change-prop and job topics from main -> analytics Mirror

https://gerrit.wikimedia.org/r/430006

Change 430006 merged by Ottomata:
[operations/puppet@production] Blacklisting change-prop and job topics from main -> analytics Mirror

https://gerrit.wikimedia.org/r/430006

Ottomata moved this task from Paused to In Progress on the Analytics-Kanban board. May 9 2018, 4:43 PM

Change 432120 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Enable 1.1.0 MirrorMaker main-eqiad -> jumbo-eqiad

https://gerrit.wikimedia.org/r/432120

Change 432120 merged by Ottomata:
[operations/puppet@production] Enable 1.1.0 MirrorMaker main-eqiad -> jumbo-eqiad

https://gerrit.wikimedia.org/r/432120

Change 432124 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove MirrorMaker configs from analytics_b hosts

https://gerrit.wikimedia.org/r/432124

Change 432124 merged by Ottomata:
[operations/puppet@production] Remove MirrorMaker configs from analytics_b hosts

https://gerrit.wikimedia.org/r/432124

Change 432125 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove profile::kafka::mirror from role analytics b

https://gerrit.wikimedia.org/r/432125

Change 432125 merged by Ottomata:
[operations/puppet@production] Remove profile::kafka::mirror from role analytics b

https://gerrit.wikimedia.org/r/432125

Change 432127 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use mirror_name label for produce rate alert

https://gerrit.wikimedia.org/r/432127

Change 432127 merged by Ottomata:
[operations/puppet@production] Use mirror_name label for produce rate alert

https://gerrit.wikimedia.org/r/432127

Change 433005 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable job topic mirroring main-eqiad -> jumbo

https://gerrit.wikimedia.org/r/433005

Change 433005 merged by Ottomata:
[operations/puppet@production] Re-enable job topic mirroring main-eqiad -> jumbo

https://gerrit.wikimedia.org/r/433005

Hm, am seeing

[2018-05-14 22:18:20,458] 17217831 [mirrormaker-thread-6] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  - Error when sending message to topic eqiad.mediawiki.job.RecordLintJob with key: null, value: 5191699 bytes with error:
org.apache.kafka.common.errors.RecordTooLargeException: The message is 5191787 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.

followed by lots of

[2018-05-14 22:18:20,495] 17217868 [kafka-producer-network-thread | kafka-mirror-kafka-jumbo1001-main-eqiad_to_jumbo-eqiad@0] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  - Error when sending message to topic eqiad.mediawiki.job.htmlCacheUpdate with key: null, value: 963 bytes with error:
java.lang.IllegalStateException: Producer is closed forcefully.
        at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:696)
        at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:683)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:190)
        at java.lang.Thread.run(Thread.java:748)

I don't know how such a record could make it into Kafka at all, since we have max bytes configured to 4242868 everywhere.

It seems this topic is stuck, since it can't proceed beyond this record. This causes MM instances to die when they are assigned this topic partition, so they are flapping everywhere. Hm.

I just recommitted the offset for eqiad.mediawiki.job.RecordLintJob, hopefully that will skip this weird message for now...

Change 433092 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Increase MirrorMaker max request size to message.max.bytes + 1Mb

https://gerrit.wikimedia.org/r/433092

Change 433092 merged by Ottomata:
[operations/puppet@production] Increase MirrorMaker max request size to message.max.bytes + 1Mb

https://gerrit.wikimedia.org/r/433092

Ottomata moved this task from In Progress to In Code Review on the Analytics-Kanban board. May 15 2018, 2:21 PM

Just got another

[2018-05-21 00:05:59,778] 2363 [mirrormaker-thread-5] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  - Error when sending message to topic eqiad.mediawiki.job.cirrusSearchElasticaWrite with key: null, value: 5272592 bytes with error:
org.apache.kafka.common.errors.RecordTooLargeException: The message is 5272680 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.

I am not sure how these large messages are even making it into Kafka main-eqiad in the first place. Increasing the MirrorMaker producer max.request.size again...
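The sizes in the two RecordTooLargeException messages can be checked against candidate limits with a little arithmetic. The sketch below uses only the byte counts quoted in the logs and the 4242868 figure mentioned earlier. The thread does not state the exact max.request.size values that were configured, so the decimal and binary readings of "+ 1Mb" and the decimal reading of "5.5Mb" are assumptions, and the helper name is hypothetical.

```python
MESSAGE_MAX_BYTES = 4_242_868  # "max bytes" value quoted in the thread

# Serialized sizes reported in the two RecordTooLargeException log lines.
observed = {
    "eqiad.mediawiki.job.RecordLintJob": 5_191_787,
    "eqiad.mediawiki.job.cirrusSearchElasticaWrite": 5_272_680,
}

# Candidate producer limits; every value except the first is an assumption.
candidates = {
    "message.max.bytes": MESSAGE_MAX_BYTES,
    "message.max.bytes + 10**6": MESSAGE_MAX_BYTES + 10**6,
    "message.max.bytes + 2**20": MESSAGE_MAX_BYTES + 2**20,
    "5.5 * 10**6": 5_500_000,
}


def fits(serialized_bytes, max_request_size):
    """True if a record of this serialized size is within max.request.size."""
    return serialized_bytes <= max_request_size


for label, limit in candidates.items():
    for topic, size in observed.items():
        verdict = "fits" if fits(size, limit) else "too large"
        print(f"{label} ({limit}): {topic} ({size}) {verdict}")
```

Neither message fits under 4242868. Under the decimal reading of "+ 1Mb" (5242868) the first message fits but the second does not, which would be consistent with a second increase being needed, though that reading is a guess.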

Change 434290 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Increase Kafka MirrorMaker max.request.size to 5.5Mb

https://gerrit.wikimedia.org/r/434290

Change 434290 merged by Ottomata:
[operations/puppet@production] Increase Kafka MirrorMaker max.request.size to 5.5Mb

https://gerrit.wikimedia.org/r/434290

Mentioned in SAL (#wikimedia-analytics) [2018-05-21T01:20:39Z] <ottomata> bouncing main -> jumbo MirrorMaker with increased max.request.size - T189464

Mentioned in SAL (#wikimedia-operations) [2018-05-21T01:20:52Z] <ottomata> bouncing main -> jumbo MirrorMaker with increased max.request.size - T189464

Ottomata moved this task from In Code Review to Done on the Analytics-Kanban board. May 22 2018, 3:05 PM

Ottomata moved this task from Done to In Code Review on the Analytics-Kanban board. Jun 6 2018, 4:05 PM

Ottomata moved this task from In Code Review to Done on the Analytics-Kanban board. Jun 7 2018, 4:02 PM

Nuria closed this task as Resolved. Jun 11 2018, 11:04 PM

Aklapper removed a project: Analytics. Jul 4 2020, 7:59 AM
