
Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics
Closed, ResolvedPublic

Description

Several times, I've encountered huge messages in CirrusSearch-related job queue topics. These are causing MirrorMaker instances to fail, because the messages are larger than our max.request.size of 5.5MB. Our supposedly configured Kafka max.message.size is 4MB. An individual produce request might be larger than 4MB due to metadata, etc., but not 1.5MB larger. I've increased max.request.size several times since encountering this, but now I think it's time to figure out how these messages are making it into Kafka in the first place.

Somehow, an 8MB message is making it into the cirrusSearchElasticaWrite job topic. This shouldn't happen.

[2018-05-31 01:09:05,546] 311676 [mirrormaker-thread-5] ERROR org.apache.kafka.clients.producer.internals.ErrorLoggingCallback  - Error when sending message to topic eqiad.mediawiki.job.cirrusSearchElasticaWrite with key: null, value: 8619618 bytes with error:
org.apache.kafka.common.errors.RecordTooLargeException: The message is 8619706 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
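(For reference: a minimal sketch of how one could scan the topic for oversized records, assuming kafka-python is available. The broker address is a placeholder and the 4MB threshold simply mirrors the limit mentioned above; this is not the actual production tooling.)

```python
# Sketch only: scan a topic for records above a size threshold.
# Assumptions: kafka-python is installed; broker/topic below are placeholders.
from kafka import KafkaConsumer

BROKER = 'localhost:9092'  # placeholder, not the real cluster address
TOPIC = 'eqiad.mediawiki.job.cirrusSearchElasticaWrite'
THRESHOLD = 4 * 1024 * 1024  # roughly the configured broker message size limit

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    consumer_timeout_ms=30000,
    # raise fetch limits so oversized records can actually be read back
    max_partition_fetch_bytes=16 * 1024 * 1024,
    fetch_max_bytes=16 * 1024 * 1024,
)

for record in consumer:
    size = len(record.value or b'')
    if size > THRESHOLD:
        print('partition=%d offset=%d bytes=%d' % (record.partition, record.offset, size))
```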

Event Timeline

Change 436436 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Blacklist cirrusSearch.* job queue topics from main-eqiad -> jumbo-eqiad

https://gerrit.wikimedia.org/r/436436

Change 436436 merged by Ottomata:
[operations/puppet@production] Blacklist cirrusSearch.* job queue topics from main-eqiad -> jumbo-eqiad

https://gerrit.wikimedia.org/r/436436

Change 436437 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Regex fix for cirrusSearch job queue topic blacklist for Kafka MirrorMaker

https://gerrit.wikimedia.org/r/436437

Change 436437 merged by Ottomata:
[operations/puppet@production] Regex fix for cirrusSearch job queue topic blacklist for Kafka MirrorMaker

https://gerrit.wikimedia.org/r/436437
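(Illustration only: the actual blacklist regex lives in the gerrit change above; the pattern below is a hypothetical stand-in, included just to show how a topic blacklist of this shape is applied to topic names.)

```python
# Sketch only: apply a hypothetical cirrusSearch topic blacklist regex.
import re

BLACKLIST = re.compile(r'^(eqiad|codfw)\.mediawiki\.job\.cirrusSearch.*$')

topics = [
    'eqiad.mediawiki.job.cirrusSearchElasticaWrite',
    'eqiad.mediawiki.job.cirrusSearchLinksUpdate',
    'eqiad.mediawiki.job.refreshLinks',
]

for topic in topics:
    mirrored = not BLACKLIST.match(topic)
    print('%s: %s' % (topic, 'mirrored' if mirrored else 'blacklisted'))
```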

@Ottomata, a theory: we've enabled compression in Event-Platform Value Stream but AFAIK not in MirrorMaker, so the compressed job is < 4 MB while the uncompressed job is 8 MB. When I've tested compression for ElasticaWrite messages, a 50% ratio was not unusual.

Edit: disregard my theory; I can see in puppet that we do use compression in MirrorMaker.
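(For context on that ratio: a minimal sketch of measuring a snappy compression ratio on a synthetic payload, assuming python-snappy is installed. The payload is made up, so the printed ratio is only illustrative; a roughly 50% ratio is what would let a compressed record fit under a 4MB limit while the same record uncompressed exceeds an 8MB produce request.)

```python
# Sketch only: measure snappy compression ratio on a synthetic payload.
# Assumptions: python-snappy is installed; the payload stands in for a
# serialized cirrusSearchElasticaWrite job and is not real job data.
import json
import snappy

payload = json.dumps({'updates': ['some repetitive document text'] * 100000}).encode('utf-8')

compressed = snappy.compress(payload)
ratio = len(compressed) / len(payload)
print('raw=%d bytes, snappy=%d bytes, ratio=%.2f' % (len(payload), len(compressed), ratio))
```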

Oo, good thought. I think you might be right. Will investigate and try today.

Nuria triaged this task as Medium priority.May 31 2018, 4:37 PM
Nuria moved this task from Incoming to Operational Excellence on the Analytics board.
Nuria added a project: Analytics-Kanban.

Ah! Yup, we do use compression = snappy. Just noticed that you re-commented. Hm.


So, we've recently re-enabled job queue topic mirroring between main-eqiad and main-codfw, including this cirrusSearchElasticaWrite topic. I haven't seen any of these errors in those MirrorMaker instances, which means either that there haven't been any of these huge messages since we started mirroring again, or that there is something specific to the main-eqiad -> jumbo-eqiad instance configuration that makes the produce request size too large when producing to jumbo-eqiad.
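(Sketch only: the real MirrorMaker producer settings live in puppet, and the dicts below are hypothetical stand-ins. They only show the kind of side-by-side config diff that would surface a setting, such as compression.type or max.request.size, that differs between the two instances.)

```python
# Sketch only: diff two hypothetical MirrorMaker producer configs.
# Assumptions: key names follow the Kafka producer config; values are made up.
producer_configs = {
    'main-eqiad -> main-codfw': {
        'compression.type': 'snappy',
        'max.request.size': 5500000,
    },
    'main-eqiad -> jumbo-eqiad': {
        'compression.type': 'snappy',
        'max.request.size': 5500000,
    },
}

# Any key whose value differs between the two instances is a candidate
# explanation for why only one of them hits RecordTooLargeException.
keys = set().union(*(cfg.keys() for cfg in producer_configs.values()))
for key in sorted(keys):
    values = {name: cfg.get(key) for name, cfg in producer_configs.items()}
    if len(set(values.values())) > 1:
        print('differs: %s -> %s' % (key, values))
```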

The cirrusSearchElasticaWrite job is relatively rare right now; it mostly gets volume when we are doing maintenance tasks. To tease out whether there are any problems, I ran a pause against the codfw cluster for 10 minutes. Actual pauses during maintenance tasks would probably be longer, but I wanted to check whether increasing the volume to capture a wider range of page updates made any difference.

And I logged the pauses against the wrong cluster; they are (from the SAL log):

2018-07-05:
19:07 ebernhardson: T194678 un-pause cirrussearch writes to codfw
18:55 ebernhardson: T194678 pause cirrussearch writes to codfw to check how kafka+mirrormaker responds

Ottomata lowered the priority of this task from Medium to Low.Aug 15 2018, 5:16 PM
Ottomata removed a project: Analytics-Kanban.
debt claimed this task.