After we started posting job events to Event-Platform and Kafka, we noticed that some of them were rejected by Kafka with a MESSAGE_SIZE_TOO_LARGE error. The limit was increased from 1 MB to 4 MB, but even after the increase some events are still occasionally rejected. Digging deeper, I found that some of the jobs are enormously large.
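For reference, Kafka's default broker-side cap is roughly 1 MB, which matches the original limit we hit. Raising it to 4 MB presumably means settings along these lines (a sketch of the relevant knobs, not our actual config; the broker, producer, and consumer each have their own limit that must agree):

```properties
# Broker (server.properties): default message.max.bytes is ~1 MB.
message.max.bytes=4194304

# Producer config: the producer must be allowed to send requests this large.
max.request.size=4194304

# Consumer config: fetch size per partition must also accommodate 4 MB messages.
max.partition.fetch.bytes=4194304
```

Note that these live in different places (broker config vs. producer/consumer client configs), and a per-topic override (`max.message.bytes`) can be used instead of the broker-wide setting.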
Here's an example that was originally 17 MB in size; I've shortened it:
{
  "meta": {
    "domain": "commons.wikimedia.org",
    "uri": "https://commons.wikimedia.org/wiki/Special:Badtitle/JobSpecification",
    "topic": "mediawiki.job.wikibase-InjectRCRecords",
    "request_id": "faa2e213-fc18-4bd0-9ef1-b716360260a6",
    "schema_uri": "mediawiki/job/1",
    "dt": "2017-09-07T21:59:38+00:00",
    "id": "d80f03e5-9417-11e7-9459-141877615224"
  },
  "page_title": "Special:Badtitle/JobSpecification",
  "database": "commonswiki",
  "params": {
    "pages": {
      "17303097": [6, "'A_Missionary_Preaching_to_the_Natives,_under_a_Skreen_of_platted_Cocoa-nut_leaves_at_Kairua'_by_William_Ellis.jpg"],
      "14883442": [6, "PL_J\\u00f3zef_Ignacy_Kraszewski-Lubonie_tom_II_077.jpeg"],
      "2653965": [6, "Meyers_b9_s0043.jpg"],
      (..a few million further page entries...)
    },
    "change": {
      "info": "...",
      "user_id": "194202",
      "object_id": "Q36180",
      "time": "20170907215252",
      "revision_id": "553663638",
      "type": "wikibase-item~update",
      "id": 550757876
    }
  },
  "type": "wikibase-InjectRCRecords",
  "page_namespace": -1
}
There's another example that is 44 MB in size when serialized. Kafka is capable of handling that, but it does not deal well with very large messages, so we can't keep raising the cap indefinitely. Maybe there's something we could do on the Wikidata side to reduce the size of these jobs?
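One option, sketched below, would be to split an oversized job into several smaller jobs before enqueueing, chunking the huge `params.pages` map so that each chunk serializes under the message cap. This is only an illustration of the idea; `split_job` is a hypothetical helper, the size accounting is a conservative JSON-length estimate, and the real fix would live in the Wikibase job-creation code:

```python
import json


def split_job(job, max_bytes=4 * 1024 * 1024):
    """Split an InjectRCRecords-style job (shape as in the example above)
    into smaller jobs whose serialized size stays under max_bytes.

    Hypothetical helper for illustration only.
    """
    pages = job["params"]["pages"]
    # Everything except the huge "pages" map is copied into each chunk.
    base = {k: v for k, v in job.items() if k != "params"}
    base_params = {k: v for k, v in job["params"].items() if k != "pages"}
    # Fixed serialization cost of a job with an empty "pages" map.
    overhead = len(json.dumps({**base, "params": {**base_params, "pages": {}}}))

    chunks, current, current_size = [], {}, 0
    for page_id, title in pages.items():
        # Conservative estimate: singleton-dict length plus a separator.
        entry_size = len(json.dumps({page_id: title})) + 1
        if current and overhead + current_size + entry_size > max_bytes:
            chunks.append(current)
            current, current_size = {}, 0
        current[page_id] = title
        current_size += entry_size
    if current:
        chunks.append(current)

    return [{**base, "params": {**base_params, "pages": c}} for c in chunks]
```

The trade-off is more (but bounded-size) messages per change; downstream consumers would see the same page set spread across several jobs, which should be fine for RC-record injection since each page entry is processed independently.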