In T360794, we'll be emitting events to Kafka that include rendered page revision HTML and a diff to the rendered parent revision HTML.
In T344688: Increase Max Message Size in Kafka Jumbo, Kafka jumbo-eqiad's max.message.bytes was increased to 10MB to accommodate large raw revision content. We expect rendered HTML to be larger, but we aren't entirely sure how much larger.
@fkaelin did some analysis on HTML sizes for a day's worth of revisions, and found a max HTML size of just under 10MB. The 99th percentile was under 2MB, which indicates that most messages with HTML will fit well under the current 10MB limit.
Fab's analysis only covered a single day of revisions. To get a firmer estimate, we'd like to extend this analysis to include the HTML of the latest revision of every page. To do this, we need a dataset with this data (we aren't going to scrape all pages just for this ;) ).
IIUC, this data should exist in the 'Enterprise HTML dumps'. I don't think we regularly import these into the Data Lake, but I believe that Discovery-Search may have them imported for work on semantic search prototypes.
Done is:
- One-off histogram analysis (with max values) of HTML sizes for all wiki pages.
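
As a rough illustration of what the one-off analysis could compute, here is a minimal sketch in plain Python. It is not the actual job: the function names (`html_size_bytes`, `size_histogram`), the power-of-two bucketing, and reading "10MB" as 10 * 1024 * 1024 bytes are all assumptions made for the example. It also only measures the HTML itself, so a real event (other fields, plus the diff) would be somewhat larger.

```python
import math
from collections import Counter

# Assumption: "10MB" from the task is read as 10 * 1024 * 1024 bytes.
LIMIT_BYTES = 10 * 1024 * 1024


def html_size_bytes(html: str) -> int:
    """Size of the HTML once encoded as UTF-8, i.e. bytes rather than characters."""
    return len(html.encode("utf-8"))


def size_histogram(sizes, limit_bytes=LIMIT_BYTES):
    """Summarize sizes into power-of-two buckets, plus max, p99 and over-limit count."""
    sizes = sorted(sizes)
    if not sizes:
        raise ValueError("no sizes given")

    buckets = Counter()
    for s in sizes:
        # Bucket key is the exclusive upper bound: the smallest power of two > s.
        buckets[1 << s.bit_length()] += 1

    # Nearest-rank 99th percentile.
    rank = math.ceil(0.99 * len(sizes))

    return {
        "count": len(sizes),
        "max": sizes[-1],
        "p99": sizes[rank - 1],
        "over_limit": sum(1 for s in sizes if s > limit_bytes),
        "buckets": dict(sorted(buckets.items())),
    }
```

For the full dataset the same summary would more likely be computed in Spark or SQL over whatever table holds the HTML; the sketch is only meant to pin down which numbers we want out of the analysis (bucket counts, max, p99, and how many pages exceed the current limit).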