
Analyze size distribution of wiki page html
Open, Needs Triage, Public

Description

In T360794, we'll be emitting events to Kafka that include the rendered page revision HTML and a diff against the rendered parent revision HTML.

In T344688: Increase Max Message Size in Kafka Jumbo, Kafka jumbo-eqiad's max.message.bytes was increased to 10MB to accommodate large raw revision content. We expect rendered HTML to be larger, but we aren't entirely sure how much larger.

@fkaelin did some analysis on HTML sizes for a day's worth of revisions, and found a max HTML size of just under 10MB. The 99th percentile was under 2MB, which indicates that most messages with HTML will fit well under the current 10MB limit.

Fab's analysis covered only a day of revisions. To get a firmer estimate, we'd like to extend this analysis to cover the latest revision HTML of all pages. To do this, we need a dataset with this data (we aren't going to scrape all pages just for this ;) ).
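As a rough sketch of the kind of one-off summary intended here (the helper is illustrative, and it assumes per-page HTML byte counts have already been extracted from a dump into a plain iterable -- it is not an existing script):

```python
import math

def size_summary(sizes, percentiles=(50, 90, 99, 99.9)):
    """Max and nearest-rank percentiles of an iterable of byte counts."""
    ordered = sorted(sizes)
    if not ordered:
        raise ValueError("no sizes given")
    n = len(ordered)
    summary = {"max": ordered[-1]}
    for p in percentiles:
        # nearest-rank percentile: smallest value with at least p% of the data at or below it
        rank = max(1, math.ceil(p / 100 * n))
        summary[f"p{p}"] = ordered[rank - 1]
    return summary
```

In practice this would run over the dump-derived dataset per wiki; the nearest-rank definition keeps the p99.9 and max values exact rather than interpolated.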

IIUC, this data should exist in the 'Enterprise HTML dumps'. I don't think we regularly import these into the Data Lake, but I believe that Discovery-Search may have them imported for work on semantic search prototypes.

Done is

  • One off histogram analysis (with max values) of all wiki page HTML page sizes.
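A minimal sketch of the histogram side of that deliverable, assuming exponential bucket edges similar in spirit to the ones on the Grafana dashboard (both the function and the edge values are illustrative assumptions, not an existing tool):

```python
import bisect

def histogram(sizes, edges):
    """Count sizes into buckets (-inf, e0], (e0, e1], ..., (e_last, inf)."""
    counts = [0] * (len(edges) + 1)
    for size in sizes:
        counts[bisect.bisect_left(edges, size)] += 1
    return counts

# Illustrative roughly-exponential bucket edges in bytes, ~2.2kB up to 10MB
EDGES = [2_200, 6_600, 20_000, 60_000, 180_000, 540_000, 1_600_000, 5_000_000, 10_000_000]
```

The final overflow bucket is what matters most here: anything landing in it would not fit under the current 10MB Kafka limit.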

Event Timeline

@fkaelin recently did some more analysis on 3 months of revision HTML -- details in Slack. This analysis confirmed that the 99.9th percentile is still small enough, around 2.3MB.

We'd also like maximum values for all pages on at least one large wiki, though all (wikitext) wikis would be better.

From @ssastry in slack:

the rule of thumb from what I remember from long back is: parsoid html is probably 7x the size of input wt ... and given that 2mb is the wt cap, 14mb html is probably a good ballpark estimate for the upper limit, but that said, there are pages (not in the article namespace) out there that can generate large blobs of html. Plus, this metric might be different for wikis whose languages have multi-byte character encodings, unlike say english.
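The arithmetic behind that ballpark, just restating the quote's numbers (the constants are the rule-of-thumb values, not measurements):

```python
WIKITEXT_CAP_MB = 2   # wikitext size cap, per the quote above
HTML_EXPANSION = 7    # rule of thumb: Parsoid HTML is ~7x the input wikitext

upper_bound_mb = WIKITEXT_CAP_MB * HTML_EXPANSION
print(upper_bound_mb)  # 14, i.e. the ~14MB ballpark upper limit
```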

I added some graphs to https://grafana.wikimedia.org/dashboard/snapshot/xzDtG1FZnWCRRaFBQKw2oxpmzK8EUZec -- the tl;dr is a p50 on output size of ~38kB, a p90 of 270kB, and a p99 of 977kB. There are 9 buckets from 2.2kB up to 1MB, so those match the outputs reasonably well, although the upper bound is probably not terribly reliable given how close it is to our top bucket limit.

Change #1254205 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Increase the kafka-jumbo maximum message size to 20MB

https://gerrit.wikimedia.org/r/1254205
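For reference, the broker-side knobs such a bump touches look roughly like this (a sketch using Kafka's standard broker property names; the real change lives in the puppet patch above, and the exact byte values here are an assumption):

```
# server.properties (sketch, not the actual puppet-managed config)
message.max.bytes=20971520        # ~20 MiB broker-wide cap, up from ~10 MiB
replica.fetch.max.bytes=20971520  # must be >= message.max.bytes so replicas can copy large messages
```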