As a maintainer of the search cluster, I want control over the size of the documents sent to Elasticsearch.
As of today there is no explicit limit on the maximum size of a document produced by CirrusSearch. A limit is indirectly imposed by Elasticsearch, where http.max_content_length is set to 100MB, so we can't push anything bigger than that.
We should probably have better control over the doc size rather than rely on this Elasticsearch limit, especially if we want to store these documents somewhere other than Elasticsearch (Kafka).
As of today we don't have a good sense of the size distribution of the documents indexed in the clusters, but @EBernhardson wrote a painless script that could give us a good estimate of what to expect (P32761). Knowing this would help us pick a reasonable limit that does not affect too many documents.
Going forward monitoring the doc size from CirrusSearch might be a good idea so that we can get a sense of the actual usage during index updates.
Implementing the limit might not be entirely trivial given the various fields we index, so we might investigate an approximation that truncates some well-known fields (auxiliary_text, text, source).
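To make the truncation idea concrete, here is a minimal sketch in Python (not CirrusSearch code). It assumes the doc is a flat dict, approximates its size as the byte length of its UTF-8 JSON encoding, and only truncates string values. The names `TRUNCATABLE_FIELDS`, `doc_size`, `truncate_utf8` and `limit_doc_size` are hypothetical and only serve the illustration.

```python
import json

# Hypothetical field list, taken from the fields named in this task.
TRUNCATABLE_FIELDS = ("text", "auxiliary_text", "source")


def doc_size(doc):
    """Approximate the doc size as the byte length of its UTF-8 JSON encoding."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def truncate_utf8(value, max_bytes):
    """Cut a string to at most max_bytes of UTF-8 without splitting a character."""
    return value.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def limit_doc_size(doc, max_bytes):
    """Return a copy of doc with well-known string fields truncated to fit max_bytes.

    Assumption: only string values are truncated; list-valued fields are left
    untouched, so the result can still exceed max_bytes when the remaining
    fields alone are too large.
    """
    result = dict(doc)
    for field in TRUNCATABLE_FIELDS:
        excess = doc_size(result) - max_bytes
        if excess <= 0:
            break
        value = result.get(field)
        if isinstance(value, str):
            # Dropping `excess` raw bytes removes at least that many JSON bytes,
            # since a JSON-escaped character is never shorter than its UTF-8 form.
            keep = max(0, len(value.encode("utf-8")) - excess)
            result[field] = truncate_utf8(value, keep)
    return result
```

Fields are truncated in list order, so the order of `TRUNCATABLE_FIELDS` decides which content is sacrificed first; that ordering is a choice for illustration, not a recommendation.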
AC:
- have a distribution of the approximate doc size in our indices
- monitor the doc size seen during index requests from CirrusSearch and graph a few top percentiles
- add a configurable size limit in CirrusSearch and truncate the doc accordingly
- decide on a limit in coordination with the Data Engineering Team so that we are confident that the content can be stored in Kafka (in anticipation of the search update pipeline rewrite)
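
For the size distribution and percentile items, the summary could be computed along these lines. This is a rough sketch in Python using the nearest-rank method on a list of sampled doc sizes in bytes; `percentile` and `summarize` are hypothetical helpers, and a real setup would more likely rely on the histogram support of whatever metrics system ends up being used.

```python
def percentile(sorted_sizes, p):
    """Nearest-rank percentile for an integer p in 1..100 over a sorted list."""
    if not sorted_sizes:
        raise ValueError("no samples")
    # Integer ceiling of p * n / 100, which avoids floating point rounding issues.
    rank = max(1, (p * len(sorted_sizes) + 99) // 100)
    return sorted_sizes[rank - 1]


def summarize(sizes, percentiles=(50, 95, 99)):
    """Map each requested percentile to the doc size observed at that rank."""
    ordered = sorted(sizes)
    return {f"p{p}": percentile(ordered, p) for p in percentiles}
```

Comparing a candidate limit against the top percentiles gives a rough idea of how many documents it would truncate.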