Per the new data collection we put together in December, commonswiki_file has grown by 40% in the last 30 days. We need to understand the source of this growth so we can plan appropriately.
Summary of findings
The size of the search index has been growing rapidly -- 55% between 26 Nov 2020 and 12 Jan 2021 -- beyond our capacity to store this data on current hardware. Left unaddressed, the index will continue to grow and degrade search functionality: search services go intermittently offline, p95 latency rises above 2s, and we reject hundreds of requests per second. We are already at capacity (all of the above happened months ago), and the Search team is spending significant time freeing up storage to absorb the rapid index growth, which means we cannot focus on new features or other bug fixes.
A large number of PDF documents -- especially those made available during Public Domain Day -- are being uploaded to Commons with OCR text. Search indexing currently ingests all of this associated text, which is driving the rapid growth. Besides introducing a massive volume of file text, OCR is imperfect and often mis-identifies characters, so nontrivial amounts of junk text end up indexed. We suspect both factors degrade search quality and create performance issues given current storage constraints.
Plan of action:
Place a default 50 KB limit on the amount of file text (including OCR text, but excluding metadata and wikitext) that is indexed for search. A sketch of what this could look like follows.
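
For illustration only, here is a minimal sketch of how such a limit could be applied before extracted file text reaches the indexer. The function name, the 50 KB constant, and the UTF-8 boundary handling are assumptions for this example, not the production indexing code.

```python
# Hypothetical sketch: cap extracted file text at a byte budget before
# indexing. Metadata and wikitext would be indexed separately and are
# not subject to this limit.

MAX_FILE_TEXT_BYTES = 50 * 1024  # assumed default 50 KB limit


def truncate_file_text(text: str, limit: int = MAX_FILE_TEXT_BYTES) -> str:
    """Return at most `limit` bytes of `text`, cut on a UTF-8 boundary."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # Decoding with errors="ignore" drops any multi-byte character that
    # the byte cut split in half, so we never emit invalid UTF-8.
    return encoded[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    ocr_text = "page 1 text ... " * 10_000  # stand-in for a large OCR dump
    indexed = truncate_file_text(ocr_text)
    print(len(indexed.encode("utf-8")))  # <= 51200 bytes
```

Truncating by bytes rather than characters keeps the cap aligned with what actually matters here (index storage), at the cost of occasionally cutting mid-sentence; for OCR text, which is already noisy, that trade-off seems acceptable.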