Kiwix crawls all pages on a wiki with some periodicity (1 week, many weeks?) as part of preparing offline versions of Wikipedias (and other wikis). If these Kiwix requests filter down to an API that is backed by the ParserCache, this could have the inadvertent effect of stuffing infrequently accessed pages into ParserCache, potentially causing it to run out of space if the collective ParserCache disk space is not big enough to hold the entire content of all wikis that Kiwix crawls.
Thus far, Kiwix has been hitting MCS for Parsoid content, and the core API for additional metadata missing from the legacy parser content. MCS requests filtered down to RESTBase and were served from RESTBase storage. But the metadata requests to the core API would likely have caused those pages to get stuffed into the ParserCache. Assuming this is correct, then Kiwix has been filling ParserCache with all pages of a wiki for a while, and somehow this hasn't caused ParserCache storage to blow up so far.
But MCS is on its way to being decommissioned. At that point, Kiwix will switch to calling the REST API for Parsoid content. That could cause Parsoid HTML for all pages of a wiki to get stuffed into ParserCache in addition to the legacy HTML version, which could make things worse.
So, we need a few things to happen here:
- Verify: is it true that currently, legacy HTML for all pages of a wiki is getting stuffed into ParserCache? If so, is it the case that total ParserCache capacity exceeds the aggregate HTML size of pages on all wikis?
- If the above is true, then, before MCS is decommissioned, we may want to add an option to the REST API that lets clients request that the rendered response not be written to ParserCache.
- Once that is done, Kiwix will need to update its code to pass in this new option.
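To make the last two steps concrete, here is a minimal sketch of what the client-side opt-out could look like, assuming a hypothetical `no-cache` query parameter on the page HTML REST endpoint. The parameter name, its semantics, and the exact endpoint shape are assumptions for illustration; the real design would be settled during implementation.

```python
from urllib.parse import urlencode, quote

def parsoid_html_url(wiki_host, title, skip_parser_cache=True):
    """Build a REST API URL for fetching Parsoid HTML for a page.

    `no-cache` is an assumed parameter name used for illustration only:
    the idea is that a crawler passes it so the server skips writing
    the rendered output into ParserCache.
    """
    base = f"https://{wiki_host}/w/rest.php/v1/page/{quote(title, safe='')}/html"
    if skip_parser_cache:
        return base + "?" + urlencode({"no-cache": "true"})
    return base

# A bulk crawler like Kiwix could then fetch every page without
# polluting ParserCache:
url = parsoid_html_url("en.wikipedia.org", "Main Page")
```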