
Address potential impact of Kiwix crawling on ParserCache before MCS decommissioning
Open, Needs Triage, Public

Description

NOTE: The discussion below ignores Commons and Wikidata, since Kiwix does not crawl them.

Kiwix crawls all pages on a wiki with some periodicity (1 week, many weeks?) as part of preparing offline versions of Wikipedias (and other wikis). If these Kiwix requests filter down to an API that is backed by the ParserCache, this could have the inadvertent effect of stuffing infrequently accessed pages into ParserCache. That could cause it to run out of space if the collective ParserCache disk space is not big enough to hold the entire content of all wikis that Kiwix crawls.

Thus far, Kiwix has been hitting MCS for Parsoid content, and the core API for additional metadata that is only available from the legacy parser output. MCS requests filtered down to RESTBase and were served from RESTBase storage. But the metadata requests to the core API would likely have caused those pages to get stuffed into the ParserCache. Assuming this is correct, Kiwix has been filling ParserCache with all pages of a wiki for a while, and somehow this hasn't caused ParserCache storage to blow up so far.

But MCS is on its way to being decommissioned. At that time, Kiwix will switch to calling the REST API to get Parsoid content. This can cause Parsoid HTML for all pages of a wiki to get stuffed into ParserCache in addition to the legacy HTML version, which could potentially make things worse.
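The capacity concern can be illustrated with a back-of-the-envelope sketch. Every number below (page counts, average HTML size, cache capacity) is a made-up placeholder, not a measurement. The sketch also assumes, purely for simplicity, that a Parsoid HTML entry is the same size as a legacy HTML entry.

```python
def fits_in_cache(page_counts, avg_html_bytes, capacity_bytes, variants=1):
    """Estimate the space needed to cache every page of every crawled wiki.

    variants is the number of HTML flavours stored per page
    (1 = legacy only, 2 = legacy + Parsoid).
    Returns (estimated_bytes, fits).
    """
    total_pages = sum(page_counts.values())
    estimated = total_pages * avg_html_bytes * variants
    return estimated, estimated <= capacity_bytes


# Placeholder inputs: hypothetical wikis and sizes, NOT real figures.
page_counts = {"wiki_a": 1_000_000, "wiki_b": 250_000}
avg_html_bytes = 50_000           # assumed average size of one cached entry
capacity_bytes = 100 * 10**9      # assumed total ParserCache capacity

legacy_only = fits_in_cache(page_counts, avg_html_bytes, capacity_bytes, variants=1)
legacy_and_parsoid = fits_in_cache(page_counts, avg_html_bytes, capacity_bytes, variants=2)

print(legacy_only)         # (62500000000, True)
print(legacy_and_parsoid)  # (125000000000, False)
```

With these placeholder inputs, a cache that comfortably holds the legacy HTML for every page no longer fits once a second HTML flavour is stored per page. Whether the real deployment is in this regime depends on the actual sizes and capacity.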

So, we need a few things to happen here:

  • Is it true that legacy HTML for all pages of a wiki is currently getting stuffed into ParserCache? If so, is it the case that total ParserCache capacity exceeds the aggregate HTML size of pages on all wikis?
  • If the above is true, then, before MCS is decommissioned, we may want to add an option to the REST API that lets clients request that the result not be cached.
  • Once that is done, Kiwix will need to update its code to pass in this new option.
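As a rough illustration of the last two points, here is a sketch of what the client side might look like. The option does not exist yet, so the parameter name `skip_cache_write`, the base URL, and the endpoint shape are all hypothetical placeholders chosen for this example. The real design could just as well be a header or something else entirely.

```python
from urllib.parse import quote, urlencode

# HYPOTHETICAL: no such option exists today; the name is a placeholder.
HYPOTHETICAL_NO_CACHE_PARAM = "skip_cache_write"


def build_html_url(base_url, title, skip_cache_write=False):
    """Build a page-HTML request URL, optionally asking the server not to cache.

    base_url and the /page/{title}/html path are assumptions for illustration.
    """
    url = f"{base_url}/page/{quote(title, safe='')}/html"
    if skip_cache_write:
        url += "?" + urlencode({HYPOTHETICAL_NO_CACHE_PARAM: "1"})
    return url


# A bulk crawler would opt out of caching; ordinary clients would not.
crawler_url = build_html_url("https://wiki.example/rest/v1", "Main Page", skip_cache_write=True)
reader_url = build_html_url("https://wiki.example/rest/v1", "Main Page")
```

Making the option opt-in keeps the default behaviour unchanged for ordinary clients, so only bulk crawlers need a code change.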

Event Timeline

ihurbain renamed this task from Address potential impact of Kiwis crawling on ParserCache before MCS decommissioning to Address potential impact of Kiwix crawling on ParserCache before MCS decommissioning. Jun 16 2023, 2:07 PM

@ssastry AFAIK what is described seems correct to me. @vadim-kovalenko and @cscott would certainly be able to tell more, as they both know MWoffliner and the Wikimedia backend infra.

Not sure this is relevant, but https://phabricator.wikimedia.org/T324866 might be considered as well.