Address potential impact of Kiwix crawling on ParserCache before MCS decommissioning
Open, Needs Triage, Public

Assigned To

None

Authored By

	ssastry
	Jun 16 2023, 1:55 PM

Description

NOTE: In the discussion below, we ignore the Commons and Wikidata wikis since they aren't crawled by Kiwix.

Kiwix crawls all pages on a wiki with some periodicity (1 week, many weeks?) as part of preparing offline versions of Wikipedias (and other wikis). If these Kiwix requests filter down to an API that is backed by the ParserCache, they could inadvertently stuff infrequently accessed pages into ParserCache and potentially cause it to run out of space, if the collective ParserCache disk space is not big enough to hold the entire content of all wikis that Kiwix crawls.

Thus far, Kiwix has been hitting MCS for Parsoid content, and the core API to get additional missing metadata from the legacy parser content. MCS requests filtered down to RESTBase and were served from RESTBase storage. But the metadata requests to the core API would likely have caused those pages to get stuffed into the ParserCache. Assuming this is correct, Kiwix has been filling ParserCache with all pages of a wiki for a while, and somehow this hasn't caused ParserCache storage to blow up so far.

But MCS is on its way to being decommissioned. At that time, Kiwix will switch to calling the REST API to get Parsoid content. This can cause Parsoid HTML for all pages of a wiki to get stuffed into ParserCache in addition to the legacy HTML version, which could make things worse.
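The capacity concern above is simple arithmetic, and a rough sketch may make it concrete. Every number below is made up for illustration (page counts, average HTML sizes, and the cache capacity are not measurements of any real wiki or of the production ParserCache), and the function names are hypothetical.

```python
def crawl_footprint_bytes(page_count, avg_legacy_html_bytes, avg_parsoid_html_bytes=0):
    """Bytes needed to cache one rendering per flavor for every page of a wiki."""
    return page_count * (avg_legacy_html_bytes + avg_parsoid_html_bytes)


def fits_in_cache(wikis, capacity_bytes, include_parsoid):
    """Return (total_bytes, fits) for a full crawl of the given wikis."""
    total = 0
    for wiki in wikis:
        parsoid = wiki["avg_parsoid_html_bytes"] if include_parsoid else 0
        total += crawl_footprint_bytes(
            wiki["pages"], wiki["avg_legacy_html_bytes"], parsoid
        )
    return total, total <= capacity_bytes


# Hypothetical inputs, chosen only to show the effect of adding Parsoid HTML.
EXAMPLE_WIKIS = [
    {"pages": 1_000_000, "avg_legacy_html_bytes": 50_000, "avg_parsoid_html_bytes": 80_000},
    {"pages": 200_000, "avg_legacy_html_bytes": 30_000, "avg_parsoid_html_bytes": 45_000},
]
EXAMPLE_CAPACITY = 100 * 10**9  # hypothetical 100 GB of cache storage

# Today (assumed): only legacy HTML is cached as a side effect of the crawl.
legacy_only = fits_in_cache(EXAMPLE_WIKIS, EXAMPLE_CAPACITY, include_parsoid=False)
# After the switch (assumed): Parsoid HTML is cached on top of legacy HTML.
with_parsoid = fits_in_cache(EXAMPLE_WIKIS, EXAMPLE_CAPACITY, include_parsoid=True)

print(legacy_only)   # (56000000000, True)
print(with_parsoid)  # (145000000000, False)
```

With these invented inputs, the legacy-only crawl fits and the combined crawl does not. Answering the question for production would need the real page counts, average output sizes, and ParserCache capacity.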

So, we need a few things to happen here:

Is it true that legacy HTML for all pages of a wiki is currently getting stuffed into ParserCache? If so, is it the case that total ParserCache capacity exceeds the aggregate HTML size of pages on all wikis?
If the above is true, then, before MCS is decommissioned, we may want to add an option to the REST API with which clients can ask that the result of a request not be cached.
Once that is done, Kiwix will need to update its code to pass in this new option.
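
As a rough illustration of the proposed opt-out, here is a minimal toy model, not MediaWiki's ParserCache or any existing REST API behavior. The `RenderCache` class, the `use_cache` flag, and `fake_render` are hypothetical names. The sketch also assumes that opting out only skips cache writes and that existing entries are still served.

```python
class RenderCache:
    """Toy stand-in for a parser-output cache keyed by (page, flavor)."""

    def __init__(self):
        self._entries = {}
        self.renders = 0  # how many times a page actually had to be rendered

    def get_html(self, page, flavor, render, use_cache=True):
        key = (page, flavor)
        if key in self._entries:
            # Assumption: cache hits are served even to opt-out callers.
            return self._entries[key]
        html = render(page, flavor)
        self.renders += 1
        if use_cache:
            # Only callers that did not opt out populate the cache.
            self._entries[key] = html
        return html

    def __len__(self):
        return len(self._entries)


def fake_render(page, flavor):
    """Hypothetical renderer; a real one would invoke the parser."""
    return f"<html data-flavor='{flavor}'>{page}</html>"


cache = RenderCache()

# A regular reader views one popular page; its output is cached.
cache.get_html("Main_Page", "parsoid", fake_render)

# A crawler walks every page but opts out of cache writes.
for title in ["Main_Page", "Rare_Page_1", "Rare_Page_2"]:
    cache.get_html(title, "parsoid", fake_render, use_cache=False)

print(len(cache))     # 1
print(cache.renders)  # 3
```

In this model the crawl still benefits from entries that regular traffic already cached, but rarely viewed pages never occupy cache space because of the crawl alone.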

Related Objects

Mentioned Here: T324866: large amount of traffic to the action=parse API from MWOffliner

Event Timeline

ssastry created this task. Jun 16 2023, 1:55 PM

Restricted Application added a subscriber: Aklapper. Jun 16 2023, 1:55 PM

ihurbain renamed this task from Address potential impact of Kiwis crawling on ParserCache before MCS decommissioning to Address potential impact of Kiwix crawling on ParserCache before MCS decommissioning. Jun 16 2023, 2:07 PM

RhinosF1 subscribed. Jun 16 2023, 2:22 PM

Ladsgroup added a subscriber: Joe. Jun 16 2023, 2:32 PM

MSantos added a project: Content-Transform-Team-WIP. Jun 22 2023, 2:22 PM

MSantos moved this task from Needs Triage to Performance on the Parsoid board.

MSantos added a project: RESTBase Sunsetting. Jun 23 2023, 1:39 PM

MSantos moved this task from Unsorted to Infrastructure Pile on the RESTBase Sunsetting board. Aug 18 2023, 3:34 PM

MSantos added a project: affects-Kiwix-and-openZIM.

@ssastry AFAIK what is described seems correct to me. @vadim-kovalenko and @cscott would certainly be able to tell more, as they both know MWoffliner and the Wikimedia backend infra.

Not sure this is relevant, but https://phabricator.wikimedia.org/T324866 might be considered as well.

Kelson moved this task from TRIAGE to TOP on the affects-Kiwix-and-openZIM board. Aug 27 2023, 3:26 AM

MSantos removed a project: Content-Transform-Team-WIP. Oct 2 2023, 2:46 PM
