
Reader quickly reads a wiki page
Closed, Resolved · Public

Description

"As a Reader of a Wikimedia site, I want to quickly get the HTML representation of a common page, and not have to wait for a parse to happen because the parser cache is full of a lot of other data."

I had a hard time formulating this as a user story. Roughly: @cscott estimates that Parsoid output for a page will be about 3x the size of the default output, counting the metadata blobs. If we keep the legacy output alongside it, we might hold 4x the data in the cache for a single page. My completely uneducated guess is that this would hurt cache performance.

Event Timeline

I don't think Parsoid output for a page is 3x. The HTML blob is roughly the same size as the current parser's HTML; data-parsoid and data-mw can, in the worst case, double or triple that, but that really is the worst case. We can fairly easily gather stats (we may even have them somewhere in an old task). For planning purposes right now, though, you could assume roughly 2x across all blobs.
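To make the competing estimates above concrete, here is a back-of-envelope sketch of per-page cache footprint relative to legacy parser HTML = 1.0. All function and parameter names are illustrative, not any actual MediaWiki API; the blob multipliers are just the figures discussed in this thread.

```python
def cache_footprint(html=1.0, data_parsoid=0.0, data_mw=0.0, keep_legacy=True):
    """Estimated cache size per page, relative to legacy parser HTML = 1.0.

    Parsoid output is the HTML blob plus the data-parsoid and data-mw
    metadata blobs; optionally the legacy parser output is cached too.
    """
    parsoid = html + data_parsoid + data_mw
    return parsoid + (1.0 if keep_legacy else 0.0)

# Worst case from the thread: metadata blobs double or triple the HTML,
# so Parsoid output is ~3x, and keeping legacy output alongside gives ~4x.
print(cache_footprint(html=1.0, data_parsoid=1.0, data_mw=1.0))  # 4.0

# Planning assumption suggested above: ~2x across all blobs, ~3x total.
print(cache_footprint(html=1.0, data_parsoid=0.5, data_mw=0.5))  # 3.0
```

The point of the sketch is just that the headline multiplier depends on both the metadata-blob sizes and whether legacy output is retained, which is why instrumenting Parsoid for real numbers matters.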

I imagine MPC (the acronym I've used for your multi-parser-cache proposal) is just an interface (much like RESTBase is), but backed by different storage components, so you could very well use existing storage if it helps. Anyway, I don't need to wade into those specifics right now. :-) I just wanted to clarify that the 3x estimate (not necessarily yours) is probably too high. And, if required for planning, we can readily collect more accurate stats by instrumenting Parsoid, or maybe @Pchelolo can get them from Cassandra storage.
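A minimal sketch of the "one interface, multiple storage components" idea described above, assuming a simple get/set cache contract. Every class and method name here is hypothetical, purely to illustrate the shape; it is not MediaWiki's actual ParserCache API.

```python
from abc import ABC, abstractmethod


class ParserCacheBackend(ABC):
    """Hypothetical storage component behind the single MPC interface."""

    @abstractmethod
    def get(self, key: str):
        ...

    @abstractmethod
    def set(self, key: str, value: str) -> None:
        ...


class InMemoryBackend(ParserCacheBackend):
    """Toy backend; a real deployment could use SQL, Cassandra, etc."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


class MultiParserCache:
    """One interface routing each parser's output to its own backend."""

    def __init__(self, backends):
        self._backends = backends

    def get(self, parser: str, page: str):
        return self._backends[parser].get(page)

    def set(self, parser: str, page: str, html: str) -> None:
        self._backends[parser].set(page, html)


mpc = MultiParserCache({"legacy": InMemoryBackend(), "parsoid": InMemoryBackend()})
mpc.set("parsoid", "Main_Page", "<html>...</html>")
print(mpc.get("parsoid", "Main_Page"))
```

The design point is that callers see one cache regardless of which storage component holds which parser's output, which is why existing storage could be reused per backend.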

Is this task about capacity planning for ParserCache once we incorporate Parsoid output there? What does it have to do with a reader reading a page quickly? Can we rename/rewrite the task to accurately represent what it is about, or do you need it phrased this way for some process? In that case I'd create a separate task for capacity planning.