Collect data and correlate Parsoid memory usage with legacy parser memory usage
Open, Medium, Public

Description

In order to eventually come up with appropriate resource limits for the parsoid parser (T254522), we first need to understand how Parsoid's memory usage scales with various wikitext measures.

The first step is probably to construct various synthetic benchmarks, with text of various sizes, lists of various sizes, tables of various sizes, figures of various sizes, etc, so we can determine where to set resource limits. These would be 'clean' scaling numbers.
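
As a rough illustration of what those 'clean' synthetic inputs could look like, here is a minimal Python sketch. The helper names, size steps, and lorem-ipsum filler are invented for this example and are not part of any existing harness; each generated case would be fed through Parsoid while recording CPU time and peak heap size.

# Sketch of a synthetic-benchmark generator (illustrative only).

def make_text(n_paragraphs):
    """Plain prose, scaled by paragraph count."""
    return "\n\n".join("Lorem ipsum dolor sit amet. " * 40 for _ in range(n_paragraphs))

def make_list(n_items):
    """A flat bulleted list, scaled by item count."""
    return "\n".join(f"* item {i}" for i in range(n_items))

def make_table(n_rows, n_cols=5):
    """A wikitext table, scaled by cell count."""
    rows = ["|-\n" + "\n".join(f"| cell {r},{c}" for c in range(n_cols))
            for r in range(n_rows)]
    return '{| class="wikitable"\n' + "\n".join(rows) + "\n|}"

# One benchmark case per (construct, size) pair.
SIZES = [10, 100, 1_000, 10_000]
CASES = {(kind, n): gen(n)
         for kind, gen in [("text", make_text), ("list", make_list), ("table", make_table)]
         for n in SIZES}

Plotting measured CPU time and heap usage against these sizes for each construct should make the scaling behaviour (linear, quadratic, or worse) visible per construct.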

Alternatively we could start collecting statistics on real-world wikitext, and try to do a "big data" numerical analysis on the 'dirty' data to determine (a) which properties of input wikitext are strong predictors of CPU time usage and memory usage, and (b) what typical ranges of these properties are for existing articles.
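
For the 'dirty' data route, here is a sketch of the kind of numerical analysis meant above, assuming per-page measurements have already been collected into a CSV. The file name and column names (bytes, lists, table_cells, refs, cpu_s, heap_mb) are hypothetical placeholders, and ordinary least squares is just the simplest model to start from.

# Rough sketch: which wikitext properties predict CPU time and heap usage?
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("parsoid_measurements.csv")  # hypothetical collected data

features = ["bytes", "lists", "table_cells", "refs"]
X = sm.add_constant(df[features])

for target in ["cpu_s", "heap_mb"]:
    model = sm.OLS(df[target], X).fit()
    print(target, "R^2 =", round(model.rsquared, 3))
    print(model.params.sort_values(ascending=False))  # (a) strongest predictors

# (b) typical ranges of those properties across existing articles
print(df[features].describe(percentiles=[0.5, 0.9, 0.99, 0.999]))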

The goal is to get some confidence that we can (say) set limits on input wikitext to X total bytes, Y lists, Z table cells, etc, and have Parsoid almost certain to complete processing in less than A CPU seconds and B MB heap size, and that 99.<some number of nines>% of existing wikitext content falls under these limits.
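
Under the same assumption of a collected measurements file, the "X bytes, Y lists, Z table cells" question could then be checked directly: pick candidate limits, measure what fraction of existing pages fits under them, and count how often a conforming page still blows the CPU/heap budget. Every number below is a placeholder.

# Sketch: evaluate a candidate set of limits against collected data.
import pandas as pd

df = pd.read_csv("parsoid_measurements.csv")  # same hypothetical dataset as above

LIMITS = {"bytes": 2_000_000, "lists": 5_000, "table_cells": 50_000}  # placeholders
CPU_BUDGET_S = 10.0
HEAP_BUDGET_MB = 500.0

within = df[list(LIMITS)].le(pd.Series(LIMITS)).all(axis=1)
print(f"{within.mean():.4%} of pages fall under the candidate limits")

# Conforming pages that still exceed the runtime budget.
bad = df[within].query("cpu_s > @CPU_BUDGET_S or heap_mb > @HEAP_BUDGET_MB")
print(f"{len(bad)} conforming pages still exceed the CPU/heap budget")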

Event Timeline

ssastry moved this task from Needs Triage to Performance on the Parsoid board.

From a prior-art perspective, this is the limit report from the current parser for the en.wp main page, squished slightly.

<!-- 
NewPP limit report, Parsed by mw1405, Cached time: 20201031042327, Cache expiry: 3600, Dynamic content: true, Complications: []
CPU time usage: 0.420 seconds, Real time usage: 0.559 seconds
Preprocessor visited node count: 4244/1000000
Post‐expand include size: 114562/2097152 bytes
Template argument size: 8510/2097152 bytes
Highest expansion depth: 21/40
Expensive parser function count: 15/500
Unstrip recursion depth: 0/20, Unstrip post‐expand size: 4875/5000000 bytes
Lua time usage: 0.101/10.000 seconds, Lua memory usage: 2.75 MB/50 MB
Number of Wikibase entities loaded: 0/400
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%  410.449      1 -total
 37.63%  154.471      1 Wikipedia:Main_Page/Tomorrow
 36.21%  148.615      8 Template:Main_page_image
 22.15%   90.896      8 Template:Str_number/trim
 18.85%   77.354     24 Template:If_empty
 18.25%   74.906      2 Template:Main_page_image/TFA
 17.33%   71.142      1 Wikipedia:Today's_featured_article/October_31,_2020
 14.41%   59.151      2 Template:Wikipedia_languages
 13.83%   56.745      1 Wikipedia:Today's_featured_article/November_1,_2020
 12.55%   51.531      1 Wikipedia:Selected_anniversaries/October_31
-->
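
(For what it's worth, the same numbers can be pulled programmatically rather than scraped out of the HTML comment, via the action=parse API with prop=limitreportdata. A small sketch follows; the exact field layout of each entry may differ between MediaWiki versions, so it just prints the raw entries.)

# Sketch: fetch the limit report for a page via the MediaWiki API.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "parse", "page": "Main Page",
            "prop": "limitreportdata", "format": "json"},
    headers={"User-Agent": "parsoid-limit-survey/0.1 (example sketch)"},
).json()

for entry in resp["parse"]["limitreportdata"]:
    print(entry)  # e.g. a dict keyed by the limit-report message name plus its values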

Special:TrackingCategories has a list of tracking categories for pages on-wiki that breach several of those limits (though some limits, like the Lua ones, happen not to have categories). The interesting ones are Expensive parser function calls, Expansion depth exceeded, Node count exceeded, Omitted template arguments, and Template include size exceeded.
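
One quick way to gauge how many pages currently trip each of those limits is to ask the API for the size of the corresponding tracking categories (prop=categoryinfo). The category names below are my guesses at the en.wp names; Special:TrackingCategories itself is the authoritative list on each wiki.

# Sketch: count members of the limit-related tracking categories.
import requests

CATEGORIES = [
    "Category:Pages with too many expensive parser function calls",
    "Category:Pages where expansion depth is exceeded",
    "Category:Pages where node count is exceeded",
    "Category:Pages where template include size is exceeded",
]

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={"action": "query", "titles": "|".join(CATEGORIES),
            "prop": "categoryinfo", "format": "json", "formatversion": 2},
    headers={"User-Agent": "parsoid-limit-survey/0.1 (example sketch)"},
).json()

for page in resp["query"]["pages"]:
    info = page.get("categoryinfo", {})
    print(page["title"], info.get("pages", "n/a"))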

From anecdotal experience, I think the existing limits that our community trips over regularly today, and which we have trouble working with, are the transclusion limits, especially the post-expand include size limit. For example, there are recurring discussions on how best to deal with pathological cases like the Syrian Civil War map, less-pathological ones like Template:Clade (i.e. we need better graphing and mapping support than we have, though that's not in your wheelhouse), and even a discussion right now on en.wp about lifting the limit because of a template about COVID-19 which transcludes 300 references all by itself (the obvious workaround, not transcluding the references, has not been employed for whatever reason :).

Ignoring the pathological case of that template, references are usually the heaviest transclusion content in general on en.wp. Today, the citation templates output the visible page data, the HTML markup, one TemplateStyles sheet (deduplicated currently, of course), and finally a COinS representation of the data (for ease of use in Zotero, among other consumers). When you have N_ref > 500 or so, you start getting into the realm of "this page is big both in content and in referencing", which is a double whammy (and which usually indicates either a topic approaching completion or a chaotic, disorganized topic area; one of the two :^). There are some ways to work around this for 'older' topics, which may rely more on book referencing, using both the workarounds that exist today in templates like [[Template:Sfn]] and the forthcoming book referencing in the Cite extension; but topics in the news can only be dealt with by including a lot of references to news sources, which rarely have much citation overlap. (And sometimes you just have large, complex topics like World War II, which even with a plethora of shortened citations inside the references tag proper has another 250 cite book templates.)

Oftentimes wikis will turn to less-templated options rather than split articles, to 'fit' stuff onto some number of pages. This is categorically something we would prefer not to happen, for obvious reasons. Sometimes this is a shift from templated text to native wikitext (e.g. {{reflist}} versus <references/>; people do still use the former since it has output options unavailable to the native construct), and sometimes it is taking table row templates and turning them into normal wikitext tables, which is almost required on any large table, even though templates would be nicer for cross-article consistency in appearance as well as for ensuring things like proper accessibility.

Wikisource has a use case that doesn't have a good workaround: it relies on transclusion to provide whole-book webpages (such as https://en.wikisource.org/wiki/A_History_of_the_Theories_of_Aether_and_Electricity/FullText ); Wikisource even requested relief on the transclusion limit in a CommTech survey. En.wp has some of those "all" pages too, but for the most part they are little accessed, like WP:VPALL, precisely because the transclusion size limit breaks half or more of the content. By contrast, this has limited the utility of community pages like WP:AFD and [[WP:FAC]], which end up not transcluding all of the currently-open nominations even though those pages are regularly accessed. I guess these limits have shaped other parts of the wiki too, like having complete lists of content be split up alphabetically, i.e. there is no single "list of video games". (There are reasonable arguments for/against that notion.)

Tables are usually the next problem child, partly because of the overlap with the above (sometimes tables are implemented as table row templates, but regardless they almost always have at least one reference per row). Tables are also, I would expect, a heavy structure to deal with both in parsing and in memory use, given their usual length and the quantity of HTML associated with a table (contrast with a list using list markup and some commas). Special:LongPages is majority lists and tables... While we have at least one policy (WP:SIZE) hanging around for "make pages smaller than Arbitrary Self-Imposed Limit", sometimes there's just no reasonable split that meets our page inclusion objectives outlined in WP:NOTABILITY and other policies/guidelines. Of course LongPages ranks pages by wikitext byte count; it would be cool to see "longest pages by HTML output" and/or "heaviest pages including templates, styling, and images", if not on-wiki then in some other, perhaps Toolforge-hosted, tool tracking that on each wiki. (I expect you'll end up generating something interesting in that direction anyway to answer these questions.)
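
As a starting point for that "longest pages by HTML output" idea, here is a rough sketch: take the top of Special:LongPages via list=querypage and compare each page's wikitext length with the size of the HTML that action=parse returns. It is illustrative only; a real Toolforge tool would need batching, rate limiting, and probably Parsoid output rather than legacy-parser output.

# Sketch: wikitext bytes vs. rendered-HTML bytes for the longest pages.
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "parsoid-limit-survey/0.1 (example sketch)"}

longpages = requests.get(API, params={
    "action": "query", "list": "querypage", "qppage": "Longpages",
    "qplimit": 10, "format": "json", "formatversion": 2,
}, headers=HEADERS).json()["query"]["querypage"]["results"]

for row in longpages:
    parsed = requests.get(API, params={
        "action": "parse", "page": row["title"], "prop": "text",
        "format": "json", "formatversion": 2,
    }, headers=HEADERS).json()
    html_bytes = len(parsed["parse"]["text"].encode("utf-8"))
    print(f'{row["title"]}: wikitext {row["value"]} bytes, HTML {html_bytes} bytes')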

Navboxes are kind of both issues, but aren't really addressed here.

I imagine that if we made greater use of Wikidata (e.g. providing a full taxonomy in [[Template:Taxobox]] sourced from Wikidata), expensive parser function calls would increase or trip the threshold; or if someone came along and embedded some limited SQL-like system for accessing Wikidata, lists would get long and hard to access. But those are future needs, I suppose.

A lot of this probably points to "native utilities, implemented in PHP or a C extension and exposed in wikitext form, would help a lot", but that has its own issues, both on the community side (not NIMBY per se, but definitely "we need more utility than what you put in this wikitext you have invented for us"... as well as quickly diminishing returns, since en.wp has the largest pages in general) and on the WMF side (you employ how many people to make software? oh yes, never enough :).

I am certainly not arguing for an increase in that particular limit myself; people will find ways to fill the available space, whether under this limit or another. I mostly expect that these pages are the ones you can look at first to ask "how do we do?" from a real-world perspective, and I would definitely recommend taking them on first, at least for en.wp.

Sorry for the short work on this. :^)