
Raise limit of $wgMaxArticleSize for Hebrew Wikisource
Open, Medium, Public

Description

The maximum article size (a.k.a. post-expand include size) is set to 2048 KB. This limit is configured by the $wgMaxArticleSize variable. We ask to raise the limit to 4096 KB for the Hebrew Wikisource. We have already hit the limit with two heavily accessed pages: the Income Tax Ordinance and the Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970. Those pages are rendered incorrectly because of the limit. Other pages, such as the Transportation Regulations, 5721–1961, are expected to hit the limit in the near future.

Breaking the legal text into separate sections is not considered a valid solution. Also note that Hebrew characters take two bytes per character in UTF-8, whereas Latin characters take one byte per character, so the effective limit for Hebrew text is half that of Latin text of the same length.
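For reference, a rough sketch of what such a per-wiki override could look like in Wikimedia's wmf-config (the values and the 'hewikisource' key are taken from this request; the snippet is purely illustrative, not a prepared patch):

```
// InitialiseSettings.php (wmf-config) — illustrative sketch only, not a patch.
// $wgMaxArticleSize is expressed in kilobytes.
'wgMaxArticleSize' => [
	'default' => 2048,       // current global default (2048 KB)
	'hewikisource' => 4096,  // value requested for Hebrew Wikisource
],
```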

Event Timeline

Tagging Performance-Team and SRE as this potentially has performance impacts (longer pages potentially take longer to parse, etc.) and could mean reaching request timeout limits.

Other teams, like the Parsing team and the DBAs (among others), might have an interest too.

While I understand the reason for the request, on other projects (like enwiki, etc.) splitting pages up has generally been the way forward.

I'm not saying "this cannot be done", more that it needs a bit of discussion and input from other teams before doing it. Maybe some appropriate logging/monitoring should be put in place too.

So for anyone coming here to make (or deploy) a patch, please don't do so (certainly not the deploying) until it has been discussed and approved by the relevant parties.

The maximum article size (a.k.a. post-expand include size) is set to 2048 KB.

$wgMaxArticleSize (× 1024, to get bytes) is the limit applied to the output of strlen() on the raw page content.

It is also used for 'maxIncludeSize' (again × 1024 for bytes), which becomes the "Post-expand include size" shown in the NewPP limit report too.

I note it is slightly odd that they're both the same... A page that is mostly raw text (close to the limit) but with a couple of (even simple) templates will then potentially be cut off or rendered incorrectly too.
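A simplified paraphrase of the two places the value is used (not the exact core code, just the shape of it):

```
// Simplified paraphrase of how core applies $wgMaxArticleSize.

// 1. On save: the raw wikitext length is compared against the limit in kilobytes.
$contentLength = strlen( $wikitext );  // bytes, not characters
if ( $contentLength > $wgMaxArticleSize * 1024 ) {
	// EditPage refuses the save with a "page too big" error
}

// 2. On parse: the same value (in bytes) caps the post-expand include size,
//    i.e. the cumulative size of expanded templates reported in the NewPP report.
$parserOptions->setMaxIncludeSize( $wgMaxArticleSize * 1024 );
```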

https://he.wikisource.org/w/index.php?title=%D7%A4%D7%A7%D7%95%D7%93%D7%AA_%D7%9E%D7%A1_%D7%94%D7%9B%D7%A0%D7%A1%D7%94&action=info

Page length (in bytes): 1,448,087

But also

<!--
NewPP limit report
Parsed by mw1366
Cached time: 20210220034915
Cache expiry: 2592000
Dynamic content: false
Complications: []
CPU time usage: 11.855 seconds
Real time usage: 12.048 seconds
Preprocessor visited node count: 256994/1000000
Post‐expand include size: 2095966/2097152 bytes
Template argument size: 736332/2097152 bytes
Highest expansion depth: 10/40
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 757/5000000 bytes
Lua time usage: 4.244/10.000 seconds
Lua memory usage: 1688902/52428800 bytes
Lua Profile:
    recursiveClone <mwInit.lua:41>                                  2220 ms       50.7%
    (for generator)                                                  580 ms       13.2%
    Scribunto_LuaSandboxCallback::getExpandedArgument                540 ms       12.3%
    type                                                             420 ms        9.6%
    Scribunto_LuaSandboxCallback::gsub                               240 ms        5.5%
    <mwInit.lua:41>                                                  100 ms        2.3%
    ?                                                                 60 ms        1.4%
    getExpandedArgument <mw.lua:165>                                  60 ms        1.4%
    chunk <יחידה:String>                                         40 ms        0.9%
    tostring                                                          40 ms        0.9%
    [others]                                                          80 ms        1.8%
Number of Wikibase entities loaded: 0/400
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00% 9274.816      1 -total
 40.32% 3739.270   3616 תבנית:ח:ת+
 34.08% 3161.148   3833 תבנית:ח:צמצום
 19.58% 1815.624   1691 תבנית:ח:תת
 17.29% 1603.391   1948 תבנית:ח:פנימי
 14.50% 1344.959   1166 תבנית:ח:תתת
 13.66% 1266.808    730 תבנית:ח:חיצוני
  9.19%  852.182    556 תבנית:ח:סעיף
  6.46%  598.916    503 תבנית:ח:תתתת
  3.26%  302.772    463 תבנית:ח:הערה
-->

Side note: we see a performance issue with Module:String. The {{ח:צמצום}} template relies on {{#invoke:String|len|...}}, which consumes most of the CPU time.

Yeah, indeed.

I think this one has slightly different merit, with Hebrew characters being two bytes per character etc. (and it obviously would affect other wikis too, even if they haven't yet got to the point where their articles are long enough to cause an issue), so in theory the length of the page (in terms of number of characters) could be half the size.

Maybe it's the start of a larger discussion of either how we count it (maybe mb_strlen instead of strlen), or whether we increase it more globally/generally because of other perf improvements.
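To make the counting difference concrete, a quick standalone PHP sketch (the sample strings are arbitrary, chosen only to show bytes vs. characters):

```
// Byte count vs. character count for Hebrew text (UTF-8).
$hebrew = 'פקודת מס הכנסה';  // "Income Tax Ordinance" (arbitrary sample)
$latin  = 'Income Tax Ord';   // Latin sample of similar character length

echo strlen( $hebrew ), "\n";     // counts bytes: each Hebrew letter is 2 bytes in UTF-8
echo mb_strlen( $hebrew ), "\n";  // counts characters
echo strlen( $latin ), "\n";      // for ASCII, bytes == characters
echo mb_strlen( $latin ), "\n";
```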

The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not familiar with our parsing code, but I don't imagine it would do any sub-character processing.

In which case it seems reasonable that Hebrew would have twice the limit English does, which seems to be what is proposed here. Or switching to mb_strlen, as you described, would make the default limit multibyte-agnostic. That's probably a better solution than having every multibyte language use a doubled limit.

The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not familiar with our parsing code, but I don't imagine it would do any sub-character processing.

Yeah, no idea either. @cscott, @ssastry any ideas on this one? :)

There are two apparent solutions to the effective-limit problem for Hebrew pages:
A. Use mb_strlen instead of strlen to measure page size in characters rather than in bytes.
B. Keep $wgMaxArticleSize as the limit for the raw page size, and use 2*$wgMaxArticleSize as the limit for the post-expand include size.

As for the Hebrew Wikisource, the immediate workaround is to temporarily raise the limit as requested in the bug description.
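A minimal sketch of what option B might look like, reusing the paraphrased checks from earlier in the thread (the decoupling itself is hypothetical; there is no such switch in core today):

```
// Hypothetical sketch of option B: keep the raw-size check as-is, but give the
// post-expand include budget twice the headroom.

// Raw page size check (unchanged): still $wgMaxArticleSize KB of wikitext.
if ( strlen( $wikitext ) > $wgMaxArticleSize * 1024 ) {
	// reject the edit
}

// Post-expand include size: doubled relative to the raw-size limit.
$parserOptions->setMaxIncludeSize( 2 * $wgMaxArticleSize * 1024 );
```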

I'd recommend just temporarily bumping the limit for hewikisource for now. However, note that this is not equitable either: zhwiki, for example, should have 4x the character limit if this is to be the new rule. Unlike what is claimed above, many of the performance metrics *do* scale with bytes rather than characters -- most wikitext processing is at some point regexp-based, and that works on bytes (Unicode characters are desugared to the appropriate byte sequences), and of course network bandwidth, database storage size, database column limits, etc., all scale with bytes, not characters. We should be careful before bumping the limit that we're not going to run into problems with the database schema, etc.

It's worth noting that article size limits are at least in part a *social* construct, not a purely technical issue. Limits were set deliberately to restrict the size of articles to encourage splitting articles when they get too large to be readable. Of course wikisource is a different sort of thing, where the expectation is that the article is faithful to the original source document. But we shouldn't ignore the social implications of increasing article size limits on certain wikis, and the knock-on effects on article structure. This is mostly *not* a technical issue.

The Parsoid-specific issue here is T239841: Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent with Parser.php and (b) slow; we actually deliberately changed Parsoid to be consistent with core. In part this was to address the performance implication of running mb_strlen multiple times: unlike strlen, which is O(1) due to the way PHP represents strings, mb_strlen is O(length of the string).
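A throwaway micro-benchmark to illustrate that asymmetry (numbers will vary by machine; this is only meant to show the O(1) vs. O(n) shape):

```
// strlen() is O(1) because PHP strings carry their byte length;
// mb_strlen() has to walk the UTF-8 sequence, so it scales with string size.
$s = str_repeat( 'א', 1000000 );  // ~2 MB of UTF-8 Hebrew text

$t = hrtime( true );
for ( $i = 0; $i < 1000; $i++ ) {
	strlen( $s );
}
printf( "strlen:    %.3f ms\n", ( hrtime( true ) - $t ) / 1e6 );

$t = hrtime( true );
for ( $i = 0; $i < 1000; $i++ ) {
	mb_strlen( $s );
}
printf( "mb_strlen: %.3f ms\n", ( hrtime( true ) - $t ) / 1e6 );
```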

The broader question is T254522: Set appropriate wikitext limits for Parsoid to ensure it doesn't OOM. Core uses a grab bag of metrics as approximate proxies for "parsing and storage time and space complexity", to ensure that articles compliant with these simpler metrics don't cause OOMs or excessive time spent handling requests. But these are approximations, and they don't always map well to the performance profile of Parsoid on the same page. Eventually we'll have to reconcile these, and the result may be limits based on markup complexity, not simple string length or character count. (But of course database space and intracluster network bandwidth are still fundamentally byte-based!)

As a partial solution, would it be possible to use 2*$wgMaxArticleSize as the limit for the post-expand include size? $wgMaxArticleSize would continue to be used as the limit for the raw page size, before template expansion.

Note that $wgAPIMaxResultSize has a comment that says it depends on $wgMaxArticleSize, so if you bump $wgMaxArticleSize you probably need to bump $wgAPIMaxResultSize as well. There might also be DB schema implications, I don't know. I'd be cautious and check in broadly before making changes here.
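For illustration, the kind of sanity check implied by that comment (a sketch only; exact default values aren't quoted here, and note the unit mismatch: $wgAPIMaxResultSize is in bytes while $wgMaxArticleSize is in kilobytes):

```
// Sketch of the coupling implied by the comment on $wgAPIMaxResultSize:
// the API result size ceiling (bytes) should stay at or above the maximum
// article size (kilobytes), or large pages can be truncated in API responses
// even though they save and parse fine.
if ( $wgAPIMaxResultSize < $wgMaxArticleSize * 1024 ) {
	trigger_error(
		'$wgAPIMaxResultSize should be at least $wgMaxArticleSize * 1024 bytes',
		E_USER_WARNING
	);
}
```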