
Raise limit of $wgMaxArticleSize for Hebrew Wikisource
Open, Medium, Public

Description

The maximum article size (AKA post-expand include size) is set to 2048 KB. This limit is configured by the $wgMaxArticleSize variable. We ask to raise the limit to 4096 KB for the Hebrew Wikisource. We already hit the limit with two heavily accessed pages: the Income Tax Ordinance and the Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970. Those pages are rendered incorrectly due to the limit. Other pages, such as the Transportation Regulations, 5721–1961, are expected to hit the limit in the near future.

Breaking the legal text into sections is not considered a valid solution. Also note that Hebrew characters take two bytes per character in UTF-8, whereas Latin characters take one byte per character. Therefore the effective limit for Hebrew text is half that for Latin text of the same length.
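As an illustration (Python used here purely for demonstration), the same number of letters costs twice as many bytes in Hebrew as in Latin script under UTF-8:

```python
# Hebrew letters occupy two bytes each in UTF-8, while ASCII Latin
# letters occupy one byte, so a byte-based limit gives Hebrew text
# half the character budget.
hebrew = "פקודה"  # 5 Hebrew letters ("ordinance")
latin = "order"   # 5 Latin letters

print(len(hebrew), len(hebrew.encode("utf-8")))  # 5 characters, 10 bytes
print(len(latin), len(latin.encode("utf-8")))    # 5 characters, 5 bytes
```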

Event Timeline

Tagging Performance-Team and SRE as this potentially has performance impacts (longer pages potentially take longer to parse etc), and could mean reaching request timeout limits.

Other teams, like the Parsing team and the DBAs (among others), might have an interest too.

While I understand the reason for the request, on other projects (like enwiki etc.), splitting pages up has generally been the way forward.

I'm not saying "this cannot be done", more it needs a bit of discussion and input from other teams before doing it. Maybe some appropriate logging/monitoring put in place too.

So for anyone coming here to make (or deploy a patch), please don't do so (certainly not deploying) until it has been discussed and approved by the relevant parties.

The maximum article size (AKA post-expand include size) is set to 2048 KB.

$wgMaxArticleSize (* 1024, for bytes) is the limit applied to the strlen of the raw page content.

It is also used for 'maxIncludeSize' (* 1024, for bytes), which becomes the Post-expand include size in the NewPP report too.
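A rough sketch of those two checks (hypothetical Python, not MediaWiki's actual code; function names are illustrative):

```python
# $wgMaxArticleSize is expressed in kilobytes; both checks below end
# up using the same byte limit.
MAX_ARTICLE_SIZE_KB = 2048
LIMIT_BYTES = MAX_ARTICLE_SIZE_KB * 1024  # 2,097,152 bytes

def raw_size_ok(wikitext: str) -> bool:
    # Raw page content measured in bytes, like PHP's strlen on UTF-8.
    return len(wikitext.encode("utf-8")) <= LIMIT_BYTES

def post_expand_ok(post_expand_bytes: int) -> bool:
    # 'maxIncludeSize' reuses the very same value as the cap on the
    # post-expand include size reported in the NewPP comment.
    return post_expand_bytes <= LIMIT_BYTES
```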

I note it is slightly odd that they're both the same... A page that is mostly text (up to the limit) but contains a couple of (even simple) templates will then potentially be cut off or incorrectly rendered too.

https://he.wikisource.org/w/index.php?title=%D7%A4%D7%A7%D7%95%D7%93%D7%AA_%D7%9E%D7%A1_%D7%94%D7%9B%D7%A0%D7%A1%D7%94&action=info

Page length (in bytes) 1,448,087

But also

<!--
NewPP limit report
Parsed by mw1366
Cached time: 20210220034915
Cache expiry: 2592000
Dynamic content: false
Complications: []
CPU time usage: 11.855 seconds
Real time usage: 12.048 seconds
Preprocessor visited node count: 256994/1000000
Post‐expand include size: 2095966/2097152 bytes
Template argument size: 736332/2097152 bytes
Highest expansion depth: 10/40
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 757/5000000 bytes
Lua time usage: 4.244/10.000 seconds
Lua memory usage: 1688902/52428800 bytes
Lua Profile:
    recursiveClone <mwInit.lua:41>                                  2220 ms       50.7%
    (for generator)                                                  580 ms       13.2%
    Scribunto_LuaSandboxCallback::getExpandedArgument                540 ms       12.3%
    type                                                             420 ms        9.6%
    Scribunto_LuaSandboxCallback::gsub                               240 ms        5.5%
    <mwInit.lua:41>                                                  100 ms        2.3%
    ?                                                                 60 ms        1.4%
    getExpandedArgument <mw.lua:165>                                  60 ms        1.4%
    chunk <יחידה:String>                                         40 ms        0.9%
    tostring                                                          40 ms        0.9%
    [others]                                                          80 ms        1.8%
Number of Wikibase entities loaded: 0/400
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00% 9274.816      1 -total
 40.32% 3739.270   3616 תבנית:ח:ת+
 34.08% 3161.148   3833 תבנית:ח:צמצום
 19.58% 1815.624   1691 תבנית:ח:תת
 17.29% 1603.391   1948 תבנית:ח:פנימי
 14.50% 1344.959   1166 תבנית:ח:תתת
 13.66% 1266.808    730 תבנית:ח:חיצוני
  9.19%  852.182    556 תבנית:ח:סעיף
  6.46%  598.916    503 תבנית:ח:תתתת
  3.26%  302.772    463 תבנית:ח:הערה
-->

Side note: we see a performance issue with Module:String. The {{ח:צמצום}} template relies on {{#invoke:String|len|...}}, which consumes most of the CPU time.

Yeah, indeed.

I think this one has slightly different merit, with Hebrew characters being two bytes per character etc. (and it obviously would affect other wikis too, even if their articles aren't yet long enough to cause an issue), so in theory the length of the page (in terms of number of characters) could be half the size.

Maybe it's the start of a large discussion of either how we count it (maybe mb_strlen instead of strlen), or whether we increase it more globally/generally because of other perf improvements.

The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not familiar with our parsing code, but I don't imagine it would do any sub-character processing.

In which case it seems reasonable that Hebrew would have twice the limit English does, which seems to be what is proposed here. Or switching to mb_strlen, as you described, would make the default limit multibyte-agnostic. That's probably a better solution than having every multibyte language use a doubled limit.

The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not familiar with our parsing code, but I don't imagine it would do any sub-character processing.

Yeah, no idea either. @cscott, @ssastry any ideas on this one? :)

There are two possible solutions to the effective limit for Hebrew pages:
A. Use mb_strlen instead of strlen to measure page size in characters rather than in bytes.
B. Use $wgMaxArticleSize as the limit for the raw page size, and 2*$wgMaxArticleSize as the limit for the page post-expand include size.
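Option B could be sketched like this (hypothetical Python, illustrating the proposed rule rather than any existing MediaWiki code):

```python
MAX_ARTICLE_SIZE = 2048 * 1024  # bytes, the current $wgMaxArticleSize

def within_limits(raw_bytes: int, post_expand_bytes: int) -> bool:
    # Option B: keep the existing cap for raw wikitext, but allow the
    # post-expand include size to grow to twice that value, leaving
    # headroom for template expansion of multibyte text.
    return (raw_bytes <= MAX_ARTICLE_SIZE
            and post_expand_bytes <= 2 * MAX_ARTICLE_SIZE)
```

With the numbers from this task (raw size 1,448,087 bytes, post-expand size 2,095,966 bytes), the page would pass comfortably under this rule.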

As for the Hebrew Wikisource, the immediate workaround is to temporarily raise the limit as requested in the bug description.

I'd recommend just temporarily bumping the limit for hewikisource for now. However, note that this is not equitable either: zhwiki, for example, should have 4x the character limit if this is to be the new rule. Contrary to what is claimed above, many of the performance metrics *do* scale with bytes rather than characters: most wikitext processing is at some point regexp-based, and that works on bytes (Unicode characters are desugared to the appropriate byte sequences), and of course network bandwidth, database storage size, database column limits, etc., all scale with bytes, not characters. We should be careful before bumping the limit that we're not going to run into problems with the database schema, etc.

It's worth noting that article size limits are at least in part a *social* construct, not a purely technical issue. Limits were set deliberately to restrict the size of articles to encourage splitting articles when they get too large to be readable. Of course wikisource is a different sort of thing, where the expectation is that the article is faithful to the original source document. But we shouldn't ignore the social implications of increasing article size limits on certain wikis, and the knock-on effects on article structure. This is mostly *not* a technical issue.

The parsoid-specific issue here is T239841: Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow; we actually deliberately changed Parsoid to be consistent with core. In part this was to address the performance implication of running mb_strlen multiple times; unlike strlen which is O(1) due to the way PHP represents strings, mb_strlen is O(length of string).
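A Python analogy for that complexity difference (assuming UTF-8 data; Python's len() on a bytes object is O(1) like PHP's strlen, because the length is stored with the object, while counting code points requires scanning the whole string, like mb_strlen):

```python
# One Hebrew letter is two bytes in UTF-8, so the byte count is twice
# the code-point count for purely Hebrew text.
data = ("א" * 1_000_000).encode("utf-8")

byte_count = len(data)                   # O(1): stored with the object
char_count = len(data.decode("utf-8"))   # O(n): must scan/decode everything

print(byte_count, char_count)  # 2000000 1000000
```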

The broader question is T254522: Set appropriate wikitext limits for Parsoid to ensure it doesn't OOM. Core uses a grab bag of metrics as approximate proxies for "parsing and storage time and space complexity", to ensure that articles compliant with these simpler metrics don't cause OOMs or excessive time spent handling requests. But these are approximations, and they don't always map well to the performance profile of Parsoid on the same page. Eventually we'll have to reconcile these, and the result may be limits based on markup complexity, not simple string length or character count. (But of course database space and intracluster network bandwidth are still fundamentally byte-based!)

As a partial solution, would it be possible to use 2*$wgMaxArticleSize as the limit for the page post-expand include size? $wgMaxArticleSize would continue to be used as the limit for the raw page size, before template expansion.

Note that $wgAPIMaxResultSize has a comment that says it depends on $wgMaxArticleSize, so if you bumped $wgMaxArticleSize you probably need to bump $wgAPIMaxResultSize as well. There might also be DB schema implications, I don't know. I'd be cautious and check in broadly before making changes here.

I stumbled upon a similar problem with the Israeli Plant Protection Regulations (Plant Import, Plant Products, Pests and Regulated Articles), 5769–2009. A bidirectional template was using {{#if:{{{1|}}}{{{2|}}} | ... }} to check whether the first and second parameters are given. The test is based on template expansion, which, in this case, almost doubled the post-expand include size of the article. I had to remove the conditional in order for the regulation to remain within limits.

Is there any efficient way to check whether a parameter is given, without contributing to the post-expand size?

Are there any updates on this issue? I'm working around the limit by removing functionality from some templates, but I cannot dodge it much longer...

Oh wow, this has been open for more than a year. Why hasn't it been done yet?

  1. There is a consensus among editors that it should be done.
  2. There are no technical challenges associated with this solution. It's just a small change in LocalSettings.php of every wiki that asks for it.

We don't use LocalSettings.php on Wikimedia wikis ;)

Ok, I didn't know that Wikimedia wikis use a different kind of settings mechanism. But I assume it works like this: there is a variable $wgMaxArticleSize that you can tweak for each individual project, right?

Why can't you increase the limit for the projects that ask for it? More specifically, only for the few Wikisource wikis that are already experiencing problems because of this misconfiguration?

Can I do it myself? If yes, can you point me to the file I should change?

The limit exists for many reasons. If the limit had no purpose and it was fine to always raise, it would not exist.

The work to be done in this case is not so much about changing or adding one line of text in a configuration file, that's the easy part, and at this time nobody yet would have started thinking about doing that as that's trivial and not where the complexity resides. We make configuration changes every day (source).

Rather, the work to be done here is to understand the needs and figure out what a sustainable and responsible way to address that need is. If it turns out that the appropriate and adequate solution is to raise the limit, then that can be done. However, it is not obvious to me that raising the limit would be possible right now, nor is it obvious to me that it would actually help you in the long term.

Technical conversations like these work best when we start from understanding and solving a problem, optionally with suggested solution if you believe or think there is something at hand that could solve it, and then working towards a useful outcome.

I appreciate that it is annoying when a limit like this is reached, but in general we decline these requests and instead recommend that the solution come from within the community, by organising the information such that the limit is not reached.

If you have run a wiki yourself, raising the limit might seem trivial when you have only 1 active user (yourself), a handful of pages, and enough RAM to serve 1 person at a time to save and render whatever the biggest page is that you can produce alone. But things get a bit more... complicated when we're talking about operating a thousand wikis from a few hundred servers with a set amount of RAM, and then actively serving an audience of billions without causing significant risk or delays to the infrastructure.

Some things to consider:

  • More memory is consumed every time the software interacts with it.
  • Every person has to wait and pay for downloading it over their connection, and parsing it bit-by-bit on their device.
  • Larger articles tend to be harder to discover relevant information in from search, or to navigate and link between.
  • Larger articles take longer to process within the software, whether for JavaScript interactions, or server-side MobileFormatter, server-side DiscussionTools, PDF rendering, etc.
  • Larger articles are slower to edit, as they are downloaded and uploaded every time.

Clearly we don't have infinite memory, bandwidth, or funds, and can't time-travel, hence there is a limit.

Question: Is there a specific and limited concrete use case here that happens to require a slightly higher limit to work? That is, is there some new structure or content type that has evolved that basically always needs a larger size to work, but is limited?

If yes, then I'd recommend describing that use case and what size you expect to need for content of that type, and then see if we can plan our next round of hardware purchases and UX design to fit that size.

If not, then I'm afraid this is just "the same as before": deferring the task of solving an underlying problem in the information by raising the limit, which has no predictable end to it. I don't think we should raise it unless we have a reasonable understanding of how big we need to go, why, and an agreement that when that new limit is reached we will decline to raise it again because the content was structured incorrectly. By that same logic, one could argue that this is already where we are today, and thus decline this request the same way.

In other words, if a new Wikisource page would require 100x more bytes than the current limit, what has the community come up with as the solution for that? Would that same solution work here? How much is enough? And why?

Also note that Hebrew characters are two-bytes per character, whereas Latin characters are one-byte per character.

To my knowledge most, if not all, of the above reasons relate to digital bytes and not visible characters.

Why not use $wgMaxArticleSize as the limit for the raw page size, and 2*$wgMaxArticleSize as the limit for the page post-expand include size?

More memory is consumed every time the software interacts with it.

Isn't it cached after being generated once? On Wikisource, contrary to Wikipedia, pages are not edited so frequently. So once the page is generated, no CPU should be used to show it (unless some templates or linked files have changed). But memory, maybe.

Every person has to wait and pay for downloading it over their connection, and parsing it bit-by-bit on their device.

The page size doesn't increase traffic consumption that much, since the text is transferred gzipped. Contemporary browsers have a lot of optimizations under the hood and can show really long pages without lag; they can even start rendering the beginning of a page before it has completely downloaded (sometimes).

Larger articles tend to be harder to discover relevant information in from search, or to navigate and link between.

True for Wikipedia, but not for Wikisource. Imagine you are researching a book and need to find the chapter in which a character named Bob is first introduced. If the whole book is on a single page, the browser's built-in page search works perfectly. Now imagine solving that task when every chapter is on its own page.

Larger articles are slower to edit, as they are downloaded and uploaded every time.

Again, it's not a big difference, because of gzip: to upload a 4 MB book you need to send only about 1 MB of data.

According to this research, https://almanac.httparchive.org/en/2020/page-weight, the average page size (decompressed) in 2020 was about 6–7 MB.
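The compression claim is easy to sanity-check (Python sketch; the sample text here is artificially repetitive, so it compresses far better than real prose, where a ratio around 4:1 is more typical):

```python
import gzip

# ~1 MB of repetitive Hebrew legal-style text.
text = ("תקנות התכנון והבניה " * 30_000).encode("utf-8")

compressed = gzip.compress(text)
ratio = len(text) / len(compressed)
print(len(text), len(compressed), round(ratio, 1))
assert ratio > 4  # the transfer is much smaller than the raw page size
```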

Another consideration: the alternative to a page with 4 MB of text is not a page with 1 MB of text, but rather 4 pages with 1 MB of text each.

When you think about it this way, there should be almost no difference, either for the Wikimedia servers or for the user.

User Vladis13 did a great job importing some public domain texts in Russian Wikisource. He also prepared a folder with texts that were too big: https://disk.yandex.ru/d/QjFXyEiY0t7Qvw

I analyzed that folder and created this chart with file size buckets:

Screenshot 2022-05-21 at 21.47.22.png (680×1 px, 71 KB)

(you can see the chart and data in this spreadsheet: https://docs.google.com/spreadsheets/d/18sh9wyqzUg9MYbJpzrcq77UOR8B7MfTN5GpE1yyUFbE/edit?usp=sharing)

Conclusion: if the maximum article limit is increased to 4 MB, he can automatically upload 314 books to the Russian Wikisource. There are still 14 books larger than 4 MB, but those we can deal with manually and split into parts; it would be impossible to do that for 300 books.

Thanks. I'll expand my comment with some additional concerns whilst also re-considering the option in context of narrowed application for Wikisource.


Question: Is there specific and limited concrete use case here that happpens to require a slightly higher limit to work? That is, is there some new structure or content type that has evolved that basically always needs a larger size to work, but is limited?

The use case that commonly puts Wikisource against the limit is that Wikisource users work on books one book page at a time. Each book page has its own wikipage in the "Page" namespace. Then, once a book chapter is done, the wikipages of that chapter (from the Page namespace) are transcluded into a wikipage for the chapter in the main namespace. This also explains why it is so easy for Wikisource to hit the limit. So the actual need depends on the biggest chapter of each book.

Made T309568 as a possible fix.

database storage size, database column limits, etc, all scale with bytes not characters. We should be careful before bumping the limit that we're not going to run into problems with database schema, etc.

The database schema does not require modification. The database text column is of type binary MEDIUMBLOB, and gzipped diffs are written there. The MEDIUMBLOB type can store up to 16 MB of data per entry, i.e. this is the diff size limit for any page edit (revision).
The number of page edits is not limited; all diffs are combined only when the page is rendered. If we recall the documentation's words, "The compression ratio achieved on Wikimedia sites nears 98%." (DB doc), it can be argued that the current database schema can store pages of practically unlimited size.

Some things to consider:

  • Larger articles take longer to process within the software, whether for JavaScript interactions, or server-side MobileFormatter, server-side DiscussionTools, PDF rendering, etc.

As I wrote on T308893, the only difficulty I see here is the syntax highlighting gadget in the browser editor, which is automatically disabled on large pages. It is already unstable on relatively small pages with lots of tags/templates, and/or when they are not correctly closed during editing.

MobileFormatter.
Modern e-readers with WiFi are usually equipped with old Android 4.4 and strange modifications of the Chrome browser that do not work with the JavaScript of most web pages (the browser crashes even on the google.com page). This is the situation with my OnyxBook Volta. Nevertheless, I can easily open the largest ruwikisource page (T308893#7954437) in both the mobile and PC versions.

DiscussionTools.
Here the issue only concerns the main namespace on Wikisource. I doubt this will be needed on discussion pages or in other Wikimedia projects.

PDF rendering
To be honest, I don't know who uses it on Wikisource: PDF is a very heavy format with large files, and it is extremely inconvenient to read on mobile devices and PCs because of its fixed page layout.
On the contrary, the main work of Wikisource users is converting PDF to text. For those interested in the PDF, Wikisource has links to Commons and library sites.