
Raise limit of $wgMaxArticleSize for Hebrew Wikisource
Open, Medium, Public

Description

The maximum article size (a.k.a. post-expand include size) is set to 2048 KB. This limit is configured by the $wgMaxArticleSize variable. We ask to raise the limit to 4096 KB for the Hebrew Wikisource. We have already hit the limit with two heavily accessed pages: the Income Tax Ordinance and the Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970. Those pages are rendered incorrectly due to the limit. Other pages, such as the Transportation Regulations, 5721–1961, are expected to hit the limit in the near future.

Breaking the legal text into sections is not considered a valid solution. Also note that Hebrew characters are two bytes per character, whereas Latin characters are one byte per character. Therefore the effective limit for Hebrew text is half that for Latin text of the same length.

Event Timeline


The maximum article size (a.k.a. post-expand include size) is set to 2048 KB.

$wgMaxArticleSize (* 1024) is the limit used for the output of strlen against the page content.

It is also used for 'maxIncludeSize' (* 1024 for bytes) which becomes the Post‐expand include size in the NewPP report too.

I note it is slightly odd that they're both the same... A page that is mostly text (close to the limit), but with a couple of (even simple) templates, will then potentially be cut off or rendered incorrectly too.
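For reference, a simplified sketch of how the single setting ends up governing both checks (illustrative only; this is not the actual core code, and the names are approximate):

```php
<?php
// Simplified sketch of how a single $wgMaxArticleSize setting (in KiB)
// ends up governing two different checks. Illustrative only; not the
// actual MediaWiki core code.

$wgMaxArticleSize = 2048;              // KiB (current default)
$maxBytes = $wgMaxArticleSize * 1024;  // 2097152 bytes

// 1. On save, the raw wikitext length in bytes (strlen) is compared
//    against the byte limit.
function isTooLargeToSave( string $wikitext, int $maxBytes ): bool {
    return strlen( $wikitext ) > $maxBytes;
}

// 2. During parsing, the same value is used as 'maxIncludeSize', which is
//    what the NewPP limit report shows as "Post-expand include size".
$parserOptions = [
    'maxIncludeSize' => $maxBytes,
];
```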

https://he.wikisource.org/w/index.php?title=%D7%A4%D7%A7%D7%95%D7%93%D7%AA_%D7%9E%D7%A1_%D7%94%D7%9B%D7%A0%D7%A1%D7%94&action=info

Page length (in bytes) 1,448,087

But also

<!--
NewPP limit report
Parsed by mw1366
Cached time: 20210220034915
Cache expiry: 2592000
Dynamic content: false
Complications: []
CPU time usage: 11.855 seconds
Real time usage: 12.048 seconds
Preprocessor visited node count: 256994/1000000
Post‐expand include size: 2095966/2097152 bytes
Template argument size: 736332/2097152 bytes
Highest expansion depth: 10/40
Expensive parser function count: 0/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 757/5000000 bytes
Lua time usage: 4.244/10.000 seconds
Lua memory usage: 1688902/52428800 bytes
Lua Profile:
    recursiveClone <mwInit.lua:41>                                  2220 ms       50.7%
    (for generator)                                                  580 ms       13.2%
    Scribunto_LuaSandboxCallback::getExpandedArgument                540 ms       12.3%
    type                                                             420 ms        9.6%
    Scribunto_LuaSandboxCallback::gsub                               240 ms        5.5%
    <mwInit.lua:41>                                                  100 ms        2.3%
    ?                                                                 60 ms        1.4%
    getExpandedArgument <mw.lua:165>                                  60 ms        1.4%
    chunk <יחידה:String>                                         40 ms        0.9%
    tostring                                                          40 ms        0.9%
    [others]                                                          80 ms        1.8%
Number of Wikibase entities loaded: 0/400
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00% 9274.816      1 -total
 40.32% 3739.270   3616 תבנית:ח:ת+
 34.08% 3161.148   3833 תבנית:ח:צמצום
 19.58% 1815.624   1691 תבנית:ח:תת
 17.29% 1603.391   1948 תבנית:ח:פנימי
 14.50% 1344.959   1166 תבנית:ח:תתת
 13.66% 1266.808    730 תבנית:ח:חיצוני
  9.19%  852.182    556 תבנית:ח:סעיף
  6.46%  598.916    503 תבנית:ח:תתתת
  3.26%  302.772    463 תבנית:ח:הערה
-->

Side note: We see a performance issue with Module:String. The {{ח:צמצום}} template relies on {{#invoke:String|len|...}}, which consumes most of the CPU time.

Yeah, indeed.

I think this one has slightly different merit, with Hebrew characters being two bytes per character etc. (and it would obviously affect other wikis too, even if their articles aren't yet long enough to cause an issue), so in theory the length of the page (in terms of number of characters) can only be half the size.

Maybe it's the start of a large discussion of either how we count it (maybe mb_strlen instead of strlen), or whether we increase it more globally/generally because of other perf improvements.

The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not familiar with our parsing code, but I don't imagine it would do any sub-character processing.

In which case it seems reasonable that Hebrew would have twice the limit English does, which seems to be what is proposed here. Or switching to mb_strlen, as you described, would make the default limit multibyte-agnostic. That's probably a better solution than having every multibyte language use a doubled limit.
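To make the byte-versus-character difference concrete, here is a standalone sketch (not MediaWiki code) comparing the two functions on Hebrew and Latin strings of similar length:

```php
<?php
// Standalone illustration of why a byte-based limit (strlen) roughly halves
// the effective limit for Hebrew text compared to a character-based one
// (mb_strlen). Not MediaWiki code.

$hebrew = 'פקודת מס הכנסה';   // "Income Tax Ordinance": 14 characters
$latin  = 'Income Tax Ordin'; // 16 Latin characters for comparison

// Hebrew letters are 2 bytes each in UTF-8; the spaces are 1 byte each.
var_dump( strlen( $hebrew ) );             // int(26) -- bytes
var_dump( mb_strlen( $hebrew, 'UTF-8' ) ); // int(14) -- characters

var_dump( strlen( $latin ) );              // int(16) -- bytes == characters
var_dump( mb_strlen( $latin, 'UTF-8' ) );  // int(16)

// Performance caveat (also noted later in this task): strlen() is O(1)
// because PHP strings store their byte length, while mb_strlen() has to
// walk the string, so it is O(n) in the string length.
```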

The fact that each character takes twice the storage space shouldn't affect parsing complexity and time, right? I'm not familiar with our parsing code, but I don't imagine it would do any sub-character processing.

Yeah, no idea either. @cscott, @ssastry any ideas on this one? :)

There are two apparent solutions to the effective limit for Hebrew pages:
A. Use mb_strlen instead of strlen to measure page size in characters rather than in bytes.
B. Use $wgMaxArticleSize as the limit for the raw page size, and 2*$wgMaxArticleSize as the limit for the post-expand include size (see the sketch below).

As for the Hebrew Wikisource, the immediate workaround is to temporarily raise the limit as requested in the bug description.
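To make option B concrete, here is a hypothetical sketch of how the two limits could be decoupled. $wgMaxPostExpandSize is an invented name used only for illustration; today both limits are derived from $wgMaxArticleSize alone, so the real change would have to be made in core, not in site configuration:

```php
<?php
// Hypothetical sketch of option B. $wgMaxPostExpandSize does not exist in
// MediaWiki; it is invented here to illustrate the proposed split.

$wgMaxArticleSize = 2048;                     // KiB, raw wikitext limit at save time

// Invented setting: a separate, larger cap for template expansion.
$wgMaxPostExpandSize = 2 * $wgMaxArticleSize; // 4096 KiB

// Core would then pass the second value to the parser instead of reusing
// the first one for both checks:
$parserOptions = [
    'maxIncludeSize' => $wgMaxPostExpandSize * 1024, // bytes
];
```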

I'd recommend just temporarily bumping the limit for hewikisource for now. However, note that this is not equitable either: zhwiki for example should have 4x the character limit if this is to be the new rule. Unlike what is claimed above, many of the performance metrics *do* scale with bytes rather than characters -- most wikitext processing is at some point regexp-based, and that works on bytes (unicode characters are desugared to the appropriate byte sequences), and of course network bandwidth, database storage size, database column limits, etc, all scale with bytes not characters. We should be careful before bumping the limit that we're not going to run into problems with database schema, etc.

It's worth noting that article size limits are at least in part a *social* construct, not a purely technical issue. Limits were set deliberately to restrict the size of articles to encourage splitting articles when they get too large to be readable. Of course wikisource is a different sort of thing, where the expectation is that the article is faithful to the original source document. But we shouldn't ignore the social implications of increasing article size limits on certain wikis, and the knock-on effects on article structure. This is mostly *not* a technical issue.

The parsoid-specific issue here is T239841: Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow; we actually deliberately changed Parsoid to be consistent with core. In part this was to address the performance implication of running mb_strlen multiple times; unlike strlen which is O(1) due to the way PHP represents strings, mb_strlen is O(length of string).

The broader question is T254522: Set appropriate wikitext limits for Parsoid to ensure it doesn't OOM. Core uses a grab bag of metrics as approximate proxies for "parsing and storage time and space complexity", to ensure that articles compliant with these simpler metrics don't cause OOMs or excessive time spent handling requests. But these are approximations, and they don't always map well to the performance profile of Parsoid on the same page. Eventually we'll have to reconcile these, and the result may be limits based on markup complexity, not simple string length or character count. (But of course database space and intracluster network bandwidth are still fundamentally byte-based!)

As a partial solution, would it be possible to use 2*$wgMaxArticleSize as the limit for the post-expand include size? $wgMaxArticleSize would continue to be used as the limit for the raw page size, before template expansion.

Note that $wgAPIMaxResultSize has a comment that says it depends on $wgMaxArticleSize, so if you bumped $wgMaxArticleSize you probably need to bump $wgAPIMaxResultSize as well. There might also be DB schema implications, I don't know. I'd be cautious and check in broadly before making changes here.
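For context, on Wikimedia wikis such a change would be made as a per-wiki override in the configuration repository rather than in LocalSettings.php. A rough sketch of what a coordinated bump might look like, in the style of wmf-config's InitialiseSettings.php (values and exact layout are illustrative, not a reviewed change):

```php
<?php
// Illustrative sketch only, in the style of per-wiki overrides keyed by
// database name. Not a reviewed or deployable configuration change.
$settingsSketch = [
    'wgMaxArticleSize' => [
        'default'      => 2048, // KiB
        'hewikisource' => 4096, // KiB, as requested in this task
    ],
    // $wgAPIMaxResultSize is documented as depending on $wgMaxArticleSize,
    // so it would presumably need a matching bump (exact value to be decided).
    'wgAPIMaxResultSize' => [
        'default'      => 8388608,      // bytes (8 MiB)
        'hewikisource' => 2 * 8388608,  // bytes, illustrative only
    ],
];
```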

I stumbled upon a similar problem with the Israeli Plant Protection Regulations (Plant Import, Plant Products, Pests and Regulated Articles), 5769–2009. A bidirectional template was using {{#if:{{{1|}}}{{{2|}}} | ... }} to check whether the first and second parameters are given. The test is based on template expansion, which, in this case, almost doubled the post-expand include size of the article. I had to remove the conditional in order for the regulation to remain within the limits.

Is there any efficient way to check whether a parameter is given, without contributing to the post-expand include size?

Are there any updates on this issue? I'm working around the limit by removing functionality from some templates, but I cannot dodge it much longer...

Oh wow, this has been open for more than a year. Why hasn't it been done yet?

  1. There is a consensus among editors that it should be done.
  2. There are no technical challenges associated with this solution. It's just a small change in LocalSettings.php of every wiki that asks for it.

We don't use LocalSettings.php on Wikimedia wikis ;)

Ok, I didn't know that Wikimedia wikis use a different kind of settings. But I assume it works like this: there is a variable $wgMaxArticleSize that you can tweak for each individual project, right?

Why can't you increase the limit for the projects that ask for it? More specifically, only for some Wikisource wikis that are already experiencing problems because of this misconfiguration?

Can I do it myself? If yes, can you point me to the file I should change?

The limit exists for many reasons. If the limit had no purpose and it was fine to always raise, it would not exist.

The work to be done in this case is not so much about changing or adding one line of text in a configuration file; that's the easy part, and nobody has started thinking about doing that yet because it is trivial and not where the complexity resides. We make configuration changes every day (source).

Rather, the work to be done here is to understand the needs and figure out what a sustainable and responsible way to address that need is. If it turns out that the appropriate and adequate solution is to raise the limit, then that can be done. However, it is not obvious to me that raising the limit would be possible right now, nor is it obvious to me that it would actually help you in the long term.

Technical conversations like these work best when we start from understanding and solving a problem, optionally with a suggested solution if you believe there is something at hand that could solve it, and then work towards a useful outcome.

I appreciate that it is annoying when a limit like this is reached, but in general we decline these requests and instead recommend that the solution come from within the community, by organising the information such that the limit is not reached.

If you have run a wiki yourself, it might seem trivial to raise the limit when you have only 1 active user (yourself), a handful of pages, and enough RAM to serve 1 person at a time to save and render whatever the biggest page is that you can produce alone. But things get a bit more... complicated when we're talking about operating a thousand wikis from a few hundred servers with a set amount of RAM, and then actively serving an audience of billions without causing significant risk or delays to the infrastructure.

Some things to consider:

  • More memory is consumed every time the software interacts with it.
  • Every person has to wait and pay for downloading it over their connection, and parsing it bit-by-bit on their device.
  • Larger articles tend to be harder to discover relevant information in from search, or to navigate and link between.
  • Larger articles take longer to process within the software, whether for JavaScript interactions, or server-side MobileFormatter, server-side DiscussionTools, PDF rendering, etc.
  • Larger articles are slower to edit, as they are downloaded and uploaded every time.

Clearly we don't have infinite memory, bandwidth, or funds, and can't time-travel, hence there is a limit.

Question: Is there a specific and limited concrete use case here that happens to require a slightly higher limit to work? That is, is there some new structure or content type that has evolved that basically always needs a larger size to work, but is limited?

If yes, then I'd recommend describing that use case and what size you expect to need for content of that type, and then see if we can plan our next round of hardware purchases and UX design to fit that size.

If not, then I'm afraid this is just "the same as before": deferring the task of solving an underlying problem in the information by raising the limit, which has no predictable end to it. I don't think we should raise it unless there is a reasonable understanding of how big a limit we need, why, and an agreement that when that limit is reached we decline to raise it again because the content was structured incorrectly. By that same logic we could conclude that perhaps that is already where we are today, and thus decline this suggestion the same way.

In other words, if a new Wikisource page would require 100x more bytes than the current limit, what has the community come up with as the solution for that? Would that same solution work here? How much is enough? And why?

Also note that Hebrew characters are two bytes per character, whereas Latin characters are one byte per character.

To my knowledge most, if not all, of the above reasons relate to digital bytes and not visible characters.

Why not use $wgMaxArticleSize as the limit for the raw page size, and 2*$wgMaxArticleSize as the limit for the post-expand include size?

More memory is consumed every time the software interacts with it.

Isn't it cached after being generated once? In Wikisource, unlike Wikipedia, pages are not edited very frequently. So once the page is generated, no CPU should be needed to show it (unless some templates or linked files were changed). But memory, maybe.

Every person has to wait and pay for downloading it over their connection, and parsing it bit-by-bit on their device.

The page size doesn't increase traffic consumption that much, since the text is transferred gzipped. Contemporary browsers have a lot of optimizations under the hood to be able to show really long pages without lag. They can even start showing the beginning of the page before it has completely downloaded (sometimes).

Larger articles tend to be harder to discover relevant information in from search, or to navigate and link between.

True for Wikipedia, but not for Wikisource. Imagine you are doing research about a book and you need to find in which chapter a character named Bob was first introduced. If the whole book is loaded on a single page, the browser's built-in page search works perfectly. Now imagine how to solve that task if every chapter is on its own page.

Larger articles are slower to edit, as they are downloaded and uploaded every time.

Again, it's not a big difference because of GZIP. To upload a 4 MB book you need to send only 1 MB of data.

According to this research: https://almanac.httparchive.org/en/2020/page-weight - in 2020 the average page size (decompressed) was about 6-7 MB.
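The compression argument is easy to sanity-check with a tiny standalone script (illustrative only, not part of MediaWiki) that measures the gzip ratio of any saved page text:

```php
<?php
// Measure how much a saved page (raw wikitext or exported HTML) shrinks
// under gzip, the way HTTP responses are compressed in transit.
// Usage: php gzip-ratio.php path/to/page.txt
// Illustrative only; actual ratios depend on the text.

$path = $argv[1] ?? null;
if ( $path === null || !is_readable( $path ) ) {
    fwrite( STDERR, "Usage: php gzip-ratio.php <file>\n" );
    exit( 1 );
}

$raw        = file_get_contents( $path );
$compressed = gzencode( $raw, 6 ); // same algorithm as Content-Encoding: gzip

printf(
    "raw: %.2f MB, gzipped: %.2f MB (%.0f%% of original)\n",
    strlen( $raw ) / 1048576,
    strlen( $compressed ) / 1048576,
    100 * strlen( $compressed ) / max( 1, strlen( $raw ) )
);
```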

Another consideration: the alternative to a page with 4 MB of text is not a page with 1 MB of text, but rather 4 pages with 1 MB of text each.

When you think about it this way, there should be almost no difference, either for the Wikimedia servers or for the user.

User Vladis13 did a great job importing some public domain texts into the Russian Wikisource. He also prepared a folder with texts that were too big: https://disk.yandex.ru/d/QjFXyEiY0t7Qvw

I analyzed that folder and created this chart with file size buckets:

[Chart of file size buckets: Screenshot 2022-05-21 at 21.47.22.png (680×1 px, 71 KB)]

(you can see the chart and data in this spreadsheet: https://docs.google.com/spreadsheets/d/18sh9wyqzUg9MYbJpzrcq77UOR8B7MfTN5GpE1yyUFbE/edit?usp=sharing)

Conclusion: if the maximum article size is increased to 4 MB, he can automatically upload 314 books into the Russian Wikisource. There are still 14 books larger than 4 MB, but we can deal with those manually and split them into parts. It's impossible to do that for 300 books.

Thanks. I'll expand my comment with some additional concerns whilst also reconsidering the option in the context of a narrower application to Wikisource.

Snaevar subscribed.

Question: Is there a specific and limited concrete use case here that happens to require a slightly higher limit to work? That is, is there some new structure or content type that has evolved that basically always needs a larger size to work, but is limited?

The use case that commonly puts Wikisource against the limit is that Wikisource users work on books one book page at a time. Each book page has its own wiki page in the "Page" namespace. Then, once a book chapter is done, the wiki pages of that chapter (from the Page namespace) are transcluded into a wiki page for that chapter in the main namespace. This also explains why it is so easy for Wikisource to reach the limit. So the actual need depends on the biggest chapter of each book.

Made T309568 as a possible fix.

database storage size, database column limits, etc, all scale with bytes not characters. We should be careful before bumping the limit that we're not going to run into problems with database schema, etc.

The database schema does not require modification. The database text column is of type MEDIUMBLOB (binary); gzipped diffs are written there. The MEDIUMBLOB type can store up to 16 MB of data per entry, i.e. this is the size limit for any single page edit (revision).
The number of page edits is not limited; all diffs are assembled only when the page is rendered. If we recall the documentation's statement that "The compression ratio achieved on Wikimedia sites nears 98%." (DB doc), then it can be argued that the current database schema can store pages of practically unlimited size.

Some things to consider:

  • Larger articles take longer to process within the software, whether for JavaScript interactions, or server-side MobileFormatter, server-side DiscussionTools, PDF rendering, etc.

As I wrote on T308893, the only difficulty I see here is the syntax-highlighting gadget in the browser editor, which is automatically disabled on large pages. It is already unstable on relatively small pages with lots of tags/templates and/or when they are not correctly closed during editing.

MobileFormatter.
Modern e-readers with Wi-Fi are usually equipped with old Android 4.4 and strange modifications of the Chrome browser that do not work with the JavaScript of most web pages (the browser crashes even on the google.com page). This is the situation with my OnyxBook Volta. However, I can easily open the largest page on ruwikisource (T308893#7954437) in both the mobile and PC versions.

DiscussionTools.
Here the issue only concerns the main namespace on the Wikisources. I doubt this will be needed on discussion pages or in other Wikimedia projects.

PDF rendering
To be honest, I don't know who uses it on Wikisource, because it is a very heavy format with large files. PDF is extremely inconvenient to read on mobile devices and PCs because it has a fixed page image size.
On the contrary, the main work of Wikisource users is converting PDF to text. For those interested in the PDF, Wikisource has links to Commons and to library sites.

Editing on Wikisource has some peculiarities that most Wikipedia users are not familiar with. We work in two steps: 1. preparing the actual text (in the '''Page:''' namespace); 2. putting the pieces together (i.e. transcluding the hundreds of pages) into one text.

Pages of both these types are quite small (a single page is at most ~2000 bytes); the transcluding page is just a few hundred bytes of _code_.

However, the hundreds of transcluded pages easily exceed the post-expand include size limit of 2,097,152 bytes. We are confronted with this problem quite often: on the Polish Wikisource about 70 pages (out of over 4,000 texts) are now affected.

Why do we try to build such huge "articles"?

  • because we strive for maximal similarity between the original text and our result
  • because these pages are convenient when searching for something in the text (where a figure first appeared, etc.)

For users who are not so familiar with the peculiarities of the Wikisource projects: please think of a rather short article in the main namespace that transcludes hundreds of templates...

We cannot simply follow the straightforward advice to write two shorter articles (instead of one) because that would be more convenient for the users. If you have a heavy, thick manual, you do not take scissors and cut it into two pieces because they would be easier to carry...

Would the new limit (4194304) help us? Certainly, it would resolve the problem. It is still possible that we hit the limit with one or two texts some day; however, this is much less probable than now. It would certainly not encourage us to prepare even longer texts since, as I described previously, the amount of text to be transcluded is not at our discretion.

Fuzzy raised the priority of this task from Medium to High. Jun 15 2023, 11:33 AM

We hit the limit once again with the Israeli Income Tax Ordinance. We cannot shorten the templates and we need an immediate solution.

I concur with [[User:Fuzzy]]; a direct solution to this is needed on Hebrew Wikisource.

The Income Tax Ordinance requires a temporary, immediate solution while we continue to ponder the best permanent one.

@Aklapper Is there a way to speed up the handling of this? This task has gone unanswered for a long time.

@Aklapper Seems like a good idea. Can you send it there?

@neriah: No, but anyone who wants to discuss this topic is free to post on the mailing list and explain it.

Repeating myself from T189108#9054179:

What should be done is not increasing the post-expand include size to some arbitrary value, but moving some limits to be character-based rather than byte-based. Currently, non-ASCII wikis have much lower effective limits because, naturally, texts in their languages have to use Unicode (Кот is 6 bytes and Cat is 3 bytes, for example).

Given what I’ve read in the discussion above, it doesn’t necessarily have to be about changing the limit to 2 MB for all languages; it could be about decreasing the limit to an appropriate number of characters for all languages. It doesn’t make much sense that you are essentially better off if your language uses Latin script than if it uses some other writing system. While I can explain that to someone on a technical level, that’s only half the story for non-technical people.

For the record, I don't think that the need to be able to build even longer pages justifies fulfilling this request. They can always split the pages, which is better for readability and usability. But it is clearly weird to everyone who encounters this limit that it favours ASCII-based languages.

We cannot simply follow the straightforward advice to write two shorter articles (instead of one) because that would be more convenient for the users. If you have a heavy, thick manual, you do not take scissors and cut it into two pieces because they would be easier to carry...

Wikisource editors can absolutely split pages into smaller ones, since those longer pages have sections and subsections. The community choice might be not to split them, but that’s not necessarily a technical problem.

Wikisource editors can absolutely split pages into smaller ones, since those longer pages have sections and subsections. The community choice might be not to split them, but that’s not necessarily a technical problem.

This is not a feasible solution for legal texts, such as the Israeli Income Tax Ordinance or the Israeli Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970. The readers expect us to provide the complete law, not part-by-part.

For the record, I don't think that the need to be able to build even longer pages justifies fulfilling this request. They can always split the pages, which is better for readability and usability.

I disagree. From a usability standpoint, it's better to have the whole document on one page. For example, you can conduct a quick search in the entire text using the built-in browser search.

However, I agree that the current limits discriminate against non-ASCII-based languages.

‘Readers expect us to dump everything on one page’ is just your opinion, and so is ‘from a usability standpoint, it’s better to have the whole document on one page’. From a usability standpoint, it’s better to have a page that doesn’t weigh 2.3 MB just in HTML. Some things cannot necessarily be done across separate pages (searching between them with browser tools), but it’s not an insurmountable difference (with the built-in search). The fact that the Hebrew Wikisource community doesn’t like to present those pages separately doesn’t mean that they can’t be presented separately.

From a usability standpoint, it’s better to have a page that doesn’t weigh 2.3 MB just in HTML.

Could you elaborate on why serving 2.3 MB of HTML is bad from a usability standpoint?

Wikisource editors can absolutely split pages into smaller ones, since those longer pages have sections and subsections. The community choice might be not to split them, but that’s not necessarily a technical problem.

It is obvious from your words that you have not read the discussion and arguments above, including the fact that all the users who spoke here, from different Wikisource language projects, were in favor of raising the limit, because of the big problems that splitting pages creates for editors and for readers' usability.
Raising the limit will not cause any technical problems.
Your opinion is only your opinion as a Wikipedia user. This topic doesn't concern you at all.

Could you elaborate on why serving 2.3 MB of HTML is bad from a usability standpoint?

Because heavy pages load worse for readers, especially on poorer connections. That’s basic web development. Given that a lot of readers are also on mobile, splitting the pages instead of requiring them to load an egregiously heavy page is better in the long term, even if the editors might not like that answer.

(@Vladis13 please keep in mind https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette as you are resorting to personal attacks and I have said nothing against your Wikisource which follows the Russian law on ‘extremist content’ to a tee.)

Because heavy pages load worse for readers, especially on poorer connections. That’s basic web development.

Readers don't care directly about the weight of a webpage's HTML. What they care about is how quickly they can access the document in which they're interested. Fortunately, most (if not all) contemporary browsers support webpage streaming, which allows them to begin rendering the webpage before its HTML has completely downloaded. Therefore, the time to Largest Contentful Paint for a webpage, whether it's 0.23 MB or 2.3 MB, should be more or less the same. Or not?

(@Vladis13 please keep in mind https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette as you are resorting to personal attacks and I have said nothing against your Wikisource which follows the Russian law on ‘extremist content’ to a tee.)

You call it a personal attack, although I only repeated your words "just your opinion". In addition, you reproach me personally for a rule of the Russian Wikisource that is based on the Wikimedia Foundation's Terms of Use policy. Yes, all of this is a personal attack, only yours.

None of this is helping move the discussion forward.

Timo's comment in T275319#7947012 is still relevant.

And at the same time, no one has declined this task, or said it won't ever be done. It requires planning, effort, testing and actual prioritisation by the people doing those activities etc.

Readers don't care directly about the weight of a webpage's HTML. What they care about is how quickly they can access the document in which they're interested. Fortunately, most (if not all) contemporary browsers support webpage streaming, which allows them to begin rendering the webpage before its HTML has completely downloaded. Therefore, the time to Largest Contentful Paint for a webpage, whether it's 0.23 MB or 2.3 MB, should be more or less the same. Or not?

Yes, on almost any book site where you can download books, they are downloaded in full (1-4 MB or more). This is all perfectly readable on smartphones and e-readers; no problems. No reader will download many separate parts to read: it is extremely inconvenient, and readers simply go to another site. And splitting the pages is hell for editors, wasting a LOT of time.

Representatives of the Hebrew, Russian and Polish Wikisources have already expressed their support here.

Because heavy pages load worse for readers, especially on poorer connections. That’s basic web development. Given that a lot of readers are also on mobile, splitting the pages instead of requiring them to load an egregiously heavy page is better in the long term, even if the editors might not like that answer.

Perhaps this is true for Wikipedia articles and Twitter. But it is absolutely harmful for books, which are almost impossible to read from the site; they are downloaded to e-readers (only modern ones have Wi-Fi) or to gadgets from which people read with Wi-Fi turned off.
In addition, books are exported to third-party sites and repositories, which is problematic to do from a pack of separate pages. It's easier to go to another site.
Also, it is now 2023. MS Windows permanently downloads gigantic online updates without fail. Google sends each word to its servers for spell checking as you type it. An average page on the Internet takes 50-100 KB or more in the browser; the size of the short https://ru.wikipedia.org/wiki/Main_page is 120 KB. It's not serious to talk about traffic problems and the convenience of reading and working with books in pieces, always from the site and always online.

None of this is helping move the discussion forward.

Timo's comment in T275319#7947012 is still relevant.

And at the same time, no one has declined this task, or said it won't ever be done. It requires planning, effort, testing and actual prioritisation by the people doing those activities etc.

Let me address the complexities listed in T275319#7947012.

Larger articles tend to be harder to discover relevant information in from search, or to navigate and link between.

The question of searching within a single text and of interlinking has been answered in detail above. A single file is incomparably more convenient, both for users and for editors.
In general, I doubt that anyone reads large books from a PC monitor while staying on the site online; it is physically extremely inconvenient. Readers download them as one file. The question is why editors should go through the hell of splitting into pages (which are usually already split in the Page namespace), only for readers to then assemble it all back into one file with https://www.mediawiki.org/wiki/Extension:Collection when downloading. All this takes a lot of effort and server requests.

On the issue of memory and the mentioned JS and PHP extensions, I answered in T275319#7977747.

Larger articles are slower to edit, as they are downloaded and uploaded every time.

This is not relevant for Wikisource, because proofreading is performed in the Page namespace, on small pages that are already split and then transcluded into the page in the main namespace.
Otherwise, if a user edits a section in the main namespace, only that section is fetched from and sent to the server, as a small request (the &section= argument in the URL, like https://www.mediawiki.org/w/index.php?title=Help:Editing_pages&action=edit&section=1).

That is, editing on one page is faster than on split ones. For example:
a) You are reading a downloaded book and find a typo. You open the page in Wikisource, press Ctrl-F to search for a word or go to the section via the table of contents at the top, then edit the section (only that section is quickly loaded into the editor) or click on the page number to go to the small subpage in the Page namespace. Done.
b) If the pages are split in the main namespace, then to find the desired part of the book you need to locate it via the Wikisource search engine (a heavy database query, since the search index of all Wikisource pages is consulted), then find the right page from the snippets, go to it, and then follow step a). And so on for each typo found in the text.

and then see if we can plan our next round of hardware purchases

Could you clarify this? This issue affects a small number of pages on less popular Wikisource projects, whereas the Commons servers send terabytes of files daily, and Wikipedia serves gigabytes of pages, including huge numbers of sub-requests to the API, the database, and Wikidata. Against this background, the traffic from this issue is a drop in the ocean.

I previously asked, and I'm still waiting for an explanation:

Why not use $wgMaxArticleSize as the limit for the raw page size, and 2*$wgMaxArticleSize as the limit for the post-expand include size?