
Change $wgMaxArticleSize limit from byte-based to character-based
Open, Medium, Public

Description

The maximum article size (AKA post-expand include size) is set to 2048 KB. This limit is configured by the $wgMaxArticleSize variable. We ask to raise the limit to 4096 KB for the Hebrew Wikisource. We have already hit the limit with two heavily accessed pages: the Income Tax Ordinance and the Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970. Those pages are rendered incorrectly due to the limit. Other pages, such as the Transportation Regulations, 5721–1961, are expected to hit the limit in the near future.

Breaking the legal text into sections is not considered a valid solution. Also note that Hebrew characters take two bytes per character, whereas Latin characters take one byte per character. Therefore the limit for Hebrew text is effectively half the limit for Latin text of the same length.
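
To illustrate the disparity concretely, here is a minimal plain-PHP sketch (the strings are arbitrary examples; this is not MediaWiki code):

```php
<?php
// Byte length vs. character length for Hebrew and Latin strings of equal character count.
$hebrew = 'שלום'; // 4 characters, each encoded as 2 bytes in UTF-8
$latin  = 'text'; // 4 characters, each encoded as 1 byte in UTF-8

echo strlen( $hebrew ), "\n";             // 8 (bytes)
echo strlen( $latin ), "\n";              // 4 (bytes)
echo mb_strlen( $hebrew, 'UTF-8' ), "\n"; // 4 (characters)
echo mb_strlen( $latin, 'UTF-8' ), "\n";  // 4 (characters)
```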

Related Objects

Mentioned In
T365819: Shared citations for multiple pages
T365812: Generalized stable link mechanism to both *page* and *section*
T365810: Export a collection of pages as a single document (PDF, HTML, printable) *client-side*
T365808: "Browser search" across related/split articles
T365806: Infinite scroll for articles (split documents on wikisource)
T189108: Increase the « Post‐expand include size » process up to 2.5 MB
T325836: Problem with $wgMaxArticleSize at cswiki
T325665: Increase title length
T325650: Cannot create translated category if its name is >255 bytes in UTF-8
T308796: Lua error: Not enough memory due to several templates in pages
T309568: Investigate moving Wikisource Page namespace to mainspace
T308893: Increase $wgMaxArticleSize to 4MB for ruwikisource
Mentioned Here
T365812: Generalized stable link mechanism to both *page* and *section*
T365819: Shared citations for multiple pages
T365806: Infinite scroll for articles (split documents on wikisource)
T365808: "Browser search" across related/split articles
T365810: Export a collection of pages as a single document (PDF, HTML, printable) *client-side*
rSVN23389: * (bug 10338) Enforce signature length limit in Unicode characters instead of…
T12338: Length of nickname
T15260: post expand size counted multiple times for nested transclusions
T189108: Increase the « Post‐expand include size » process up to 2.5 MB
T309568: Investigate moving Wikisource Page namespace to mainspace
T308893: Increase $wgMaxArticleSize to 4MB for ruwikisource
T239841: Parsoid resource limits and metrics use mb_strlen, which is (a) inconsistent w/ Parser.php and (b) slow
T254522: Set appropriate wikitext limits for Parsoid to ensure it doesn't OOM
T158242: One pl.wikisource page including the text of 700 other pages hits parser limits
T179253: Increase page size limit for map-data on commons
T181907: Wikibooks: increase maximum page size after expansion
T272546: Maximum size reached for my PhD thesis on fr.wikiversity

Event Timeline

Fuzzy raised the priority of this task from Medium to High. Jun 15 2023, 11:33 AM

We have hit the limit once again with the Israeli Income Tax Ordinance. We cannot shorten the templates, and we need an immediate solution.

I concur with [[User:Fuzzy]]; a direct solution to this is needed on Hebrew Wikisource.

The Income Tax Ordinance requires an immediate temporary solution while we continue to ponder the best permanent one.

@Aklapper Is there a way to speed up the handling of this task? It has gone unanswered for a long time.

@Aklapper Seems like a good idea. Can you send it there?

@neriah: No, but anyone who wants to discuss this topic is free to post on the mailing list and explain it there.

Repeating myself from T189108#9054179:

What should be done is not increasing the post-expand include size to some random size, but moving some limits to be symbol-based and not byte-based. Currently non-ASCII wikis have much lower limits because, naturally, texts in their languages have to use Unicode (Кот is 6 bytes and Cat is 3 bytes, for example).

Given what I've read in the discussion above, this is not necessarily about changing the limit to 2 MB for all languages, but about decreasing the limit to an appropriate number of symbols for all languages. It doesn't make much sense that you are essentially better off if your language uses Latin script than if it uses some other writing system. While I can explain that to someone on a technical level, that's only half the story for non-technical people.

For the record, I don't think that the need to be able to build even longer pages justifies fulfilling this request. They can always split the pages, which is better for readability and usability. But it is clearly weird to everyone who encounters this limit that this limit favours ASCII-based languages more.

We cannot simply follow the straightforward advice to write two shorter articles (instead of one) on the grounds that this is more convenient for the users. If you have a heavy, thick manual, you do not take scissors and cut it into two pieces just because they would be easier to carry...

Wikisource editors can absolutely split pages into smaller ones, since those longer pages have sections and subsections. The community choice might be not to split them, but that’s not necessarily a technical problem.

Wikisource editors can absolutely split pages into smaller ones, since those longer pages have sections and subsections. The community choice might be not to split them, but that’s not necessarily a technical problem.

This is not a feasible solution for legal texts, such as the Israeli Income Tax Ordinance or the Israeli Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970. The readers expect us to provide the complete law, not part-by-part.

For the record, I don't think that the need to be able to build even longer pages justifies fulfilling this request. They can always split the pages, which is better for readability and usability.

I disagree. From a usability standpoint, it's better to have the whole document on one page. For example, you can conduct a quick search in the entire text using the built-in browser search.

However, I agree that the current limits discriminate against non-ASCII-based languages.

‘Readers expect us to dump everything on one page’ is just your opinion, and so is ‘from usability standpoint, it’s better to have the whole document on one page’. From usability standpoint, it’s better to have a page that doesn’t weigh 2.3 Mb just in HTML. Some things cannot necessarily be done on separate pages (searching between them with browser tools), but it’s not an insurmountable difference (with built-in search). The fact that Hebrew Wikisource community doesn’t like to present those pages separately doesn’t mean that they can’t be presented separately.

From usability standpoint, it’s better to have a page that doesn’t weigh 2.3 Mb just in HTML.

Could you elaborate on why serving 2.3 Mb of HTML is bad from a usability standpoint?

Wikisource editors can absolutely split pages into smaller ones, since those longer pages have sections and subsections. The community choice might be not to split them, but that’s not necessarily a technical problem.

It is obvious from your words that you have not read the discussion and arguments above, including the fact that all the users from the different Wikisource language projects who spoke here were in favor of expanding the limit, because of the big problems that splitting pages creates for editors and for readers' usability.
Expanding the limit will not cause any technical problems.
Your opinion is only your opinion as a Wikipedia user. This topic doesn't concern you at all.

Could you elaborate on why serving 2.3 Mb of HTML is bad from a usability standpoint?

Because heavy pages load worse for readers, especially on poorer connections. That’s basics of web development. Given that a lot of readers are also on mobile, splitting the pages instead of requiring them to load an egregiously heavy page is better in the long term, even if the editors might not like that answer.

(@Vladis13 please keep in mind https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette as you are resorting to personal attacks and I have said nothing against your Wikisource which follows the Russian law on ‘extremist content’ to a tee.)

Because heavy pages load worse for readers, especially on poorer connections. That’s basics of web development.

Readers don't care directly about the weight of a webpage's HTML. What they care about is how quickly they can access the document in which they're interested. Fortunately, most (if not all) contemporary browsers support webpage streaming, which allows them to begin rendering the webpage before its HTML has completely downloaded. Therefore, the time to Largest Contentful Paint for a webpage, whether it's 0.23 Mb or 2.3 Mb, should be more or less the same. Or not?

(@Vladis13 please keep in mind https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette as you are resorting to personal attacks and I have said nothing against your Wikisource which follows the Russian law on ‘extremist content’ to a tee.)

You call it a personal attack, although I only repeated your own words, "just your opinion". In addition, you personally reproach me over a rule of the Russian Wikisource that is based on the Wikimedia Foundation's Terms of Use policy. Yes, this is all a personal attack, only yours.

None of this is helping move the discussion forward.

Timo's comment in T275319#7947012 is still relevant.

And at the same time, no one has declined this task, or said it won't ever be done. It requires planning, effort, testing and actual prioritisation by the people doing those activities etc.

Readers don't care directly about the weight of a webpage's HTML. What they care about is how quickly they can access the document in which they're interested. Fortunately, most (if not all) contemporary browsers support webpage streaming, which allows them to begin rendering the webpage before its HTML has completely downloaded. Therefore, the time to Largest Contentful Paint for a webpage, whether it's 0.23 Mb or 2.3 Mb, should be more or less the same. Or not?

Yes, on almost any book site where you can download books, they are downloaded in full (1-4 MB or more). This is perfectly readable on smartphones and e-readers, with no problems. No reader will download many separate parts to read; that is extremely inconvenient, and readers simply go to another site. And splitting the pages is hell for editors, wasting a LOT of time.

Representatives of the Hebrew, Russian and Polish Wikisources have already expressed their support here.

Because heavy pages load worse for readers, especially on poorer connections. That’s basics of web development. Given that a lot of readers are also on mobile, splitting the pages instead of requiring them to load an egregiously heavy page is better in the long term, even if the editors might not like that answer.

Perhaps this is true for Wikipedia articles and Twitter. But it is absolutely harmful for books, which are almost impossible to read from the site; they are downloaded to e-readers (only modern ones have Wi-Fi) or to devices from which people read with Wi-Fi turned off.
In addition, books are exported to third-party sites and repositories, which is problematic to do from a pack of separate pages. It's easier to go to another site.
Also, it is now 2023. MS Windows permanently downloads gigantic online updates without fail. Google sends each word you type to its servers for spell checking. An average page on the Internet takes 50-100 KB or more in the browser; even the short https://ru.wikipedia.org/wiki/Main_page is 120 KB. It is not serious to talk about traffic problems or about the convenience of reading and working with books in pieces, always from the site and always online.

None of this is helping move the discussion forward.

Timo's comment in T275319#7947012 is still relevant.

And at the same time, no one has declined this task, or said it won't ever be done. It requires planning, effort, testing and actual prioritisation by the people doing those activities etc.

Let me answer the complexities listed in the T275319#7947012.

Larger articles tend to be harder to discover relevant information in from search, or to navigate and link between.

The question of searching a single text file and of interlinks has been answered in detail above. A single file is incomparably more convenient, both for users and for editors.
In general, I doubt that anyone reads large books from a PC monitor while staying online on the site; it is physically extremely inconvenient. Readers download them as one file. The question is why editors should go through the hell of splitting texts into pages (which are usually already split in the Page namespace), only for readers to reassemble them into one file with https://www.mediawiki.org/wiki/Extension:Collection when downloading. All this takes a lot of effort and server requests.

I answered on the issue of memory and the mentioned JS and PHP extensions in T275319#7977747.

Larger articles are slower to edit, as they are downloaded and uploaded every time.

This is not relevant for Wikisource, because proofreading is performed in the Page namespace, on small pages that are already split and then transcluded into the main-namespace page.
Otherwise, if a user edits a section in the main namespace, only that section is fetched from and sent to the server, as a small request (the &section= argument in the URL, e.g. https://www.mediawiki.org/w/index.php?title=Help:Editing_pages&action=edit&section=1).

That is, editing on one page is faster than on split ones. For example:
a) You are reading a downloaded book and find a typo. You open the page in Wikisource, press Ctrl-F to search for the word or jump to the section via the table of contents at the top, then edit the section (only that section is quickly loaded into the editor) or click on the page number to go to a small subpage in the Page namespace. Done.
b) If the pages are separate in the main namespace, then to find the desired part of the book you need to locate it via the Wikisource search engine (a heavy query to the database, since the search index of all Wikisource pages is consulted), then find the right page from the snippets, go to it, and then follow step a). And so on for each typo found in the text.

and then see if we can plan our next round of hardware purchases

Could you clarify this? This issue affects a small number of pages on less popular Wikisource projects, whereas the Commons servers send terabytes of files daily, and Wikipedia serves gigabytes of pages, with countless subrequests to the API, the database and Wikidata. Against this background, the traffic involved here is a drop in the ocean.

I previously asked, and I am still waiting for an explanation:

Why not use $wgMaxArticleSize as the limit for the raw page size, and 2*$wgMaxArticleSize as the limit for the page post-expand include size?

In short: it's not obvious what the new limit "should be", and in fact it's fairly certain that whatever the new limit is, there will still be source texts which will exceed it.

@Krinkle's reply above https://phabricator.wikimedia.org/T275319#7947012 is an excellent summary of the issues and points a way forward toward further understanding of the issue.

@Reedy and @cscott are WMF Senior Engineers and @Krinkle is a WMF Principal Engineer, so those answers (and questions posed) are as authoritative as you're likely to get absent a Director or C-level WMFer commenting here.

I wish to avoid the discussion of what the page-length limit "should be" in characters.

The issue is with setting the limit in bytes rather than in characters. As said, Hebrew text requires two bytes per character, whereas Latin text requires one byte per character. As a result, the same byte limit allows twice as many Latin characters as Hebrew characters, effectively halving the allowable length of Hebrew text and creating an inequitable constraint on Hebrew content compared to Latin-based languages.

However, I suggested a straightforward change to address our immediate need, which is using 2*$wgMaxArticleSize as the limit for the page post-expand include size.

I wish to avoid the discussion of what the page-length limit "should be" in characters.

For a database or a CPU, characters don't matter; bytes do.

This comment was removed by Fuzzy.

I wish to avoid the discussion of what the page-length limit "should be" in characters.

For a database or a CPU, characters don't matter; bytes do.

For usability, URS and UX, bytes don't matter, characters do.

While non-Latin characters take twice as much space, Arabic and Hebrew scripts don't write short vowels (except for children or for disambiguation), so the average number of letters per word is much lower than in Latin scripts. In other words, even if you translated it to English, it would still go just above the threshold.

Many wikisource wikis split large laws into chapters and even sections. Here is an example:
https://en.wikisource.org/wiki/United_States_Code/Title_17/Chapter_1/Sections_105_and_106

While non-Latin characters take twice as much space, Arabic and Hebrew scripts don't write short vowels (except for children or for disambiguation), so the average number of letters per word is much lower than in Latin scripts.

This is true, but not all other languages have the same "advantage". Russian routinely has bigger words and writes its vowels. I don't necessarily think $wgMaxArticleSize is the thing to change, though, since 2 MB pages are not really a useful need, but things like PEIS should definitely move to counting symbols rather than bytes, because byte counting privileges Latin-based languages.

This discussion risks going in circles. As I wrote previously in T275319#6884320:

zhwiki for example should have 4x the character limit if this is to be the new rule. Unlike what is claimed above, many of the performance metrics *do* scale with bytes rather than characters -- most wikitext processing is at some point regexp-based, and that works on bytes (unicode characters are desugared to the appropriate byte sequences), and of course network bandwidth, database storage size, database column limits, etc, all scale with bytes not characters. We should be careful before bumping the limit that we're not going to run into problems with database schema, etc.

And as @Reedy stated in T275319#9057445:

Timo's comment in T275319#7947012 is still relevant.

And at the same time, no one has declined this task, or said it won't ever be done. It requires planning, effort, testing and actual prioritisation by the people doing those activities etc.

Although I'm not going to close this task as declined, as @Reedy says, I do think it would be more worthwhile to focus on the actual use cases as @Krinkle had suggested, which is more likely to yield an actionable product decision. For example, we could bind Ctrl-F to our internal wiki search and provide a mode with a default search filter "in the same book as the present page" and solve the searchability issue with books split across multiple articles. That sort of thing is possible to demo in a hackathon and is probably easier to resource than simply increasing the article page size limit every time someone finds a source text which does not fit within the current limit.

FWIW, I’ve read the comment and I disagree that my point above should be disregarded just because ‘it scales with bytes’. PEIS for example is not an objective measure, but a metric that is supposed to track the complexity of the templates on a page. The complexity of the templates on a page does not increase if I write in Russian instead of English, and Russian Wikipedia should not be punished by metrics that monolingual people came up with 15 years ago.

The length of the Israeli laws themselves consistently falls below the intended limit, but we use complex templates for the presentation of the legal text, which may expand above the limit. For instance, the Income Tax Ordinance is 1,512,975 bytes (896,491 characters), expanding to 2,230,303 bytes; the Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970 is 1,225,044 bytes (750,579 characters), expanding to 2,044,793 bytes; and the Transportation Regulations, 5721–1961 is 1,252,942 bytes (756,967 characters), expanding to 2,096,963 bytes.

We can see that, in general, when length is measured in bytes, the allowed text is reduced by about 40% due to the two bytes per Hebrew character. However, we also see that for our immediate concerns (not the general issue of Hebrew vs. Latin texts), the issue doesn't primarily lie with the maximum page length, regardless of how it's measured. Instead, it seems to arise from the template expansion mechanism, which increases the page length by about 70%. It seems the post-expansion size is calculated incorrectly. For instance, if a template verifies the existence of a parameter {{{1}}} using {{#if:{{{1|}}} | ... }}, the post-expansion size is increased by the length of {{{1}}} even if it doesn't appear in the post-template-expansion text.

@stjn you are correct that this particular issue is a mix of social and technical factors as I pointed out in T275319#6884320. The technical factors absolutely scale with bytes; the social factors scale with <a more complicated metric related to information entropy>.

This discussion risks going in circles. As I wrote previously in T275319#6884320:

zhwiki for example should have 4x the character limit if this is to be the new rule. Unlike what is claimed above, many of the performance metrics *do* scale with bytes rather than characters -- most wikitext processing is at some point regexp-based, and that works on bytes (unicode characters are desugared to the appropriate byte sequences), and of course network bandwidth, database storage size, database column limits, etc, all scale with bytes not characters. We should be careful before bumping the limit that we're not going to run into problems with database schema, etc.

And as @Reedy stated in T275319#9057445:

Timo's comment in T275319#7947012 is still relevant.

And at the same time, no one has declined this task, or said it won't ever be done. It requires planning, effort, testing and actual prioritisation by the people doing those activities etc.

It seems you have ignored my replies to this: T275319#7977747 and T275319#9057620. If you don't notice the answers, the discussion really does risk going in circles.

And at the same time, no one has declined this task, or said it won't ever be done.

Although I'm not going to close this task as declined

These answers look strange... like “no one has declined this task, but we will never do this in principle.” ;-)

This has been discussed for more than 3 years. By the way, today is exactly 2 years (to the day) since I opened T308893. Yet to this day we are still discussing what was already in its description: that in UTF-8, English characters are encoded in 1 byte, while characters of other languages take 2 or more bytes.

Honestly, I think this should be declined, as this is an X/Y problem. I understand you need to see the content of the whole law in one place, but that doesn't mean the page size limit should increase. A proper solution here is to have a way to see a full book or set of pages in one place; think of Google Docs or a PDF file. That is doable but needs to be resourced, prioritized and implemented. It could be useful in many other areas too (being able to read a large book as an "infinite scroll" on the wiki).

In other words, increasing the page limit puts much more stress on the infrastructure to solve a UX problem that can be solved in different ways.

Multi-page rendering wasn't implemented in the new PDF renderer because, well, PDF renderer memory requirements also scale with bytes.

I suppose having some kind of special page or gadget which just stitches together multiple HTML blobs from the parser cache would be straightforward.

I've compiled a table detailing some of the largest legislative texts found on Hebrew Wikisource. Before drawing any conclusions, it's essential to carefully examine the data:

| Legislative Title | Raw Size (bytes) | Raw Size (chars) | Reported PEIS (bytes) | Actual PEIS (bytes) | Actual PEIS (chars) |
| --- | --- | --- | --- | --- | --- |
| Income Tax Ordinance | 1,512,975 | 896,490 | 2,230,303 | 2,082,925 | 1,484,844 |
| Planning and Building (Application for Permit, Conditions and Fees) Regulations, 5730–1970 | 1,225,044 | 750,577 | 2,044,793 | 1,832,802 | 1,369,277 |
| Transportation Regulations, 5721–1961 | 1,252,942 | 756,965 | 2,097,656 | 1,890,638 | 1,412,173 |
| Civil Aviation (Operation of Aircraft and Flight Rules) Regulations, 5742–1981 | 1,275,991 | 758,510 | 1,629,839 | 1,786,991 | 1,394,914 |

The Raw Size represents the actual size of the page, measured in bytes and characters. The Reported PEIS is the post-expand include size as provided by the NewPP parser report, measured in bytes. If the PEIS exceeds 2,097,152 bytes, it's calculated by parsing the page in parts. The Actual PEIS is the actual HTML size after parser processing, measured in bytes and characters. It is measured using Special:ExpandTemplates. This metric is what the post‐expand include size should be when compared to the maximal article size limit.

What now?

On average, there's about a 1.65 ratio between measuring length in bytes versus characters. (The "Actual PEIS" ratio is 1.35 since all templates have been expanded to HTML tags, which are one byte per character.) This discrepancy highlights a penalty imposed on non-Latin texts when length is measured in bytes rather than characters.

The primary purpose of the $wgMaxArticleSize variable is to keep the page size under a predefined, configurable limit, currently set to 2 MB, or 2,097,152 bytes. This limit should be applied at two checkpoints within the parser. The first checkpoint limits the raw size of the page (excluding HTML comments, <nowiki> tags, etc.). The second checkpoint comes after all template expansions, when the parser generates the HTML representation of the page. This second checkpoint is necessary to prevent the construction of excessively long pages through template inclusions.

However, the current method of PEIS calculation is irrelevant to these checkpoints. As explained here, the PEIS is the sum of the lengths of the expanded texts generated by templates, parser functions, and variables. It is noted there that the sizes of the texts of all expanded templates and parser functions are added, even in the case of nesting templates (See T15260). While this metric measures the complexity of the page, it should be bounded differently from the maximal page size, as its goal is to prevent malformed recursive calls and inefficient template usage.
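
To make the double-counting concrete, here is a toy PHP sketch (illustrative only, not the parser's actual accounting code), assuming a counter that is incremented with the full output of every expansion:

```php
<?php
// Toy model of cumulative post-expand include size (PEIS) accounting.
// Each "template" returns some expanded text; the counter adds the full
// byte length of that text at every nesting level.

$peis = 0;

function expandInner( int &$peis ): string {
    $out = str_repeat( 'a', 1000 ); // inner template expands to 1000 bytes
    $peis += strlen( $out );
    return $out;
}

function expandOuter( int &$peis ): string {
    $out = '<div>' . expandInner( $peis ) . '</div>'; // outer template merely wraps it
    $peis += strlen( $out );
    return $out;
}

expandOuter( $peis );
echo $peis; // 2011: the inner 1000 bytes are counted twice (cf. T15260)
```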

What next?

Some commenters suggest splitting the legislative texts into parts. While the basic logic is sound — if we raise the limit today, others might ask to raise it again, leading to infinitely long pages — this doesn't apply to the legislative texts in Hebrew Wikisource. The size of our heaviest legislative text is still less than 1,048,576 characters, which is half the intended limit of 2,097,152 characters. Unfortunately for Hebrew texts, the limit is set in bytes rather than characters. Additionally, there are several reasons why splitting legislative texts of reasonable size is not a viable solution:

  • Contextual Understanding: Legislative texts often rely heavily on context, with sections referencing and building upon each other. Splitting the text can disrupt the flow and make it harder for readers to understand the full context of the law.
  • Numerous Internal and External References: Unlike a book, a legislative text contains numerous internal and external references. For example, if Article X mentions Article Y, there will be a hyperlink to Article Y in the text. According to those who advocate splitting, clicking the hyperlink would redirect the reader to another page containing Article Y. This results in a poor user experience as readers have to jump from sub-page to sub-page.
  • Link Integrity: Maintaining the integrity of hyperlinks in split texts is difficult. If a section is moved or renamed, all related links need to be updated, which is error-prone. Additionally, it will be impossible to link to a specific article without knowing the arbitrary division of the text.
  • Unsearchable Split Texts: Split texts become unsearchable, making it difficult to find specific information within the legislative text.
  • Exporting Issues: MediaWiki doesn't support exporting a divided legislative text back into a single document.

As far as I see it, the problem can be addressed in several ways:

A. Fixing the bias against non-Latin texts. This can be achieved either by measuring the size in characters instead of bytes, or by manually increasing $wgMaxArticleSize for non-Latin sites (a configuration sketch appears below, after option B).

B. Using a proper PEIS metric: The current PEIS metric is not suitable for setting the limit from a social perspective. It was implemented from a technical perspective and harms complicated templates that use nesting and parser functions. The proper method is to calculate the real PEIS, which is the size of the HTML snippet the parser generates, and limit its length to $wgMaxArticleSize. The current PEIS metric should be bounded by 2*$wgMaxArticleSize, to avoid recursive and inefficient templates.
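
For illustration of option A's second path, here is a hedged LocalSettings.php-style sketch (the doubled value and the wiki-name check are assumptions for this example, not a deployed configuration change):

```php
<?php
// Illustrative sketch only. $wgMaxArticleSize is expressed in kilobytes;
// the MediaWiki default is 2048 (roughly 2 MB of UTF-8 text).
$wgMaxArticleSize = 2048;

// Hypothetical per-wiki override compensating for a 2-bytes-per-character script,
// so the character budget roughly matches that of Latin-script wikis.
if ( $wgDBname === 'hewikisource' ) {
    $wgMaxArticleSize = 4096;
}
```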

@Fuzzy you may be interested in T254522: Set appropriate wikitext limits for Parsoid to ensure it doesn't OOM, which will eventually replace the limits in the legacy parser. Appropriate metrics are not easy to find, because ideally they must be computed *before* spending the compute resources that a full computation of the desired result would require. That is why there are separate limits on article size and expanded size (and cpu time, and expansion depth, and expensive function count, and visited postprocessor nodes, etc). At every point we try to avoid spending the resources to do the actual expansion if it is likely, based on "what we already know", that the other limits would fail. If we do the entire expansion and rendering to HTML and *then* check to see if it turned out to be too big we're already too late to reclaim the resources spent.

I'll note that another compounding factor here is editor predictability. Our existing metrics aren't perfect in this regard (see some of the discussion above and at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_211#WP:PEIS about markup "improvements" that didn't actually improve the limited metrics) but the goal is not just to prevent bad things from happening but also to ensure that failures and their remedies are reasonably intelligible to editors encountering them. "HTML size" isn't a great metric because it's not obvious to the editor at all why <strong> should "cost" more than <b> or <i>. Wikitext-based metrics (a) seem to correlate reasonably well with HTML size, (b) are understandable by editors, and (c) can be computed very early, before actual parsing. Those are all useful qualities for any proposed replacement metric.

@Vladis13 I have read all of your comments. They are considering only the social and client-side factors and ignore the server-side issues which are the actual blocker here. As a reminder: these limits are fundamentally preventing DoS attacks. For every "good" use of an expanded size, we must also consider the ways those larger limits would be abused by "the bad guys". The fact that things seem to work fine when everyone is playing by the rules ignores the ways that those larger limits can be leveraged into further attacks of various kinds. Although computer science is a quest for linear-time algorithms, there are still things which don't scale linearly, so "just" increasing a limit by (say) 2x doesn't mean that a malicious actor can "only" use 2x more resources, sadly.

Again, none of these are absolute blockers. But the fact that a proper solution has to weigh and balance many different factors, both social and technical, both client and server-side, is why I am inclined to agree with @ladgroup and why I suggested finding other ways to address the issues enumerated by @Fuzzy (Contextual Understanding, Internal and External References, Link Integrity, Unsearchable Split Texts, and Exporting Issues) which think outside the "we just need larger article limits" box. I think good solutions to @Fuzzy's issues would benefit a lot more use cases than just "legal texts on hewikisource".

Agree with @Fuzzy's suggestion above.

However, I suggested a straightforward change to address our immediate need, which is using 2*$wgMaxArticleSize as the limit for the page post-expand include size.

It doesn't make sense that, because each Hebrew letter counts as two bytes, we can insert texts in Hebrew that are only half as long as in languages written in Latin letters.

Thanks.

[...] Appropriate metrics are not easy to find, because ideally they must be computed *before* spending the compute resources that a full computation of the desired result would require. [...] At every point we try to avoid spending the resources to do the actual expansion if it is likely, based on "what we already know", that the other limits would fail. If we do the entire expansion and rendering to HTML and *then* check to see if it turned out to be too big we're already too late to reclaim the resources spent.

The parser does not prevent excessive text in advance; it merely trims the text or halts further processing at a certain point. Therefore, I've suggested a more nuanced approach using three related metrics instead of the current two: using $wgMaxArticleSize to trim the raw page size before processing, using 2*$wgMaxArticleSize to limit recursive or inefficient template expansions, and using $wgMaxArticleSize to trim the Actual PEIS (the plain HTML length, i.e. excluding markup). This approach balances the need to manage resources effectively while still allowing for complex and content-rich pages.
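
A minimal sketch of how those three checks could relate to one another (hypothetical function and variable names; this is not the parser's actual code):

```php
<?php
// Three related limits, all derived from the single configured $wgMaxArticleSize:
// 1) raw wikitext size before any processing,
// 2) the existing cumulative PEIS counter, bounded at 2x to stop runaway expansion,
// 3) the size of the text actually produced by expansion ("actual PEIS").
function checkProposedLimits(
    string $rawWikitext,
    int $cumulativePeis,
    string $expandedText,
    int $maxBytes
): array {
    return [
        'rawSizeOk'      => strlen( $rawWikitext ) <= $maxBytes,
        'cumulativeOk'   => $cumulativePeis <= 2 * $maxBytes,
        'expandedSizeOk' => strlen( $expandedText ) <= $maxBytes,
    ];
}
```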

Again, none of these are absolute blockers. But the fact that a proper solution has to weigh and balance many different factors, both social and technical, both client and server-side, is why I am inclined to agree with @ladgroup and why I suggested finding other ways to address the issues enumerated by @Fuzzy (Contextual Understanding, Internal and External References, Link Integrity, Unsearchable Split Texts, and Exporting Issues) which think outside the "we just need larger article limits" box. I think good solutions to @Fuzzy's issues would benefit a lot more use cases than just "legal texts on hewikisource".

While I appreciate the suggestion to find alternative solutions which address the broader issues, the situation with legislative texts on Hebrew Wikisource requires more immediate action. Waiting for potential feature requests to be implemented in the distant future is not feasible. These legislative texts need to be properly presented now, and my proposed solutions provide a balanced and relatively simple way to achieve this without breaching either the social limitations or the technical ones.

Hello to everyone! I work primarily on the Hebrew Wikisource site, mostly on religious texts. While I can in no way be as thorough or explanatory as @Fuzzy regarding the technical elements discussed, I would like to let it be known that the size limit has definitely been a thorn in my side for many years now on a number of projects. As was suggested by some, I had of necessity to split texts into two separate pages; one example would be ZOHAR PARASHAT PEKUDEI (a particularly long section).

I would appreciate it if you could please reconsider granting higher limits to the wikis that request it. If you feel that doubling the size is excessive, then please at least consider raising it somewhat. Meet us somewhere in between. It just feels like no consideration is being given to the needs that we, the end users, are reporting to you, an opinion formed from years if not decades of using the interface.

I have no technical expertise with which to persuade you to change the size limit, but I can only say it feels like there is a lack of flexibility in some of the opinions here toward even offering a partial raising of the limits, as a sign of goodwill at the very least, in response to the requests of countless users from various language sites.

In any event, blessings to all the wonderful people involved in this most wonderful of projects. May you continue in the wonderful work you're all doing!

I would like to point out that a similar issue arose a long time ago, regarding the length of custom signatures, which was expressed in bytes rather than characters (see T12338: Length of nickname and rSVN23389: * (bug 10338) Enforce signature length limit in Unicode characters instead of…). This precedent demonstrates that addressing size limits in a way that accounts for non-Latin characters is not without foundation.

@cscott, if you are concerned about the performance impact of mb_strlen(), it is possible to use UTF-32 encoding within the parser. This approach could potentially improve the text processing time of the parser in general.

@cscott - many thanks to you for your help, both with regard to this thread and the work you do for Wikimedia as a whole. Blessings,

Change #1035427 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add $wgArticleMaxSizeChars for character-size limits on page size

https://gerrit.wikimedia.org/r/1035427

Change #1035428 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add $wgAPIMaxResultSizeChars for character-size limits on API results

https://gerrit.wikimedia.org/r/1035428

So, having written the above two patches to replace byte-size limits with character-size limits, let me set out why I'm afraid this is a Bad Idea and I've done a Bad Thing by even bringing up the possibility of a technical fix here. Sorry:

  • Solving the five tasks outlined by @Fuzzy in T275319#9818815 above would be a much better solution to the root cause problems here, and provide features generally useful to a number of projects outside this one particular case:
    • Contextual Understanding: solutions to this might involve the sort of 'infinite scroll' UX common on the web, where we could chain separate pages together to provide a seamless reading experience. This could work on any number of pages where there is both a hierarchical or other structure as well as a suggested reading order (API docs, historical events, etc.). T365806
    • Internal and External References: the Cite extension could be improved to better handle this case. Having citations stored in a separate page or subpage and then referenced in a way that would allow a final "references" section to combine them would be generally useful. I know the "citations in wikidata" folks are also thinking in this general direction. Other wikisources have also solved this problem with templates which manually insert links to a final end-notes section. Improving support for this sort of pattern would be great. T365819 is one proposal.
    • Link Integrity: Again a super general problem which would benefit from work. We have permalinks and a link shortener, and there are templates like {{anchor}} as well. DiscussionTools has also done some work on maintaining links when sections are renamed and the page title changes (in DT's case, existing topics moved to archive pages). We just need to put the pieces together into a standard solution that allows you to write an anchor to a specific section of a page which is robust against both the section being renamed and the section being moved to a different title. T365812
    • Unsearchable Split Texts: Just bind Ctrl-F to a search query that includes all pages in a certain category or that share a page prefix. @Tgr had some ideas here, apparently the implementation is not hard at all. T365808
    • Exporting Issues: This was the Collection extension's entire raison d'être. It has suffered from neglect and a lack of maintainers -- as well as from DoS issues which touch on common problems we've discussed here. Creating a PDF for a book containing hundreds of pages of content is computationally expensive. Some way of providing that feature while also protecting it from abuse is needed, but this is something which could 100% be built in Wikimedia Labs. T365810
  • Increasing size limits is a one-way ratchet. Once articles of increased size are allowed through and stored in the database, it is really hard to get them back out. For better or worse, MediaWiki's article size limits were built around preventing overly large articles from being stored in the first place, and the code for dealing with articles already in the database that exceed the limits is comparatively immature. I'd like to say "we'll try out larger limits on wiki X for a while, and if this leads to problems (with resource consumption, DoS attacks, etc) then we'll just bump them back to what they were" but that is, unfortunately, not straightforward from an SRE perspective. We'd need a rollback strategy Just In Case before we actually deployed something like this.
  • @Krinkle mentioned that there are various situations where we *do* actually want/need to know specific byte size limits on different things. The particular approach in the patch above limits byte size to 4x the character limit, due to the way that UTF-8 works, but that's not a particularly tight bound, and there are extensions of UTF-8 which permit 5- or 6-byte characters as well. You could consider counts based on PHP's grapheme_strlen but that could lead to even larger bytes-per-grapheme counts. We would probably want a combination of byte- and character-based limits just to ensure some amount of predictability (a short illustration follows this list).
  • As elaborated at length above, these patches are only a band-aid, and it is a near certainty that new source texts will be found which violate any newly raised limits. Solving @Fuzzy's five tasks would be a permanent solution.
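
As a short illustration of the 4x bound mentioned in the list above (plain PHP with sample strings; not part of the patches themselves):

```php
<?php
// UTF-8 encodes each character in 1 to 4 bytes, so a character limit implies
// a byte bound of at most 4x that limit.
$samples = [ 'text', 'שלום', '漢字', '😀' ]; // 1-, 2-, 3- and 4-byte characters
foreach ( $samples as $s ) {
    printf( "%s: %d chars, %d bytes\n", $s, mb_strlen( $s, 'UTF-8' ), strlen( $s ) );
}
// For any UTF-8 string: strlen( $s ) <= 4 * mb_strlen( $s, 'UTF-8' ).
```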

So with all the reasons why the patches above are a Bad Idea taken care of, I'll briefly say why I wrote the patches. First, not increasing the limits but simply switching them to character-based is a largely neutral change for our largest wikis, which tend to be Latin script. So the effects of the change on resource consumption, while still unknown and potentially frightening (from a DoS or SRE perspective), are at least somewhat limited. Second, this change is "more principled" than just increasing the limit on a particular wiki: we address the legitimate equity issues without uncorking a torrent of "me too" requests for larger limits whose varied justifications we'd have to arbitrate. The principle is "all wikis have the same character limit". This doesn't completely address the equity issue, as some wikis will still have smaller grapheme-to-character ratios, but we'd be closer. Third, when I looked at the code, the places which touch MaxArticleSize (and the related APIMaxResultSizeChars) are fairly limited, so this wasn't a large or particularly scary patch from the code review perspective. I'm not too worried that I missed a check somewhere and allowed a new DoS attack to bypass limits entirely.

But I am quite concerned I've distracted from finding proper solutions to @Fuzzy's five tasks, which is where I would really like to focus attention.

@cscott, thank you very much for your work on this issue. I completely agree that changing the maximum article size is not the right approach. Instead, we need to adjust the measuring scheme so that non-Latin texts are not penalized. Your patches, which replace strlen() with mb_strlen() and measure page size in characters instead of bytes, effectively resolve the issue for Hebrew Wikisource. Given this, perhaps "Change $wgMaxArticleSize limits from byte-based to character-based" would be a more appropriate title for the task.

As a side note, it seems there is a bug in the current PEIS implementation. In my table above, the reported PEIS for the Civil Aviation Regulations appears inconsistent with the other reported PEIS values. I have not investigated this issue further.

I've created the following five feature requests as strawdog proposals to address each of @Fuzzy's five concerns. Feel free to poke holes in my proposed solutions, suggest improvements or alternatives, etc. But I want to keep the focus on making split documents work better, since we're going to continue to butt up against article size limits.

All of those proposals are fundamentally worse (and probably less accessible) from a UX perspective, so I hope they will be confined to specific Wikisource pages and will never become a thing on all pages in all projects. Infinite scroll in particular is a plague of the modern web and is not a practical solution to any problem. Breaking browser search without the user's consent is also not a solution to any problem.

Please comment on the specific tasks, your concerns are addressed there. But also feel free to suggest other solutions! The point is that @Fuzzy's list of desiderata is an excellent one, and we should continue to work on finding solutions to those issues that don't necessarily involve increasing article size ad infinitum. There are plenty of large topics that could benefit from UX improvements to better tie together a collection of separate pages.

This comment was removed by Soda.

I will note that the Exporting Issues point should be solved by the Wikisource extension, which adds a download button on pages that can be downloaded.

Fuzzy renamed this task from "Raise limit of $wgMaxArticleSize for Hebrew Wikisource" to "Change $wgMaxArticleSize limit from byte-based to character-based". May 30 2024, 9:04 AM

While discussing performance issues on Discord, I looked at https://he.wikisource.org/wiki/פקודת_מס_הכנסה (Income Tax Ordinance?) again and saw this:

[attached screenshot: image.png, 28 KB]

In this case (and, I assume, many others) there is a link that in wikitext-like pseudocode is something like <span class="law-external">[[Link target|<span title="Link target">displayed text</span>]]</span>. This seems wasteful if you want to cut down on PEIS, so it would be good to look at converting that template to output less wikitext in such cases. It seems to be caused by the https://he.wikisource.org/wiki/תבנית:ח:חיצוני template, although I might be wrong, since I don't read Hebrew and it is hard to navigate an RTL wiki. I am not sure how many times it is used on the page, but I think fixing that can contribute to lowering the PEIS score, since every single byte counts there.

What would you suggest to reduce the template size? The external <span class="..."> is necessary for styling the <a> tag, and the internal <span title="..."> is needed to override the partial hint created by the [[Link target]], which generates an <a href="..." title="Link target">.

When dealing with the PEIS, how would you handle the following problem (see T275319#9814438): if a template verifies the existence of a parameter {{{1}}} using {{#if:{{{1|}}} | ... }}, the post-expansion size is increased by the length of {{{1}}} even if it doesn't appear in the post-template-expansion text. How would you avoid the PEIS increase?

The external <span class="..."> is necessary for styling the <a> tag, and the internal <span title="..."> is needed to override the partial hint created by the [[Link target]], which generates an <a href="..." title="Link target">.

I get the point of the external class, but I don't get the point of the internal one. The code in the template is [[{{{1}}}|<span title="{{{1}}}"> — why is that needed? The link itself generates title="{{{1}}}"; this is just duplication for no reason.

Regarding the second paragraph: in the case of this template, you could rewrite the entire thing as a Lua module, and it would actually reduce the PEIS, given that every #invoke:String right now doubles its PEIS contribution in the current code. If it were just <span class="law-external">{{#invoke:Module|main}}</span>, you would win on PEIS compared to the wikitext version.

I get the point of the external class, but I don't get the point of the internal one. The code in the template is [[{{{1}}}|<span title="{{{1}}}"> — why is that needed? The link itself generates title="{{{1}}}"; this is just duplication for no reason.

When {{{1}}} contains an anchor, for example when the link points to [[Transportation Regulations#article 7]], the hint (title) shows "Transportation Regulations" instead of "Transportation Regulations#article 7". The destination might differ from what the text implies, so it's important to display it as a hint.

Regarding the second paragraph: in the case of this template, you could rewrite the entire thing as a Lua module, and it would actually reduce the PEIS, given that every #invoke:String right now doubles its PEIS contribution in the current code. If it were just <span class="law-external">{{#invoke:Module|main}}</span>, you would win on PEIS compared to the wikitext version.

I'll need to look into this. Lua might be a solution for specific scenarios, but generally, PEIS shouldn't penalize such templates.

Then it should be added only where anchors are involved. I suggest converting to a Lua module and adding those checks to make PEIS smaller. Since every page has a pretty long name, in most cases it is just wasteful.

And yet another law – the Israeli Pharmacist Ordinance – has just gone over the limit...