Request for input on zhwikisource storage limit
Open, Needs Triage, Public

Description

Background: China Judgments Online Preservation Program.
In December 2025, I proposed importing Chinese court judgment documents to zhwikisource (zhwikisource proposal, metawiki discussion). On January 28, 2026, the zhwikisource community approved the proposal.

However, the amount of material to import is very large: about 85 million new pages, roughly a 20-60x increase compared to the current size of zhwikisource. (Linking to Wikidata items is not planned at all in the formal proposal, for the same reason.) Given the scale, I would want to know if the current zhwikisource can actually hold that many pages.

Nemoralis from the Wikisource Telegram group chat mentioned that a request could be made to migrate zhwikisource to a new db cluster. Therefore, if zhwikisource cannot hold that many pages, I request a storage upgrade for zhwikisource.

Event Timeline

[This is not about All-and-every-Wikisource but about a single site, thus removing project tag]

I would want to know if the current zhwikisource can actually hold that many pages.

How many pages is this about? It seems to be 85 million?

What approximate size is this about?

I am not sure what DBA should be doing here, can you elaborate a bit what you'd need from us? In terms of database size, it is tiny.

Note that only the metadata of revisions (about 100 bytes per revision) is stored in the database. The actual text is stored in External Storage, which is shared among all wikis.
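
For a rough sense of scale, here is a back-of-the-envelope sketch of the revision metadata that figure implies. It assumes exactly one revision per imported page (my assumption, not something stated in the task) and ignores every other table, such as the links tables, so treat it as a lower bound rather than a capacity estimate.

```python
PAGES = 85_000_000        # page count from the task description
BYTES_PER_REVISION = 100  # approximate per-revision metadata size quoted above


def revision_metadata_bytes(pages, revisions_per_page=1,
                            bytes_per_revision=BYTES_PER_REVISION):
    """Estimate revision metadata size; revisions_per_page=1 is an assumption."""
    return pages * revisions_per_page * bytes_per_revision


total = revision_metadata_bytes(PAGES)
print(f"{total / 10**9:.1f} GB of revision metadata")  # 8.5 GB of revision metadata
```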

Yes, my estimate is that the 85 million pages would take ~27 MiB by raw content. The main issue is the number of pages.

To be more specific, would the sheer number of pages make the site slower? It would be reassuring if it does not.

85M pages is a lot. It would immediately make zhwikisource the third or fourth largest wiki. The text itself is not an issue: it will go to ES, which is an append-only cluster, and we can add more hardware easily. However, an explosion of templatelinks/pagelinks/etc. will cause issues.

Things to consider:

  • These extensions shouldn't be enabled on the wiki: DPL (also called intersection), which has caused major outages due to bot imports before, and FlaggedRevs (also called pending changes). Maybe FR would be okay under "protect mode".
  • Run it slowly. Use the same edit summary to avoid an explosion of the comment table.
  • We could probably move the wiki to large.dblist.

(and more? I need to think a bit)
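
The "run it slowly, same edit summary" advice above could look roughly like the sketch below. Everything named here is hypothetical: `save_page` stands in for whatever the bot framework uses to write a page, `IMPORT_SUMMARY` is a placeholder text, and the default rate is an arbitrary illustration, not a limit anyone in this task has agreed on.

```python
import time

# Hypothetical fixed summary; the point is that it never varies per page.
IMPORT_SUMMARY = "Import court judgment"


def run_import(pages, save_page, edits_per_minute=10, sleep=time.sleep):
    """Save each (title, text) pair with one shared summary, pausing between edits.

    save_page is a caller-supplied function (assumed signature:
    save_page(title, text, summary)); sleep is injectable for testing.
    """
    delay = 60.0 / edits_per_minute
    count = 0
    for title, text in pages:
        save_page(title, text, IMPORT_SUMMARY)
        count += 1
        sleep(delay)  # throttle so the import does not arrive as a burst
    return count
```

Because every edit passes the same string, per-page details such as source links would have to live in the page content rather than in the summary.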

It is not _that_ easy to add new hardware. While I think I understand what you mean, I just want to make sure it is clear that adding hardware "on-demand" isn't something we can do on a normal basis.

Yes. Sorry, I should have been clearer. Comparatively, it's easier to expand ES than to deal with a core cluster getting too large. It is also worth noting that the text gets compressed and that ES reads a lot more from disk, etc. That is, ES is designed to store large blobs of text, whereas MediaWiki core DBs are not designed for a really large number of pages.

I have suggested to @Supergrey1 that we may prepare all the pages on a self-hosted MediaWiki instance (or otherwise prepare the XML-formatted documents), then hand over the XML dumps to the Foundation and import all the pages from the backend. That may relieve the stress on the HTTP APIs, allow more flexibility for the technical teams, and therefore speed up the process. Is this possible, and would this be preferred over bots that use the API?

Bots are definitely preferred. The import functionality in MediaWiki is not designed for large-scale imports. When language committee members try to import from Incubator (for new wikis), which is at most around 1,000 pages, they constantly run into issues. Once, they had to split an import into three small XML files, and it took them a full day to deal with it.

Got it, thanks.

move the wiki to large.dblist.

I believe we should fix T291892: Overhaul or delete makeSizeDBLists.php and run it periodically. In the meantime, note that the last systematic update of that list was in 2019.

@Supergrey1 In your trial runs, you used different edit summaries for different pages. Per sysadmin advice, you should use the same edit summary to avoid an explosion of the comment table. It's fine to have the links in-page, as the page content is going to vary anyway.

Okay, I'll remove source links from edit summaries.

@Supergrey1 Just a reminder: you should use the same edit summary for each type of edit to avoid an explosion of the comment table, including edits to redirect pages, disambiguation pages, and so on.

Good catch! Fixed those edits as well. The bot script should no longer produce unique edit summaries.
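
For anyone following along, one way a bot could guarantee "one summary per edit type" is a fixed lookup table, sketched below. This is not the actual bot script: the `EditType` names and the summary strings are made up for illustration.

```python
from enum import Enum


class EditType(Enum):
    # Hypothetical edit categories, based on the page types mentioned above.
    JUDGMENT = "judgment"
    REDIRECT = "redirect"
    DISAMBIGUATION = "disambiguation"


# One fixed summary per edit type; nothing page-specific is interpolated.
SUMMARIES = {
    EditType.JUDGMENT: "Import court judgment",
    EditType.REDIRECT: "Create redirect for court judgment",
    EditType.DISAMBIGUATION: "Create disambiguation page for court judgments",
}


def summary_for(edit_type):
    """Return the shared summary for an edit type (KeyError if unmapped)."""
    return SUMMARIES[edit_type]


def distinct_summaries(edits):
    """Collect the distinct summaries a batch of (title, edit_type) pairs would use."""
    return {summary_for(edit_type) for _, edit_type in edits}
```

With this shape, the number of distinct summaries is bounded by the number of edit types, no matter how many pages are saved.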