
Make Wikidatawiki (mariadb SQL backend) horizontally scalable
Closed, Invalid · Public · Feature

Description

Feature summary (what you would like to be able to do and where):
We want high-quality, trustworthy, scalable infrastructure.
The Wikidata community should not have to worry about technical limits (1).

Wikidata should just scale according to the needs of the community. Right now there is obviously a need to grow.

I, for one, have wanted to import millions of citations from the Wikipedias since 2019, but have been holding back because of the WDQS scalability issues, which have now hopefully been effectively mitigated by the split.

Additionally, I would like to add millions of statements to existing citation items to improve them: authors (+1M), full-text URLs (+50M available via Unpaywall), Semantic Scholar IDs (+40M), etc.

There is also talk in the community about importing all the streets of entire countries, which could easily result in millions more items.

Some members would like to import all the chemicals in the world, and report that we are currently missing most of them: millions more items.

In short, the community wants a system like Wikipedia's, where they don't have to worry about catastrophic failures.

Funding shouldn't be a problem: the Wikipedias depend on Wikidata, and I very much assume the board is prepared to fund whatever it takes to make Wikidata scale and prevent catastrophic failures. The WMF also has millions in the bank, so money is not a likely bottleneck either.

The WMF should just fix whatever issues arise, without instilling fear in the community when it comes to the growth of both statements and items.

Technical solutions exist.

Telegram scales to 900 million users with a ton of data across 100+ million chats.
Facebook and Google have far more data and systems, and they still work reliably.

How hard can it be when you have 100 million dollars in the bank and the money to attract the sharpest kids on the block from all over the world?

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
A healthy Wikidata community that can focus on quality, completeness and serving the needs of other Wikipedia projects.

Right now we really can't, because a lot of people are afraid the system will break if they do X. This is counterproductive and has been the case since 2019, when I joined the project.
The WMF ultimately risks:

  1. losing contributors who are tired of getting their bot requests for notable items or improvements to existing items opposed and rejected out of fear;
  2. splitting the community: someone more capable might fork the whole project, fix the issues without involving the WMF, and invite people over to the new, working fork;
  3. the project being abandoned by its users over time and becoming stale.

Benefits (why should this be implemented?):
The whole Wikimedia ecosystem benefits. Nobody wants a broken Wikidata that cannot accept new items, a low-quality pile of data, or an unreliable backend with high latency.

Background:

Footnotes:
(1) This assumption is based on the recent mass-import policy discussion in the project chat and the fact that most contributors come from other wiki projects where "don't worry about breaking the system" is part of the mindset.

Event Timeline

Having a dedicated revision backend will make several tasks easier, e.g. T189412: Granular protection for Wikidata items and T217324: Have a more fine-grained history for property values on item pages. But there is much more to consider. For example, it would be bad to introduce a mandatory third-party database as a requirement for Wikibase installation.

P.S.

  • An alternative revision backend is not unfamiliar to the WMF: Flow used a dedicated revision backend. Yeah, it resulted in problems: T325222#8477180
  • Commons structured data is also something we need to handle: it is a Wikibase repo with a comparable number of entities to Wikidata.

Another idea is to introduce some sort of flat (RocksDB-like) secondary item store, so clients accessing Wikidata data can bypass the Wikidata database completely. This is like the parser cache, except that it stores the original entity data and is persistent. It does not reduce the size of the Wikidata database, but it will reduce the number/frequency of queries; so it does not solve the issue completely, but it may reduce many of the problems.
Querying wb_items_per_site in the forward direction (item to page) can be replaced with queries to the new secondary storage; accessing the statements of items can also use such storage. Page-to-item queries should use the page_props table instead.
Such secondary storage can naturally be split into multiple shards, since we only need to support key-to-value queries.
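As an illustrative sketch of that read path (the key scheme and client API below are assumptions for illustration, not an existing schema), the forward lookup becomes a plain key-value read, with statements, sitelinks, and terms stored as separate documents:

```python
import json

# Hypothetical key scheme: one document per entity per aspect, e.g.
#   "Q42/sitelinks"  -> {"enwiki": "Douglas Adams", ...}
#   "Q42/statements" -> {...}
# `kv` stands for any persistent key-value client with a get() method.

def sitelink_for(kv, entity_id: str, site_id: str) -> str | None:
    """Replace the forward wb_items_per_site query (item -> page)
    with a single key-value read that never touches the s8 database."""
    doc = kv.get(f"{entity_id}/sitelinks")
    if doc is None:
        return None
    return json.loads(doc).get(site_id)
```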

The ticket is too vague and too broad. It's like "Make Wikipedia Better". Horizontal scaling for what? Pages? Edits? Term store? Most importantly, there is already a solution for better scaling: Federation. The data of Wikidata should not all be stored in Wikidata. Citations should go to WikiCite so they can have their own database, which would make them horizontally scalable by default.

I have plans to work on improving horizontal scaling of MediaWiki (in general), but the work on it will start at least two years from now. First we need to do basic schema improvements and low-hanging-fruit interventions.

I have plans to work on improving horizontal scaling of MediaWiki (in general), but the work on it will start at least two years from now.

Yeah, what to do in the near term (i.e. 2024–2025) is described in T297633#7646661. However, there is another point to consider: s8 takes around 100,000 queries every second, so it is meaningful to analyze the read pattern of the Wikidata database. Some reads are not necessary or have alternatives (e.g. the item ID of a local page can be read from the page_props table instead of from Wikidata's wb_items_per_site table).
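For illustration, a minimal sketch of that substitution (the table and column names are the real MediaWiki/Wikibase ones; the connection handling is an illustrative assumption):

```python
import sqlite3  # stand-in for the wiki's actual DB connection

# Local read on the client wiki: page_props stores the linked item
# under the property name 'wikibase_item' (pp_value is e.g. 'Q42').
LOCAL_QUERY = """
    SELECT pp_value
    FROM page_props
    WHERE pp_page = ? AND pp_propname = 'wikibase_item'
"""

# The cross-wiki read this replaces, against the Wikidata (s8) database:
#   SELECT ips_item_id FROM wb_items_per_site
#   WHERE ips_site_id = 'enwiki' AND ips_site_page = 'Some page title';

def item_id_for_page(conn: sqlite3.Connection, page_id: int) -> str | None:
    """Return the item ID linked to a local page, without touching s8."""
    row = conn.execute(LOCAL_QUERY, (page_id,)).fetchone()
    return row[0] if row else None
```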

Above (T375352#10169051) I proposed some ways that this number can be reduced. By introducing a secondary item store (a persistent equivalent of the parser cache), we can move the reads from client wikis to a KV store that can scale horizontally.

So9q renamed this task from Make Wikidata horizontally scalable to Make Wikidatawiki horizontally scalable. Sep 26 2024, 10:24 AM

Thanks for the reply and reaction :)

The ticket is too vague and too broad. It's like "Make Wikipedia Better". Horizontal scaling for what? Pages? Edits? Term store? Most importantly, there is already a solution for better scaling: Federation.

Are you aware that federation does not work well? Even @Lydia_Pintscher does not believe in it, if I understood correctly. Also note that Wikibase federation of properties DOESN'T really work yet, and little effort seems to be put into it. Federating all scholarly articles, their authors, and related items is a monumental task.

I have not seen anyone willing to do the work on Wikibase that this would require, nor to fund it. So no, federation is not a viable solution to scaling wikidatawiki at this time. A counterexample is commonswiki, which does federate to Wikidata for depicts. That is very nice, but I'm not sure how much work it would take to make that happen for a dedicated WikiCite Wikibase, as many more properties would be needed. E.g. https://www.wikidata.org/wiki/Q60120004 has 4 properties that are needed from Wikidata (instance of, published in, DOI, abs-bibcode). A Commons-like implementation would require a lot of properties for external identifiers to be federated (seamlessly to the user, as in Commons).

Keeping the current master-and-replicas cluster model is not a good idea either, if you ask me, because wikidatawiki will not be able to scale 100x, or even 2x, from here: a single high-end server cannot keep everything in RAM (see https://www.wikidata.org/wiki/User:ASarabadani_(WMF)/Growth_of_databases_of_Wikidata).

This means that when (not if) the community wants to import all art, all chemicals, all named streets, etc., you will have the same issue, despite half the items already being offloaded to a WikiCite Wikibase. Are you going to make a new Wikistreets Wikibase, a Wikistars Wikibase, etc.? How would that affect a query for all scientific articles that mention a star? For every new Wikibase you have to change the federation, because suddenly the main subject (some star) is not in wikidatawiki anymore but has moved to the Wikistars Wikibase.

As you can probably see, this does not scale UI-wise. The users are going to rebel at some point against federating more than a few properties. We need MediaWiki/wikidatawiki to scale horizontally to work towards the WMF goal of sharing in the sum of all knowledge.

The data of Wikidata should not all be stored in Wikidata.

Says who? The community? Isn't this a community decision? Do WMF technicians decide what to store in enwiki? Which articles to include? How much the wiki can grow?

Citations should go to WikiCite so they can have their own database, which would make them horizontally scalable by default.

No, Wikibase cannot handle that enormous amount of data; take a look at https://www.wikidata.org/wiki/User:ASarabadani_(WMF)/Growth_of_databases_of_Wikidata and imagine importing all of OpenAlex into Wikibase. It is (currently) not designed to handle the amount of data and revisions that would take.

I have plans to work on improving horizontal scaling of MediaWiki (in general), but the work on it will start at least two years from now. First we need to do basic schema improvements and low-hanging-fruit interventions.

I understand. Is there a ticket for that future work?

So9q renamed this task from Make Wikidatawiki horizontally scalable to Make Wikidatawiki (mariadb SQL backend) horizontally scalable. Sep 26 2024, 10:44 AM

Currently all Wikidata properties are usable in Commons Metadata (though some are missing UI support). In a (long-term) future "fediverse" of Wikibase instances, any Wikibase instance should be able to use any properties/items from a predefined set of other Wikibase instances, e.g. Wikibase A can use properties from Wikibases B and C as well as its own. So I don't think the number of federated properties is a problem.

Personally, I don't think a federated WikiCite Wikibase is currently a viable permanent solution. There are collections of more than 200 million papers, more than the number of items Wikidata currently has, so Wikibase should at least be scalable to that number of items even if we set up a federated WikiCite Wikibase. Also, multiple Wikibase instances are more troublesome to manage and more difficult to query than one Wikidata, so I would personally suggest keeping scholarly paper data in Wikidata.

Currently all Wikidata properties are usable in Commons Metadata (though some are missing UI support).

Not all properties are usable in Commons. It's true that some aren't supported by the UI, but others cannot be added even using the API.

Are you aware that federation does not work well? Even @Lydia_Pintscher does not believe in it, if I understood correctly.

No, that is not what I believe. Quite the opposite. I've been preaching for more than 5 years that the Wikibase Ecosystem is a big part of the future of Wikidata and that not all data will live in Wikidata.
Wikidata came out of the semantic web community, which at its core is all about bringing data together from different places and making it work together. This is what we will do. And while federation, in some of its definitions, does not work yet, or not as well as we want, this is where resources will be put.

The data of Wikidata should not all be stored in Wikidata.

Says who? The community? Isn't this a community decision? Do WMF technicians decide what to store in enwiki? Which articles to include? How much the wiki can grow?

Editors have editorial control of the wikis, but that control operates within certain limits, such as technical feasibility. Please don't make this an us-vs-them debate.

Citations should go to WikiCite so they can have their own database, which would make them horizontally scalable by default.

No, Wikibase cannot handle that enormous amount of data; take a look at https://www.wikidata.org/wiki/User:ASarabadani_(WMF)/Growth_of_databases_of_Wikidata and imagine importing all of OpenAlex into Wikibase. It is (currently) not designed to handle the amount of data and revisions that would take.

Just to make sure that I understand you correctly, you're telling me to "take a look at" something I myself wrote?

Currently all Wikidata properties are usable in Commons Metadata (though some are missing UI support).

Not all properties are usable in Commons. It's true that some aren't supported by the UI, but others cannot be added even using the API.

Making them work is a much easier task than making MediaWiki able to query multiple clusters transparently. I understand it's not perfect, but that's the direction of horizontal scaling: not just because of the MySQL database, but also because of downstream systems (the query service) and the user experience (just search for anything in Wikidata; almost everything is flooded with unrelated results because of papers). In fact, MariaDB has scaled much better than other technical and non-technical components of Wikidata, but it also has its limits, and at different levels: for MariaDB, there are too many revisions, not necessarily too many items or triples, while for WDQS it's the other way around.

Is it meaningful to reduce the number or frequency of reads from the Wikidata database? For clarification, this means introducing a persistent equivalent of the parser cache; it should not have any impact on user-facing features. See T375352#10169051.

For clarification, this is what I proposed:
(1) Create several (16, for example) clusters of servers, and assign each entity to one of them (e.g. MD5('Q1') starts with 8, so assign Q1 to cluster 8).
(2) Add a job to update the store from Wikidata edits (note this store is secondary, i.e. it does not affect MediaWiki's revision storage, and it will only contain the most recent version of each entity).
(3) On each server, install some kind of K-V database. We can use MariaDB/memcached/OpenStack Swift/whatever, as long as it is persistent. Indexes are not required at all. We can also split statements, sitelinks, and terms into different documents.
(4) If a client wants the content of Q1, we just read it from that store, and the Wikidata database is not involved at all.
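A minimal end-to-end sketch of steps (1), (2), and (4), assuming 16 shards and a generic key-value client with get()/set() (the function names and the client API are illustrative, not an existing MediaWiki/Wikibase interface):

```python
import hashlib
import json

NUM_SHARDS = 16  # step (1): 16 clusters, addressed by one hex digit

def shard_for(entity_id: str) -> int:
    # MD5 the entity ID and use the first hex digit (0-15) as the
    # shard number, e.g. MD5('Q1') starting with '8' -> cluster 8.
    return int(hashlib.md5(entity_id.encode()).hexdigest()[0], 16)

def on_edit(kv_shards, entity_id: str, entity_data: dict) -> None:
    # Step (2): a job run after each Wikidata edit overwrites the
    # stored copy; the store is secondary and keeps only the latest
    # version, never touching MediaWiki's revision storage.
    kv_shards[shard_for(entity_id)].set(entity_id, json.dumps(entity_data))

def get_entity(kv_shards, entity_id: str) -> dict | None:
    # Step (4): clients read straight from the owning shard; the
    # main Wikidata (s8) database is not involved on this path.
    raw = kv_shards[shard_for(entity_id)].get(entity_id)
    return json.loads(raw) if raw is not None else None
```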

Is it meaningful to reduce the number or frequency of reads from the Wikidata database? For clarification, this means introducing a persistent equivalent of the parser cache; it should not have any impact on user-facing features. See T375352#10169051.

For clarification, this is what I proposed:
(1) Create several (16, for example) clusters of servers, and assign each entity to one of them (e.g. MD5('Q1') starts with 8, so assign Q1 to cluster 8).
(2) Add a job to update the store from Wikidata edits (note this store is secondary, i.e. it does not affect MediaWiki's revision storage, and it will only contain the most recent version of each entity).
(3) On each server, install some kind of K-V database. We can use MariaDB/memcached/OpenStack Swift/whatever, as long as it is persistent. Indexes are not required at all. We can also split statements, sitelinks, and terms into different documents.
(4) If a client wants the content of Q1, we just read it from that store, and the Wikidata database is not involved at all.

Here is the issue: I listed the biggest tables of Wikidata's database at https://www.wikidata.org/wiki/User:ASarabadani_(WMF)/Growth_of_databases_of_Wikidata#Biggest_tables.

The biggest table by far is revision, and there is no easy way to split it (what to do with the comment table? what to do with the many joins done on it? ...), so if we split, say, the term store, it'll save us some space but not much. We will move the term store to its own cluster later this year (or next year); see T351802: Wikibase: Introduce separate database configuration for term store. Reads on the term store are actually cached in memcached, which already does what you said transparently (but we still get a lot of reads on s8 because of the term store).

If you're talking about storing statements, we don't store them in the core DBs, and thus they are not a scalability concern for the MariaDB databases (WDQS is a different story).

Then you have the problem of how to split pagelinks, which is an extremely large table in Wikidata even though it's not used much.

I suggested some low-hanging fruit in https://www.wikidata.org/wiki/User:ASarabadani_(WMF)/Growth_of_databases_of_Wikidata#Solutions, e.g. we could limit pagelinks for properties to only 10,000 rows and not store the rest. That would cut the size of the pagelinks table in half (removing 70 GB from each replica). Finding all of this "junk" data that doesn't provide much value to users and getting rid of it can save us a lot of trouble and money without taking too much away from the users.
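To make the idea concrete, a hedged sketch of the kind of query that would find the properties exceeding such a cap (the 10,000-row threshold is from the suggestion above; namespace 120 is Wikidata's Property namespace, and the classic pl_namespace/pl_title columns are assumed, since the pagelinks schema varies across MediaWiki versions):

```python
# Illustrative only: count incoming links per property page and flag
# the ones whose pagelinks rows would be capped at 10,000.
OVERSIZED_PROPERTY_LINKS = """
    SELECT pl_title, COUNT(*) AS n_links
    FROM pagelinks
    WHERE pl_namespace = 120          -- Property: namespace on Wikidata
    GROUP BY pl_title
    HAVING COUNT(*) > 10000
    ORDER BY n_links DESC
"""
```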

Reads from disk are around 1,000 times slower than reads from memory, and s8 takes around 100,000 queries every second (~2,000 queries a second per replica).

What I mean: should we first try to reduce the number of reads, e.g. by migrating reads that depend on just the item (and not a specific revision) elsewhere (to a secondary NoSQL database), which would bypass the MariaDB database completely? This does not solve the issue of the size of the revision table, but it reduces the number of reads from it. Of course, doing so may have little benefit if most reads are not page/revision-related (i.e. they hit tables other than page, revision, slots, and content).

We rely heavily on memcached for reads; it absorbs around 99% of them, and still we had to have 40 replicas for s8 (if ~100,000 queries per second reach the databases after 99% are absorbed, the logical read rate is on the order of ten million per second). The amount of reads is really high. Of course we could look into making the caching more robust, but we have already done the low-hanging fruit.