
Make Wikidata horizontally scalable
Open, Needs Triage · Public · Feature

Description

Feature summary (what you would like to be able to do and where):
We want high-quality, trustworthy, scalable infrastructure.
The Wikidata community does not want to have to worry about technical limits (1).

Wikidata should just scale according to the needs of the community. Right now there is obviously a need to grow.

I, for one, have wanted to import millions of citations from the Wikipedias since 2019, but I have been holding back because of the WDQS scalability issues, which have now hopefully been effectively mitigated by the split.

Additionally, I would like to add millions of statements to existing citation items to improve them: authors (+1M), full-text URLs (+50M available via Unpaywall), Semantic Scholar IDs (+40M), etc.

There is also talk in the community about importing all the streets of entire countries, which could easily result in millions more items.

Some members would like to import all known chemicals and report that we are currently missing most of them, which would mean millions more items.

In short, the community wants a system like Wikipedia's, where they do not have to worry about catastrophic failures.

Funding shouldn't be a problem: the Wikipedias depend on Wikidata, and I very much assume the board is prepared to fund whatever it takes to make Wikidata scale and prevent catastrophic failures. The WMF also has millions in the bank, so money is not a likely bottleneck either.

The WMF should just fix whatever issues exist without trying to instill fear in the community when it comes to the growth of both statements and items.

Technical solutions exist.

Telegram scales to 900 million users with a ton of data across 100+ million chats.
Facebook and Google have far more data and systems and still work reliably.

How hard can it be when you have 100 million dollars in the bank and money to attract the sharpest kids on the block from a lot of countries?

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
A healthy Wikidata community that can focus on quality, completeness and serving the needs of other Wikipedia projects.

Right now we really can't, because a lot of people are afraid the system will break if they do X. This is counterproductive and has been the case since 2019, when I joined the project.
The WMF ultimately risks:

  1. losing contributors who are tired of having their bot requests for notable items or improvements to existing items opposed and rejected out of fear.
  2. splitting the community: someone more capable might fork the whole project, fix the issues without involving the WMF, and invite people over to the new working fork.
  3. the project being abandoned by its users over time and becoming stale.

Benefits (why should this be implemented?):
The whole Wikimedia ecosystem benefits. Nobody wants a broken Wikidata that does not accept new items, a low-quality pile of data, or an unreliable backend with high latency.

Background:

Footnotes:
(1) This assumption is based on the recent mass-import policy discussion in project chat and the fact that most contributors come from other wiki projects where "don't worry about breaking the system" is part of the mindset.

Event Timeline

Having a dedicated revision backend would make several tasks easier, e.g. T189412: Granular protection for Wikidata items and T217324: Have a more fine-grained history for property values on item pages. But there is much more to consider. For example, it would be bad to make a third-party database a mandatory requirement for Wikibase installations.
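
One way to reconcile a dedicated revision backend with third-party installs would be to hide the backend behind a pluggable interface, so that small installations keep the ordinary MediaWiki storage and only large ones opt into an external service. A minimal sketch of that idea (in Python for illustration; Wikibase itself is PHP, and all class names here are hypothetical):

```
from abc import ABC, abstractmethod
from typing import Optional

# Illustrative sketch only: none of these classes exist in Wikibase.
# The point is that a dedicated revision backend can sit behind an
# interface instead of being a mandatory extra database.

class EntityRevisionStore(ABC):
    """Storage for entity revisions, keyed by (entity ID, revision ID)."""

    @abstractmethod
    def get(self, entity_id: str, revision_id: int) -> Optional[dict]: ...

    @abstractmethod
    def put(self, entity_id: str, revision_id: int, data: dict) -> None: ...


class InProcessStore(EntityRevisionStore):
    """Stand-in for the default path, where revisions stay in the normal
    MediaWiki tables and no extra service is required."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, int], dict] = {}

    def get(self, entity_id, revision_id):
        return self._rows.get((entity_id, revision_id))

    def put(self, entity_id, revision_id, data):
        self._rows[(entity_id, revision_id)] = data


# A large installation could swap in a store backed by a dedicated,
# horizontally scalable service, while small installs keep InProcessStore.
store: EntityRevisionStore = InProcessStore()
store.put("Q42", 123, {"labels": {"en": "Douglas Adams"}})
assert store.get("Q42", 123) is not None
```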

P.S.

  • An alternative revision backend is not an unfamiliar thing for the WMF: Flow used a dedicated revision backend. Admittedly, it caused problems: T325222#8477180
  • Commons structured data is also something we need to handle: it is a Wikibase repo with a comparable number of entities to Wikidata.

Another idea is to introduce some sort of flat (RocksDB-like) secondary item store, so that clients accessing Wikidata data can bypass the Wikidata database completely. This is like the parser cache, but it stores the original entity data and is persistent. It does not reduce the size of the Wikidata database, but it would reduce the number and frequency of queries, so it does not solve the issue completely but may reduce many of the problems.
Querying the forward side of wb_items_per_site (item to page) can be replaced with queries to the new secondary storage; accessing the statements of items can also use such storage. Page-to-item queries should use the page_props table instead.
Such secondary storage can naturally be split into multiple shards, since we only need to support key-to-value queries (see the sketch below).
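
A minimal sketch of the sharded key-value idea, in Python with plain dicts standing in for the per-shard store (a real deployment might use RocksDB or similar; all names here are hypothetical):

```
import hashlib
import json
from typing import Optional

# Hypothetical sketch of the flat secondary entity store described above.
# Each shard only needs key -> value lookups, so entities can be spread
# over N shards by hashing the entity ID.

class ShardedEntityStore:
    def __init__(self, num_shards: int = 4) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, entity_id: str) -> dict:
        h = int(hashlib.sha1(entity_id.encode("utf-8")).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put_entity(self, entity_id: str, entity_json: dict) -> None:
        # Persist the original entity data (like the parser cache, but durable).
        self._shard_for(entity_id)[entity_id] = json.dumps(entity_json)

    def get_entity(self, entity_id: str) -> Optional[dict]:
        raw = self._shard_for(entity_id).get(entity_id)
        return json.loads(raw) if raw is not None else None

    def get_sitelink(self, entity_id: str, site: str) -> Optional[str]:
        # Forward lookup (item -> page), replacing the wb_items_per_site query.
        entity = self.get_entity(entity_id)
        if entity is None:
            return None
        return entity.get("sitelinks", {}).get(site, {}).get("title")

# Page -> item (the reverse direction) would not go through this store at all;
# as noted above, it would use the page_props table instead.

store = ShardedEntityStore(num_shards=4)
store.put_entity("Q64", {"sitelinks": {"enwiki": {"title": "Berlin"}}})
print(store.get_sitelink("Q64", "enwiki"))  # -> "Berlin"
```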