
Store revisions of wikibase items in an external triple store
Closed, Declined · Public

Description

Currently it is quite complicated to get data from wikibase. To get data regarding item Q5 one needs an SQL statement similar to

SELECT old_text
FROM page
JOIN revision ON page_id = rev_page
JOIN content ON rev_sha1 = content_sha1
JOIN text ON old_id = SUBSTR(content_address, 4)
WHERE page_title = 'Q5';

which is not very efficient. Then one has to parse the JSON and extract the data.
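
To illustrate the "parsing the JSON" step, here is a minimal sketch in Python. The blob below is a heavily truncated, illustrative stand-in for what the query would return as `old_text` (the real Q5 entity JSON is far larger, and the property values here are placeholders, not real Wikidata facts); the field names follow the public Wikibase entity JSON layout.

```python
import json

# Truncated, illustrative Wikibase entity blob, as it might come back
# in old_text from the query above.
old_text = json.dumps({
    "id": "Q5",
    "labels": {"en": {"language": "en", "value": "human"}},
    "claims": {
        "P31": [{
            "mainsnak": {
                "snaktype": "value",
                "property": "P31",
                "datavalue": {"type": "wikibase-entityid",
                              "value": {"id": "Q55983715"}},
            },
        }],
    },
})

entity = json.loads(old_text)

def label(entity, lang="en"):
    """Extract the label in the given language, if present."""
    return entity.get("labels", {}).get(lang, {}).get("value")

def claim_ids(entity, prop):
    """Collect the entity IDs used as values of one property."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        dv = statement["mainsnak"].get("datavalue", {})
        if dv.get("type") == "wikibase-entityid":
            ids.append(dv["value"]["id"])
    return ids

print(label(entity))             # → human
print(claim_ids(entity, "P31"))  # → ['Q55983715']
```

Even this trivial extraction needs to know the nested snak structure, which is part of why going through raw revision text is awkward for graph-style queries.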

I think it would be more efficient to use ExternalStore (manual) to store the revisions. Codesearch shows that the Flow extension used external storage in the past; however, it seems to have used a relational database.

Anyhow, it seems more natural to store wikibase items directly in a triple store and query that store (or a derivative of it), rather than keeping a separate triple store in sync with the latest revisions.

What am I missing?

Event Timeline

(1) The canonical data format of Wikibase entities is JSON, not RDF. RDF is actually the more redundant format: a value may be stored three times (as a truthy value, a simple statement value, and a full statement value) for querying purposes.
(2) The actual content of all pages/revisions in Wikimedia wikis is stored in a separate database cluster that can be treated as a flat (auto-assigned) key-to-value store; it still uses MariaDB but works like a NoSQL database. Only revision metadata is stored in the Wikidata database, i.e. the query above will only return a pointer to external storage on Wikimedia wikis (and will stop working soon, since such pointers are to be moved to the content table, bypassing the text table completely).
(3) Flow does use a dedicated revision backend that is cross-wiki and unrelated to the MediaWiki revision backend. Introducing a new revision backend like this for Wikidata is theoretically possible, but it would be a huge amount of work.
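
Point (2) is also why the query above uses `SUBSTR(content_address, 4)`: the content table stores an address such as `tt:12345` rather than the text itself, and stripping the three-character `tt:` prefix yields the local `text.old_id`; external-store pointers instead look like `DB://cluster/id`. A hedged sketch of that dispatch (the address shapes are simplified assumptions for illustration):

```python
def resolve_address(content_address):
    """Decide where a content_address points (simplified sketch)."""
    scheme, _, rest = content_address.partition(":")
    if scheme == "tt":
        # Local text table: the remainder is text.old_id.
        return ("text-table", int(rest))
    if scheme == "DB" and rest.startswith("//"):
        # External store: DB://cluster/blob_id
        cluster, _, blob_id = rest[2:].partition("/")
        return ("external-store", cluster, blob_id)
    raise ValueError(f"unknown address scheme: {content_address!r}")

print(resolve_address("tt:12345"))
print(resolve_address("DB://cluster25/98765"))
```

On Wikimedia production wikis only the second branch would fire, which is why the SQL above cannot return the actual entity text there.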

I will however support introducing RocksDB-like NoSQL database as a secondary storage of current version of Wikidata entities, in order to reduce the load (i.e. number of queries) of main Wikidata database (see T375352#10184517), but such database should be secondary in nature so it does not replace the MediaWiki revision system.
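
The "secondary in nature" property can be made concrete with a small read-through sketch. Here a plain dict stands in for the RocksDB-like store and `primary_fetch` stands in for a query against the main Wikidata database; both names are made up for illustration. The secondary store holds only derived data, so losing it costs extra primary queries, never correctness.

```python
import json

class SecondaryEntityStore:
    """Read-through cache for current entity revisions (sketch).

    A dict stands in for a RocksDB-like key-value store; the
    primary_fetch callable stands in for the main Wikidata database.
    """

    def __init__(self, primary_fetch):
        self._kv = {}
        self._primary_fetch = primary_fetch

    def get(self, entity_id):
        blob = self._kv.get(entity_id)
        if blob is None:
            # Miss: fall back to the primary store, then populate.
            blob = json.dumps(self._primary_fetch(entity_id))
            self._kv[entity_id] = blob
        return json.loads(blob)

    def invalidate(self, entity_id):
        """Called when a new revision is saved to the primary store."""
        self._kv.pop(entity_id, None)

calls = []
def fetch_from_primary(entity_id):
    calls.append(entity_id)
    return {"id": entity_id, "labels": {}}

store = SecondaryEntityStore(fetch_from_primary)
store.get("Q5")
store.get("Q5")
print(len(calls))  # → 1  (one primary query serves both reads)
```

Because MediaWiki remains the system of record, the secondary store can be rebuilt from scratch at any time.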

I like this idea.

We could also have an immutable RocksDB instance based on the latest dump and cache it with an LRU policy. That would be very cheap to serve compared to the live data, and most users don't really need the live data at all.
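
A minimal sketch of that layering, with a dict standing in for the RocksDB snapshot (the entity contents are placeholders, not real Wikidata values). Because the snapshot is immutable by construction, cached results can never go stale relative to the snapshot itself, so a plain LRU layer is safe:

```python
from functools import lru_cache

# Frozen snapshot built from the latest dump; never mutated after load.
DUMP_SNAPSHOT = {
    "Q5": {"id": "Q5", "labels": {"en": "human"}},
    "Q42": {"id": "Q42", "labels": {"en": "Douglas Adams"}},
}

@lru_cache(maxsize=100_000)
def get_entity(entity_id):
    # With a real RocksDB-backed snapshot this would be a disk read;
    # the LRU layer keeps the hot entities in memory.
    return DUMP_SNAPSHOT.get(entity_id)

print(get_entity("Q42")["labels"]["en"])
get_entity("Q42")  # second read is served from the LRU layer
print(get_entity.cache_info().hits)
```

Rebuilding the whole instance from the next dump replaces it wholesale, so there is no cache-invalidation problem at all.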

Week-old data is good enough for virtually all REST traffic that doesn't come from editors doing curation (my guess; I don't think this has ever been explored by WMDE). A single beefy server could probably serve most, if not all, of the current REST requests to the live database.

But for this to work we would need to nudge users towards the cheap, time-lagged endpoints.

This is even true for requests coming from Wikipedia. We could easily change all lookups via Lua to use this cached, time-lagged endpoint instead and inform users accordingly.
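
The nudge could be as simple as making the lagged endpoint the default and requiring callers to opt in to live reads. A hypothetical routing sketch (both endpoint URLs are made up for illustration):

```python
# Hypothetical endpoints; real URLs would be decided by WMDE/WMF.
LAGGED_ENDPOINT = "https://lagged.example.org/entity/"
LIVE_ENDPOINT = "https://live.example.org/entity/"

def entity_url(entity_id, *, need_live=False):
    """Route reads to the cheap snapshot unless live data is required,
    e.g. for an editor doing curation."""
    base = LIVE_ENDPOINT if need_live else LAGGED_ENDPOINT
    return base + entity_id

print(entity_url("Q5"))                  # lagged by default
print(entity_url("Q5", need_live=True))  # opt-in live read
```

Defaulting to the snapshot is what actually shifts the load; an opt-in flag keeps curation workflows on live data.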

I have to admit that I don't fully understand how RocksDB would help Wikibase specifically. We are looking into ways to continue supporting Wikibase knowledge graph queries. While one might switch from SPARQL to another language, the semantics lie in the links between things. Storing the revision as text (either JSON or RDF) therefore seems suboptimal, an artifact of the much more limited revision-storage capabilities MediaWiki had when Wikidata was launched.
This is why I think we should investigate whether there are better ways to store the information. Even MySQL can store the structure much better; for example, the pagelinks table almost duplicates the revision information.

We discussed this at the search office hour, and it is not something that is planned. It might be done as a research project, but it is unlikely to happen.