What is this task?
This task is used for planning and organizing only. To comment on the project or discuss it, please use one of the linked tasks instead.
Description of the main objective:
WDQS is reaching its limits (T335067) and we will soon need to split Wikidata's Blazegraph database (T337013) so that the service can continue functioning. The tasks in this epic will help us decide how best to split the database.
There are two conflicting goals:
- We want to tell our users a simple story about where they can query which information.
- We want the resulting queries to be simple (ideally not requiring federation).
Milestone 1: Initial test of basic approaches for splitting Blazegraph into several subgraphs
- T337020: [Analytics] Understand the size and connectedness of Wikimedia-internal concepts in the Wikidata graph
- T337021: [Analytics] Find out size of term subgraph
- T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article)
Milestone 2: Investigations based on initial tests
- T342123: [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses)
Out of scope:
- Usage data
  - There is no data (T292152#7395796)
- Authors who are only relevant for scientific papers
Data source:
- The main source for this is discovery.wikibase_rdf
  - This data is a processed version of the Wikibase RDF dumps, adapted for WDQS (see https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences).
  - Most notably, the TTL dumps contain each label three times under three different predicates; we keep only one version.
- For querying the dataset, we are advised to use Spark from a Jupyter notebook running on a stat box (see https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter)
- We can get support with understanding the triple table via the Search Platform IRC channel (Wikimedia-Search)
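As a sketch of the kind of query involved (e.g. T342111, counting direct instances of Q13442814), the core filtering logic over a triple table can be illustrated on an in-memory sample. This is a hypothetical illustration, not the actual analysis code: the (subject, predicate, object) column layout and the exact URI forms are assumptions to verify against the real discovery.wikibase_rdf schema.

```python
# Hypothetical sketch: counting direct (P31) instances of Q13442814
# (scholarly article) over (subject, predicate, object) rows. On a stat
# box the same filter would run as Spark SQL against discovery.wikibase_rdf;
# the URI forms and column names here are assumptions, not the table schema.

P31 = "<http://www.wikidata.org/prop/direct/P31>"          # "instance of"
SCHOLARLY_ARTICLE = "<http://www.wikidata.org/entity/Q13442814>"
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

sample_triples = [
    ("<http://www.wikidata.org/entity/Q1>", P31, SCHOLARLY_ARTICLE),
    ("<http://www.wikidata.org/entity/Q1>", RDFS_LABEL, '"A paper"@en'),
    ("<http://www.wikidata.org/entity/Q2>", P31,
     "<http://www.wikidata.org/entity/Q5>"),
]

def count_direct_instances(rows, cls):
    """Count distinct subjects that are direct (P31) instances of cls."""
    return len({s for s, p, o in rows if p == P31 and o == cls})

print(count_direct_instances(sample_triples, SCHOLARLY_ARTICLE))  # 1
```

On the cluster, the equivalent would be roughly a Spark SQL `SELECT COUNT(DISTINCT subject) FROM discovery.wikibase_rdf WHERE predicate = ... AND object = ...`, modulo the table's actual column names and partitioning.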