Problem:
As Wikidata PMs we need to better understand how much of Wikidata's graph consists of the scholarly articles subgraph, in order to make a good decision about how to split the Blazegraph database.
Questions:
What is the size of the set of direct instances of Q13442814 (scholarly article) in Wikidata?
- # of triples
- % of triples
- # of Items (optional)
- % of Items (optional)
How the data will be used:
- see T337799
What difference will these insights make:
- see T337799
Notes:
- The most recent numbers that we can get will do.
Open questions:
- The triple table follows a different logic than Wikibase tables. What is the exact definition of triples that we should include in the counts?
Assignee Planning
Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.
Sub Tasks
Full breakdown of the steps to complete this task:
- Define tables to be used below
- Derive aggregate and percentage data
- The metrics below are for the date 20230717 in the discovery.wikibase_rdf table. An HTML version of the work for this task can be found here.
- Derive # of triples
  - Total Wikidata triples in discovery.wikibase_rdf: 15,043,483,216
  - Total direct SA triples: 7,188,746,257
  - Total SA triples with vals and refs that are NOT unique to SAs: 7,521,423,558
  - Total SA triples with vals and refs that ARE unique to SAs: 7,521,225,975
- Derive % of triples
  - Percent direct SA triples: 47.7864%
  - Percent SA triples with vals and refs that are NOT unique to SAs: 49.9979%
  - Percent SA triples with vals and refs that ARE unique to SAs: 49.9966%
- Derive # of Items
  - Total distinct QIDs: 108,265,975
  - Total SA QIDs: 40,300,769
- Derive % of Items
  - Percent SA QIDs: 37.2239%
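
The percentages above follow directly from the reported counts. As a quick sanity check, here is a minimal Python sketch that recomputes them. The variable and function names are illustrative and are not taken from the work done for this task.

```python
# Counts as reported above (date 20230717 in discovery.wikibase_rdf).
TOTAL_TRIPLES = 15_043_483_216
SA_TRIPLE_COUNTS = {
    "direct SA triples": 7_188_746_257,
    "SA triples with vals and refs NOT unique to SAs": 7_521_423_558,
    "SA triples with vals and refs unique to SAs": 7_521_225_975,
}
TOTAL_QIDS = 108_265_975
SA_QIDS = 40_300_769


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to four decimals."""
    return round(100 * part / whole, 4)


# Share of all Wikidata triples for each way of counting SA triples.
triple_percentages = {
    label: percent(count, TOTAL_TRIPLES)
    for label, count in SA_TRIPLE_COUNTS.items()
}
# Share of all distinct QIDs that are scholarly articles.
item_percentage = percent(SA_QIDS, TOTAL_QIDS)

for label, pct in triple_percentages.items():
    print(f"Percent {label}: {pct}%")
print(f"Percent SA QIDs: {item_percentage}%")
```

Rounding to four decimals matches the precision of the figures reported in the list above.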
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- discovery.wikibase_rdf will be used for the aggregate and percentage values for triples
- wmf.wikidata_entity will be used for the aggregate and percentage values for Items
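
To illustrate how the two tables divide the work, the following Python sketch counts triples and Items on a handful of toy rows. It assumes that each row of the triple table carries context, subject, predicate and object fields, that a triple belongs to the entity named in its context, and that direct instances are identified through the P31 (instance of) predicate. These are assumptions made for illustration only, not the confirmed definition used in this task (see the open question above), and all toy data and helper names are invented.

```python
SCHOLARLY_ARTICLE = "Q13442814"
INSTANCE_OF = "P31"

# Invented toy rows in the ASSUMED shape (context, subject, predicate, object).
toy_triples = [
    ("Q100", "Q100", "P31", "Q13442814"),
    ("Q100", "Q100", "P1476", "A toy paper title"),
    ("Q100", "Q100", "P50", "Q300"),
    ("Q200", "Q200", "P31", "Q5"),
    ("Q200", "Q200", "P21", "Q6581097"),
    ("Q300", "Q300", "P31", "Q5"),
]
# Invented toy list standing in for the Item IDs of the entity table.
toy_item_ids = ["Q100", "Q200", "Q300"]


def direct_instances(triples, class_id):
    """Return subjects that have a direct instance-of triple for class_id."""
    return {
        subject
        for (_context, subject, predicate, obj) in triples
        if predicate == INSTANCE_OF and obj == class_id
    }


def count_triples_in_contexts(triples, contexts):
    """Count triples whose context is one of the given entities."""
    return sum(1 for (context, *_rest) in triples if context in contexts)


sa_items = direct_instances(toy_triples, SCHOLARLY_ARTICLE)
sa_triple_count = count_triples_in_contexts(toy_triples, sa_items)
sa_item_count = sum(1 for qid in toy_item_ids if qid in sa_items)

print(f"SA triples: {sa_triple_count} of {len(toy_triples)}")
print(f"SA Items: {sa_item_count} of {len(toy_item_ids)}")
```

On the real data the same two steps would run as queries against the tables listed above rather than over in-memory lists.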
Notes and Questions
Things that came up during the completion of this task, questions to be answered, and follow-up tasks:
- Related task: T337021: [Analytics] Find out size of term subgraph
- Related task: T342123: [Analytics] Find out the size of the Q13442814 (scholarly article) subgraph (including instances of subclasses)
- Prior related task: T281854: Get baseline measurements/expectations for splitting scholarly articles from Wikidata
- Prior related analysis: Wikidata_Scholarly_Articles_Subgraph_Analysis
- See: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Analysis.
- See: Preparations for WDQS graph-splitting