Problem:
As Wikidata PMs we need to understand better how much of Wikidata's graph consists of the scholarly articles subgraph, to make a good decision about splitting the Blazegraph database.
Questions:
- Is the ontology clean enough to include all subclasses of Q13442814 (scholarly article) or does that lead to unexpected results?
- What is the size of the instances of Q13442814 (scholarly article) including all instances of only the subclasses that AKhatun used? (out of scope, see T342123#9076994)
- What is the size of the instances of Q13442814 (scholarly article) including all instances of all direct (wdt:P279) subclasses?
- # of triples
- % of triples
- # of Items (optional)
- % of Items (optional)
How the data will be used:
- see T337799
What difference will these insights make:
- see T337799
Notes:
- The most recent numbers that we can get will do.
Assignee Planning
Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.
Sub Tasks
Full breakdown of the steps to complete this task:
- Define tables to be used below
- Base investigation of the subclasses of Q13442814
- Find the subclasses that were considered by AKhatun and add them to the task's description
- List the subclasses of scholarly article so we can have an overview of what's being included
- Note that individual papers are at times mistakenly listed as subclasses of scholarly article, but these errors are fixed quickly
- Results from https://w.wiki/7DJX on 8/8/23:
- doctoral thesis: Q187685
- working paper: Q1228945
- eprint: Q1347686
- systematic review: Q1504425
- A-publication: Q2774197
- Realist Evaluation: Q7301211
- retraction notice: Q7316896
- review article: Q7318358
- scientific conference paper: Q10885494
- research article: Q15706459
- academic journal article: Q18918145
- expression of concern editorial notice: Q56478376
- classical article: Q58898396
- Corrected and Republished Article: Q58900805
- historical article: Q58901470
- introductory journal article: Q58902427
- survey article: Q60535861
- medical scholarly article: Q82969330
- opinion paper: Q92998777
- research commentary: Q93003322
- executable paper: Q99770806
- scoping review: Q101116078
- legal article: Q108196115
- scholarly letter/reply: Q110716513
- reply paper: Q114413783
- sleeping beauty: Q115528532
- prince: Q115546988
- Derive aggregate and percentage data for all direct (wdt:P279) subclasses of Q13442814
- Derive # of triples
- Total Wikidata triples in discovery.wikibase_rdf: 15,043,483,216
- Total direct scholarly article (SA) and subclass triples: 7,196,453,128
- Total SA and subclass triples with values and references that are NOT unique to SAs and subclasses: 7,529,369,734
- Total SA and subclass triples with values and references that ARE unique to SAs and subclasses: 7,529,130,686
- Derive % of triples
- Percent direct SA and subclass triples: 47.8377%
- Percent SA and subclass triples with values and references that are NOT unique to SAs and subclasses: 50.0507%
- Percent SA and subclass triples with values and references that ARE unique to SAs and subclasses: 50.0491%
- Derive # of Items
- Total distinct QIDs: 108,265,975
- Total SA and subclass QIDs: 40,403,721
- Derive % of Items
- Percent SA and subclass QIDs: 37.3189%
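As a cross-check, the percentages above can be recomputed from the reported counts. The following is a minimal sketch using only the numbers listed in this task; the variable and function names are illustrative and not taken from the original analysis.

```python
# Counts copied from the results listed in this task.
TOTAL_TRIPLES = 15_043_483_216
DIRECT_SA_TRIPLES = 7_196_453_128
SA_TRIPLES_NON_UNIQUE_VALS_REFS = 7_529_369_734
SA_TRIPLES_UNIQUE_VALS_REFS = 7_529_130_686
TOTAL_QIDS = 108_265_975
SA_QIDS = 40_403_721


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


rows = [
    ("Direct SA and subclass triples", DIRECT_SA_TRIPLES, TOTAL_TRIPLES),
    ("SA triples incl. non-unique values and references",
     SA_TRIPLES_NON_UNIQUE_VALS_REFS, TOTAL_TRIPLES),
    ("SA triples incl. only unique values and references",
     SA_TRIPLES_UNIQUE_VALS_REFS, TOTAL_TRIPLES),
    ("SA and subclass QIDs", SA_QIDS, TOTAL_QIDS),
]

for label, part, whole in rows:
    print(f"{label}: {percent(part, whole):.4f}%")

# Gap between the two value/reference counts, i.e. the triples that are
# included only when shared values and references are counted as well.
shared_only = SA_TRIPLES_NON_UNIQUE_VALS_REFS - SA_TRIPLES_UNIQUE_VALS_REFS
print(f"Triples counted only in the non-unique total: {shared_only:,}")
```

The two value/reference totals differ by only a small number of triples compared with the overall counts, which is why their percentages agree to two decimal places.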
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- The discovery.wikibase_rdf table will be used to derive the aggregate counts and percentages of triples and Items
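To illustrate what "direct SA and subclass triples" means, here is a small in-memory sketch of the counting logic on toy data. It is not the query that was run against discovery.wikibase_rdf: the toy item names, the URI formatting, and the choice to match triples by subject are all assumptions made for illustration.

```python
# Assumption: triples are (subject, predicate, object) tuples of URI strings.
WDT = "http://www.wikidata.org/prop/direct/"
WD = "http://www.wikidata.org/entity/"

P31 = f"<{WDT}P31>"      # instance of
P279 = f"<{WDT}P279>"    # subclass of
P1476 = f"<{WDT}P1476>"  # title
SCHOLARLY_ARTICLE = f"<{WD}Q13442814>"


def entity(qid: str) -> str:
    """Format an entity ID as a bracketed URI (formatting is an assumption)."""
    return f"<{WD}{qid}>"


def scholarly_article_classes(triples):
    """Q13442814 plus its direct (wdt:P279) subclasses."""
    direct = {s for s, p, o in triples if p == P279 and o == SCHOLARLY_ARTICLE}
    return {SCHOLARLY_ARTICLE} | direct


def scholarly_article_items(triples):
    """Items that are instances of Q13442814 or of a direct subclass."""
    classes = scholarly_article_classes(triples)
    return {s for s, p, o in triples if p == P31 and o in classes}


def count_direct_triples(triples):
    """Count triples whose subject is a scholarly article item."""
    items = scholarly_article_items(triples)
    return sum(1 for s, _, _ in triples if s in items)


# Toy data: the three TOY_* items are hypothetical, not real Wikidata Items.
toy_triples = [
    (entity("Q187685"), P279, SCHOLARLY_ARTICLE),     # doctoral thesis
    (entity("TOY_PAPER"), P31, SCHOLARLY_ARTICLE),
    (entity("TOY_PAPER"), P1476, '"A toy paper"'),
    (entity("TOY_THESIS"), P31, entity("Q187685")),
    (entity("TOY_THESIS"), P1476, '"A toy thesis"'),
    (entity("TOY_CITY"), P31, entity("Q515")),         # city, not an article
]

sa_triples = count_direct_triples(toy_triples)
print(f"{sa_triples} of {len(toy_triples)} toy triples belong to SA items")
```

In this sketch the subclass declaration itself (Q187685 subclass of Q13442814) is not counted, since its subject is a class rather than an instance. Values and references would need additional handling that is not shown here.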
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- Related task: T337021: [Analytics] Find out size of term subgraph
- Related task: T342111: [Analytics] Find out the size of direct instances of Q13442814 (scholarly article)
- Prior related task: T281854: Get baseline measurements/expectations for splitting scholarly articles from Wikidata
- Prior related analysis: Wikidata_Scholarly_Articles_Subgraph_Analysis
- See: https://wikitech.wikimedia.org/wiki/User:AKhatun/Wikidata_Subgraph_Analysis
- See: Preparations for WDQS graph-splitting