Problem:
As Wikidata PMs, we need to better understand how much of Wikidata's graph consists of Labels, Descriptions, and Aliases, so that we can make a good decision about how to split the Blazegraph database.
Questions:
- # of triples that describe Labels
- % of triples that describe Labels
- # of triples that describe Descriptions
- % of triples that describe Descriptions
- # of triples that describe Aliases
- % of triples that describe Aliases
How the data will be used:
- see T337799
What difference will these insights make:
- see T337799
Notes:
- The most recent numbers that we can get will do.
Assignee Planning
Information below this point is filled out by WMDE Analytics and specifically the assignee of this task.
Sub Tasks
Full breakdown of the steps to complete this task:
- Look into prior research on this topic
- Define tables to be used below
- Derive total triples
- 2023-07-10: 15,033,775,713
- 2023-07-19: 15,043,046,814
- Aggregate total and percent for labels
- 2023-07-10:
- Total: 801,847,766
- Percent: 5.334%
- 2023-07-19:
- Total: 802,163,906
- Percent: 5.332%
- Aggregate total and percent for descriptions
- 2023-07-10:
- Total: 2,877,509,113
- Percent: 19.140%
- 2023-07-19:
- Total: 2,878,727,304
- Percent: 19.137%
- Aggregate total and percent for aliases
- 2023-07-10:
- Total: 178,352,219
- Percent: 1.186%
- 2023-07-19:
- Total: 178,333,657
- Percent: 1.185%
- Putting results/process in a public place for future reference
- Where would this ideally be?
- github.com/wmde/wmde-analytics has been created
- @AndrewTavis_WMDE: I'll add this to the repo later when I add backlog tasks
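For future reference, the percentages above can be rechecked from the reported totals. The following is a minimal sketch, not the code used for this task: the names `TOTAL_TRIPLES`, `TERM_TRIPLES`, `percent_of_total` and `term_shares` are made up for illustration, and the figures are copied from the sub tasks list.

```python
# Totals copied from the sub tasks list above (snapshot date -> triple count).
TOTAL_TRIPLES = {
    "2023-07-10": 15_033_775_713,
    "2023-07-19": 15_043_046_814,
}

# Triples per term type, keyed by snapshot date.
TERM_TRIPLES = {
    "labels": {"2023-07-10": 801_847_766, "2023-07-19": 802_163_906},
    "descriptions": {"2023-07-10": 2_877_509_113, "2023-07-19": 2_878_727_304},
    "aliases": {"2023-07-10": 178_352_219, "2023-07-19": 178_333_657},
}


def percent_of_total(count, total, digits=3):
    """Return count as a percentage of total, rounded to `digits` decimals."""
    return round(100 * count / total, digits)


def term_shares(snapshot):
    """Return the percentage of all triples per term type for one snapshot."""
    total = TOTAL_TRIPLES[snapshot]
    return {
        kind: percent_of_total(counts[snapshot], total)
        for kind, counts in TERM_TRIPLES.items()
    }


if __name__ == "__main__":
    for snapshot in TOTAL_TRIPLES:
        print(snapshot, term_shares(snapshot))
```

Rounded to three decimals, the recomputed shares agree with the percentages reported for both snapshots.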
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- The discovery.wikibase_rdf table will be used for this
- The schema for this table is not documented on Wikitech, but anyone with access to the analytics cluster can query it (as of 2023-07-17)
- The table includes subject-predicate-object relationships for Wikibase instances including Wikidata
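As a rough illustration of how subject-predicate-object rows could be bucketed into labels, descriptions and aliases, here is a minimal sketch. It assumes the predicate IRIs of the standard Wikibase RDF mapping (rdfs:label, skos:prefLabel and schema:name for labels, schema:description for descriptions, skos:altLabel for aliases) and assumes predicates may be wrapped in angle brackets. Whether the totals in this task counted all three label predicates is not stated here, and the helper names are hypothetical.

```python
from collections import Counter

# Assumption: predicate IRIs from the standard Wikibase RDF mapping.
# Labels are emitted under three predicates in that mapping.
TERM_PREDICATES = {
    "http://www.w3.org/2000/01/rdf-schema#label": "label",
    "http://www.w3.org/2004/02/skos/core#prefLabel": "label",
    "http://schema.org/name": "label",
    "http://schema.org/description": "description",
    "http://www.w3.org/2004/02/skos/core#altLabel": "alias",
}


def classify_predicate(predicate):
    """Map a predicate IRI to 'label', 'description', 'alias' or 'other'."""
    # Assumption: IRIs may be stored as "<...>", so strip the brackets first.
    iri = predicate.strip().strip("<>")
    return TERM_PREDICATES.get(iri, "other")


def count_term_triples(triples):
    """Count (subject, predicate, object) triples per term type."""
    return Counter(classify_predicate(predicate) for _, predicate, _ in triples)


if __name__ == "__main__":
    # Hand-written sample rows, not data from the table.
    sample = [
        ("<http://www.wikidata.org/entity/Q42>",
         "<http://www.w3.org/2000/01/rdf-schema#label>", '"Douglas Adams"@en'),
        ("<http://www.wikidata.org/entity/Q42>",
         "<http://schema.org/description>", '"English writer"@en'),
        ("<http://www.wikidata.org/entity/Q42>",
         "<http://www.w3.org/2004/02/skos/core#altLabel>", '"Douglas Noel Adams"@en'),
        ("<http://www.wikidata.org/entity/Q42>",
         "<http://www.wikidata.org/prop/direct/P31>",
         "<http://www.wikidata.org/entity/Q5>"),
    ]
    print(count_term_triples(sample))
```

On the cluster the same grouping would more likely be written as a query over the table's predicate column. The column names are left out here because the schema is undocumented.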
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- Related tasks:
- Prior analysis on this has been done in the following places:
- See in general work by: