Purpose
For T362849: [Analytics] Items that contain a sitelink to one of the Wikimedia projects over time, we need to calculate historical segments of Wikidata's Items based on their relation to sitelinks.
Purpose from that ticket:
As Wikidata Product Managers, we would like to understand how different segments of Wikidata's data developed over time, so we can inform our projections.
This task covers the historical data needed to achieve this.
Scope
From T362849:
How did the number of Items of the following types develop over time?
A) Items that contain a sitelink to one of the Wikimedia projects (e.g. about a notable person)
B) Items that are needed to build A (used in A Items, for example in a statement or reference; e.g. the non-notable father of that notable person)
C) All other Items
- To make this possible, T363451: Add job to create Wikidata partition to wmf.mediawiki_wikitext_history was created to recreate the Wikidata partition of wmf.mediawiki_wikitext_history
- Once T363451 is complete, work can begin on using this partition to generate all data from the creation of Wikidata up to the most recent weekly data generated by the DAG created in T362849
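To make the A/B/C definitions concrete, here is a minimal Python sketch of how the Items in a single snapshot could be assigned to segments. It is an illustration only, not the method used by the DAG from T362849. The function names (`classify_items`, `count_segments`) and the input fields (`sitelinks`, `used_items`) are hypothetical, and it assumes that B only covers Items used directly by an A Item, not Items reached through a chain of uses.

```python
from collections import Counter


def classify_items(items):
    """Assign each Item ID in one snapshot to segment "A", "B" or "C".

    `items` maps an Item ID to a dict with two hypothetical fields:
      "sitelinks":  set of wiki names the Item has a sitelink to
      "used_items": set of Item IDs used in its statements or references
    """
    # A: Items with at least one sitelink.
    segment_a = {qid for qid, item in items.items() if item["sitelinks"]}

    # B: Items used directly by an A Item that are not in A themselves.
    # Assumption: only direct use counts, not chains of use.
    used_by_a = set()
    for qid in segment_a:
        used_by_a |= items[qid]["used_items"]
    segment_b = (used_by_a & set(items)) - segment_a

    # C: everything that is neither A nor B.
    return {
        qid: "A" if qid in segment_a else "B" if qid in segment_b else "C"
        for qid in items
    }


def count_segments(items):
    """Return the number of Items per segment for one snapshot."""
    counts = Counter(classify_items(items).values())
    return {segment: counts.get(segment, 0) for segment in ("A", "B", "C")}


# Toy snapshot: Q1 and Q4 have sitelinks, Q2 is used by Q1, Q3 is unrelated.
snapshot = {
    "Q1": {"sitelinks": {"enwiki"}, "used_items": {"Q2", "Q4"}},
    "Q2": {"sitelinks": set(), "used_items": set()},
    "Q3": {"sitelinks": set(), "used_items": {"Q2"}},
    "Q4": {"sitelinks": {"dewiki"}, "used_items": set()},
}
print(count_segments(snapshot))  # {'A': 2, 'B': 1, 'C': 1}
```

Running the same count on one snapshot per week would give the weekly series described under Desired Output. Q4 stays in A even though Q1 uses it, because a sitelink takes precedence over being used by another A Item in this sketch.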
Desired Output
- Weekly stats of the number of Items in categories A, B and C
Acceptance criteria:
- Weekly historical breakdowns of populations A, B and C
- These should be available in the Data Lake and in the published datasets
Information below this point is filled out by the Wikidata Analytics team.
General Planning
Information is filled out by the analytics product manager.
Assignee Planning
Information is filled out by the assignee of this task.
Estimation
Estimate:
Actual:
Sub Tasks
Full breakdown of the steps to complete this task:
- Step
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
Notes and Questions
Things that came up during the completion of this task, questions to be answered, and follow-up tasks:
- Note