Purpose
As Wikidata Product Managers, we would like to understand how different segments of Wikidata's data developed over time, so we can inform our projections.
Scope
- How did the number of Items of the following types develop over time?
- A) Items that contain a sitelink to one of the Wikimedia projects (e.g. about a notable person)
- B) Items that are needed to build A (used in A Items for example in a statement or reference; e.g. the non-notable father of that notable person)
- C) All other Items
Desired output
- Weekly stats of the number of Items in categories A, B and C
Acceptance criteria
- Current numbers for A, B and C to verify the approach
- Historic weekly data for A, B, and C (for all snapshots that are available); see T363583
- Weekly data process to capture new data
- Waiting on confirmation to transfer notebook Spark SQL to jobs codebase
- Public output of data in the form of a table (and ideally diagram)
- Ready within the already written DAG
- Diagram not included here as it's blocked by current infrastructure
Information below this point is filled out by the Wikidata Analytics team.
General Planning
Information is filled out by the analytics product manager.
Assignee Planning
Information is filled out by the assignee of this task.
Estimation
Estimate: 3 days
Actual: ~4 days (scope change)
Sub Tasks
Full breakdown of the steps to complete this task:
- Explore wmf.wikidata_entity
- Derive method of determining items in Population A
- Derive method of determining items in Population B
- Combine Population A and B methods into a single query that allows for the creation of Population C
- Write and test DAG
- Write and test DAG jobs
- Deploy and check DAG
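The A, B and C split from the Scope section can be illustrated with a minimal Python sketch on toy data. This is not the notebook's Spark SQL and does not use the real wmf.wikidata_entity schema: the record layout (`sitelinks`, `references`), the sample QIDs and the `classify` helper are all hypothetical. It also assumes that B only counts Items used directly by an A Item (no transitive closure), and that an Item with a sitelink is always counted as A even if another A Item uses it.

```python
# Hypothetical toy records, NOT the wmf.wikidata_entity schema.
# "sitelinks": Wikimedia projects the Item links to.
# "references": Item IDs used in the Item's statements or references.
ITEMS = {
    "Q1": {"sitelinks": ["enwiki"], "references": ["Q2"]},  # notable person
    "Q2": {"sitelinks": [], "references": []},               # father, used by Q1
    "Q3": {"sitelinks": [], "references": ["Q4"]},           # no sitelink, unused by A
    "Q4": {"sitelinks": [], "references": []},               # used only by Q3
}


def classify(items):
    """Split Item IDs into populations A, B and C (assumed definitions)."""
    all_ids = set(items)

    # A: Items with at least one sitelink.
    pop_a = {qid for qid, item in items.items() if item["sitelinks"]}

    # B: Items used directly by an A Item that are not themselves in A.
    used_by_a = {ref for qid in pop_a for ref in items[qid]["references"]}
    pop_b = (used_by_a & all_ids) - pop_a

    # C: everything else.
    pop_c = all_ids - pop_a - pop_b

    return {"A": pop_a, "B": pop_b, "C": pop_c}


populations = classify(ITEMS)
counts = {name: len(ids) for name, ids in populations.items()}
print(counts)
```

Running the same classification once per weekly snapshot would give one row of counts per snapshot. Q4 lands in C here because only a non-A Item uses it; whether such indirect uses should count towards B is a definition question the sketch does not settle.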
Data to be used
See Analytics/Data_Lake for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
Notes and Questions
Things that came up during the completion of this task, questions to be answered and follow-up tasks:
- Note