Project Goal
Explore Commons Impact Metrics data dumps in local environment and create the docs so other people can recreate it.
Things we can try
- Install an open-source database
- Download the Commons Impact Metrics dumps
- Import the data into the database
- Write queries and explore the data
- Install an open-source data visualization software
- Create dashboards about Commons Impact Metrics
- Bonus 1: Plot the category graph
- Bonus 2: Set up a convenient docker image and publish it
- We'll probably encounter issues and solve them together!
Possible Outcome
A documented process for setting up local environments for querying and visualizing Commons Impact Metrics data.
Here's the link to the notes document: https://docs.google.com/document/d/1qDx9-clzQGa6b3egCNc8jK0FSY1A0x6a75dNaYCTinc/edit
What we did:
Dumps download
We all were able to download the dumps manually without issues.
The bigger files took some significant time to download, but this depends highly on the user's bandwith.
Overall it seemed like an OK experience.
Import to database
We managed to import the data into DuckDB, SQLite, MySQL and MariaDB.
In general, we saw no critical issues.
Although, for MariaDB we had to spend some time adjusting collation configs to make it work.
A detail: Some relational databases do not support array types, and we needed to ingest the categories and primary_categories fields as strings.
Querying
In general the querying experience was OK.
Smaller datasets responded quickly without the use of indexes.
Queries on larger datasets took more time.
Unfortunately, the data model does not help to create indexes, since we'd want to create an index on the primary_categories field which is an array.
Thus, an alternative would be to pre-filter the data by primary_category; this would also help reduce the size of the data and allow users to import data for longer time ranges.
Download script
Because of the above problem we developed a Python script that downloads the files with a series of options, including filtering the files by primary category.
https://gist.github.com/marcelrf/4e45ef73e1a9d350c0170e1f78e58651
Data visualization
After installing Superset and connecting it to DuckDB, we tried the following visualizations using a dashboard. They are filtered by primary category 'Europeana'
- Screencast recording of the dashboard: https://drive.google.com/file/d/131To_1sJ5G4l2TIjBCleDSRbATl3yDny/view?usp=sharing (couldn't upload to Phab due to large file size)
Adding parents to category dataset
When working on data visualization, we discovered that we were not able to plot the category graph, because the parent_category information was missing from the category dataset. Thus, we modified the dataset pipeline and the dumps pipeline to include that information.
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1027525
Data issues
We discovered an issue in one of the datasets, commons_edits, where the category and primary_categories are missing from some rows.
We'll troubleshoot asap!





