Project Goal
---
Explore [[ https://wikitech.wikimedia.org/wiki/Commons_Impact_Metrics | Commons Impact Metrics ]] data dumps in a local environment and document the process so other people can reproduce it.
Things we can try
---
* Install an open-source database
* Download the Commons Impact Metrics dumps
* Import the data into the database
* Write queries and explore the data
* Install an open-source data visualization software
* Create dashboards about Commons Impact Metrics
* Bonus 1: Plot the category graph
* Bonus 2: Set up a convenient Docker image and publish it
* We'll probably encounter issues and solve them together!
Possible Outcome
---
A documented process for setting up local environments for querying and visualizing Commons Impact Metrics data.
Here's the link to the notes document: https://docs.google.com/document/d/1qDx9-clzQGa6b3egCNc8jK0FSY1A0x6a75dNaYCTinc/edit
What we did
---
**Dumps download**
We were all able to download the dumps manually without issues.
The bigger files took significant time to download, but this depends heavily on the user's bandwidth.
Overall it seemed like an OK experience.
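For reference, a minimal download sketch in Python, assuming the dumps are served from the public dumps server (the exact directory layout and file name below are illustrative, not the real paths):

```
lang=python
import requests

# Illustrative URL; check https://dumps.wikimedia.org/other/ for the
# actual Commons Impact Metrics directory layout and file names.
URL = "https://dumps.wikimedia.org/other/commons_impact_metrics/example_dump.tsv.bz2"

# Stream the response so large dump files don't have to fit in memory.
with requests.get(URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("example_dump.tsv.bz2", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```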
**Import to database**
We managed to import the data into DuckDB, SQLite, MySQL and MariaDB.
In general, we saw no critical issues.
However, for MariaDB we had to spend some time adjusting collation configs to make it work.
One detail: some relational databases do not support array types, so we had to ingest the categories and primary_categories fields as strings.
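As an example, a minimal DuckDB import sketch (the file and table names are illustrative, and the dump is assumed to be decompressed to plain TSV first):

```
lang=python
import duckdb

con = duckdb.connect("cim.duckdb")

# Illustrative file/table names. read_csv_auto infers column types;
# the categories / primary_categories fields come in as plain strings,
# which sidesteps the missing-array-type problem described above.
con.execute("""
    CREATE TABLE category_metrics AS
    SELECT * FROM read_csv_auto('category_metrics.tsv', delim='\t', header=true)
""")
```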
**Querying**
In general, the querying experience was OK.
Smaller datasets responded quickly without the need for indexes.
Queries on larger datasets took more time.
Unfortunately, the data model does not lend itself to indexing: we'd want an index on the primary_categories field, which is an array (or a serialized string, see above).
Thus, an alternative is to pre-filter the data by primary category; this also reduces the size of the data and allows users to import data for longer time ranges.
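For example, with the string-encoded field from the import sketch above, a DuckDB query can still filter on a single primary category (the '|' separator is an assumption; adjust it to whatever delimiter the dumps actually use):

```
lang=python
import duckdb

con = duckdb.connect("cim.duckdb")

# Split the serialized primary_categories string back into a list and
# keep only rows that mention the category we care about.
rows = con.execute("""
    SELECT *
    FROM category_metrics
    WHERE list_contains(string_split(primary_categories, '|'), 'Europeana')
""").fetchall()
print(len(rows))
```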
**Download script**
To work around this, we developed a Python script that downloads the files with a number of options, including filtering the files by primary category.
https://gist.github.com/marcelrf/4e45ef73e1a9d350c0170e1f78e58651
**Data visualization**
After installing Superset and connecting it to DuckDB, we tried the following visualizations in a dashboard. They are filtered by primary category 'Europeana':
{F50317531}
{F50317530}
{F50317526}
{F50317524}
{F50317523}
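For the connection itself, Superset talks to DuckDB through SQLAlchemy. A quick way to sanity-check the database URI before pasting it into Superset's "Connect a database" dialog (this assumes the community duckdb-engine package is installed, and reuses the table name from the import sketch above):

```
lang=python
# Requires: pip install duckdb-engine
from sqlalchemy import create_engine, text

# The same URI goes into Superset's SQLAlchemy URI field.
engine = create_engine("duckdb:///cim.duckdb")

with engine.connect() as conn:
    print(conn.execute(text("SELECT count(*) FROM category_metrics")).scalar())
```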
**Adding parents to category dataset**
When working on data visualization, we discovered that we could not plot the category graph, because the parent category information was missing from the category dataset. Thus, we modified the dataset pipeline and the dumps pipeline to include that information.
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1027525
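Once the parent information lands in the dumps, plotting the graph could look roughly like this (the column and table names are assumptions; the patch above defines the actual schema):

```
lang=python
import duckdb
import networkx as nx

con = duckdb.connect("cim.duckdb")

# Hypothetical columns: 'category' plus a serialized 'parent_categories'
# string, split on '|' like the other array fields.
g = nx.DiGraph()
for category, parents in con.execute(
    "SELECT category, parent_categories FROM category_metrics"
).fetchall():
    for parent in (parents or "").split("|"):
        if parent:
            g.add_edge(parent, category)  # edge direction: parent -> child

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```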
**Data issues**
We discovered an issue in one of the datasets, commons_edits, where the categories and primary_categories fields are missing from some rows.
We'll troubleshoot ASAP!
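A quick way to quantify the issue while troubleshooting (assuming the commons_edits dump has been imported into a table of the same name):

```
lang=python
import duckdb

con = duckdb.connect("cim.duckdb")

# Count affected rows; whether missing values show up as NULL or as an
# empty string depends on how the TSV import represented them.
print(con.execute("""
    SELECT count(*)
    FROM commons_edits
    WHERE categories IS NULL OR categories = ''
       OR primary_categories IS NULL OR primary_categories = ''
""").fetchone()[0])
```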