Page MenuHomePhabricator

Establish a process for setting up local environments to explore the Commons Impact Metrics data dumps
Closed, ResolvedPublic

Description

Project Goal

Explore Commons Impact Metrics data dumps in local environment and create the docs so other people can recreate it.

Things we can try

  • Install an open-source database
  • Download the Commons Impact Metrics dumps
  • Import the data into the database
  • Write queries and explore the data
  • Install an open-source data visualization software
  • Create dashboards about Commons Impact Metrics
  • Bonus 1: Plot the category graph
  • Bonus 2: Set up a convenient docker image and publish it
  • We'll probably encounter issues and solve them together!

Possible Outcome

A documented process for setting up local environments for querying and visualizing Commons Impact Metrics data.
Here's the link to the notes document: https://docs.google.com/document/d/1qDx9-clzQGa6b3egCNc8jK0FSY1A0x6a75dNaYCTinc/edit

What we did:

Dumps download
We all were able to download the dumps manually without issues.
The bigger files took some significant time to download, but this depends highly on the user's bandwith.
Overall it seemed like an OK experience.

Import to database
We managed to import the data into DuckDB, SQLite, MySQL and MariaDB.
In general, we saw no critical issues.
Although, for MariaDB we had to spend some time adjusting collation configs to make it work.
A detail: Some relational databases do not support array types, and we needed to ingest the categories and primary_categories fields as strings.

Querying
In general the querying experience was OK.
Smaller datasets responded quickly without the use of indexes.
Queries on larger datasets took more time.
Unfortunately, the data model does not help to create indexes, since we'd want to create an index on the primary_categories field which is an array.
Thus, an alternative would be to pre-filter the data by primary_category; this would also help reduce the size of the data and allow users to import data for longer time ranges.

Download script
Because of the above problem we developed a Python script that downloads the files with a series of options, including filtering the files by primary category.
https://gist.github.com/marcelrf/4e45ef73e1a9d350c0170e1f78e58651

Data visualization
After installing Superset and connecting it to DuckDB, we tried the following visualizations using a dashboard. They are filtered by primary category 'Europeana'

Screenshot from 2024-05-05 12-39-08.png (808×1 px, 485 KB)

Screenshot from 2024-05-05 12-38-42.png (750×1 px, 410 KB)

Screenshot from 2024-05-05 12-38-34.png (750×1 px, 191 KB)

Screenshot from 2024-05-05 12-37-37.png (656×1 px, 79 KB)

Screenshot from 2024-05-05 12-38-17.png (536×1 px, 35 KB)

edits data - Commons Impact Metrics.png (932×1 px, 96 KB)

Adding parents to category dataset
When working on data visualization, we discovered that we were not able to plot the category graph, because the parent_category information was missing from the category dataset. Thus, we modified the dataset pipeline and the dumps pipeline to include that information.
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1027525

Data issues
We discovered an issue in one of the datasets, commons_edits, where the category and primary_categories are missing from some rows.
We'll troubleshoot asap!

Event Timeline

Hello! 👋 The 2024 Hackathon Program is open for scheduling! If you are still interested in organizing a session, you can claim a slot on a first-come, first-serve basis by adding your session to the daily program, following these instructions. We look forward to hearing your presentation!

Please be sure to add all the information needed to your task here, before scheduling a session on the daily program:

Title of session (the more detailed, the better!):
Session description:
Username for contact:
Session duration (25 or 50 min):
Session type (presentation, workshop, discussion, etc.):
Language of session (English, Arabic, etc.):
Prerequisites (some Python, etc.):
Any other details to share?:
Interested?

Change #1027525 had a related patch set uploaded (by Mforns; author: Mforns):

[analytics/refinery@master] Add parent categories to the Commons Impact Metrics dumps.

https://gerrit.wikimedia.org/r/1027525

Very very interesting! Seems to hopefully be useful as a successor to tools like https://glamtools.toolforge.org/baglama2/ :D

Change #1027525 merged by Mforns:

[analytics/refinery@master] Improve Commons Impact Metrics base pipeline and dumps

https://gerrit.wikimedia.org/r/1027525