[Analytics] Baseline metrics for data dumps
Closed, Resolved (Public)

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!

Purpose

We need the following baseline metrics for data dumps per month:

  • # downloads (full vs. partial, across the different dump types)
  • # downloads per unique IP × user agent combination
  • # unique user agents
  • Size of a full data dump in GB

Desired Outputs

Please list the desired outputs of this task.

Baseline metrics for Wikidata data dumps

Deadline

Please make the time sensitivity of this task clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

04.08.2025


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Explicitly define the metrics needed
  • Write all create table queries for the DAG jobs
  • Write all the computation queries for the DAGs
  • Test query processes in local DB schema
  • Write DAG(s)
  • Test DAG(s)
  • Deploy DAG(s)

Estimation

Estimate: 2-3 days
Actual:

Data

The tables that will be referenced in this task.

  • link_to_table

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

Ifrahkhanyaree_WMDE renamed this task from Baseline metrics for data dumps to [Analytics] Baseline metrics for data dumps. Jul 17 2025, 8:54 AM

Hallo halloo, just checking - would you have an idea of when I could have these numbers by?

Hellooo 👋 We just heard back that T403159 will still be a bit, so this will need to be a recurring task that is done manually. I'd been hoping that the work for this could be a DAG. I'll try to get the notebook set up and get the first few months of data by EOW or early next week at the latest, as I've been able to make a lot of progress on the ones that jumped ahead of this one :)

That sounds good! If I get some data for the last few months of this year, then I can use that to come up with a KR for next year. After that there's no urgency to this ticket, so no worries

Moving this into In Review :)

As of now we can't do a DAG for this process, so I'll be manually running a process defined in T399808_wd_dump_metrics.ipynb. We need to get the logs from stat1011, and then a process is run to generate the data for the wmde.wd_dump_request_metadata_monthly and wmde.wd_dump_metrics_monthly tables. This information is now also displayed on the wmde_wd_dump_metrics_monthly Superset dashboard.
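
For anyone picking this up later, here is a rough sketch of the kind of monthly aggregation the requested metrics imply. It is not the code from the notebook: the function name, the record keys (month, ip, user_agent, kind) and the assumption that each log line has already been parsed into a dict are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def monthly_dump_metrics(records):
    """Aggregate parsed download records into per-month baseline metrics.

    Each record is a dict with hypothetical keys (not the real log schema):
    'month' (e.g. '2025-07'), 'ip', 'user_agent', 'kind' ('full' or 'partial').
    """
    downloads = defaultdict(lambda: {"full": 0, "partial": 0})
    clients = defaultdict(set)  # month -> set of (ip, user_agent) pairs
    agents = defaultdict(set)   # month -> set of user agents

    for record in records:
        month = record["month"]
        downloads[month][record["kind"]] += 1
        clients[month].add((record["ip"], record["user_agent"]))
        agents[month].add(record["user_agent"])

    metrics = {}
    for month in sorted(downloads):
        total = downloads[month]["full"] + downloads[month]["partial"]
        metrics[month] = {
            "full_downloads": downloads[month]["full"],
            "partial_downloads": downloads[month]["partial"],
            "unique_clients": len(clients[month]),
            # Average number of downloads per unique IP x user agent pair.
            "downloads_per_unique_client": total / len(clients[month]),
            "unique_user_agents": len(agents[month]),
        }
    return metrics
```

Calling monthly_dump_metrics on a list of such dicts returns one entry per month, which maps loosely onto a row per month in a metrics table. The size of a full dump in GB is not covered here, since it comes from the dump files rather than the request logs.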