Generate training/evaluation datasets using airflow
Closed, Resolved | Public

Description

Add support for generating training/test/evaluation datasets for
revert risk models using airflow dags

  • discuss and decide on output format
  • store in /wmf/data/research/datasets/
  • integrate with datahub for discoverability
  • should this job be scheduled, e.g. monthly?
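One way to make the storage and scheduling questions above concrete is a small path helper. This is only a sketch under stated assumptions: the base directory comes from the description, but the `snapshot=YYYY-MM` / `split=...` layout, the split names, and the function name `monthly_output_path` are hypothetical and not a decided format.

```python
from datetime import date
from pathlib import PurePosixPath

# Base directory taken from the task description.
DATASETS_ROOT = PurePosixPath("/wmf/data/research/datasets")

# Hypothetical split names; the task only says training/test/evaluation.
SPLITS = {"train", "test", "eval"}


def monthly_output_path(dataset_name: str, run_date: date, split: str) -> str:
    """Build an assumed output path for one split of a monthly run."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    # One snapshot per calendar month, matching a possible monthly schedule.
    snapshot = f"{run_date.year:04d}-{run_date.month:02d}"
    path = DATASETS_ROOT / dataset_name / f"snapshot={snapshot}" / f"split={split}"
    return str(path)
```

With a layout like this, a monthly run would write each split to its own directory, and a rerun for the same month would target the same paths.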

Details

Due Date
Nov 30 2023, 5:00 AM
Related merge request:
  • Title: Add spark job for generating revertrisk multilingual datasets
  • Reference: repos/research/revertrisk-datasets!1
  • Author: mnz
  • Source Branch: mnz/multilingual
  • Dest Branch: main

Event Timeline

fkaelin created this task.
fkaelin moved this task from Backlog to Staged on the Research board.
fkaelin triaged this task as High priority. Jul 27 2023, 6:20 PM
fkaelin set Due Date to Sep 8 2023, 4:00 AM.

The first dataset generation pipeline (https://gitlab.wikimedia.org/repos/research/revertrisk-datasets) is complete, including the corresponding Airflow DAG (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/research/dags/revert_risk_multilingual_dataset_dag.py). For ML training datasets, this is a good starting point for gaining the experience needed to answer the open questions in the description.

Research engineering is starting work on two related projects:

  • T343065 is a pipeline that creates a dataset used to populate a dashboard
  • A pipeline to create embedding datasets for use in projects based on vector similarity search

This has led to additional questions around code organization and code reuse beyond the machine learning pipeline use case, and the answers to these questions will be useful for this task as well. Pushing back the due date, as we will revisit this task after progress has been made on the related work mentioned above.

fkaelin lowered the priority of this task from High to Medium. Oct 6 2023, 2:41 AM
fkaelin changed Due Date from Sep 8 2023, 4:00 AM to Nov 30 2023, 5:00 AM.

@MunizaA / @fkaelin: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!