Generate training/evaluation datasets using airflow
Closed, Resolved
Public

Assigned To

	MunizaA

Authored By

	fkaelin
	Jul 27 2023, 5:50 PM

Description

Add support for generating training/test/evaluation datasets for
revert risk models using airflow dags

discuss and decide on output format
store in /wmf/data/research/datasets/
integrate with datahub for discoverability
should this job be scheduled, e.g. monthly?
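
To make the open questions above more concrete, here is a minimal sketch of one possible approach, not the one implemented in revertrisk-datasets. It assumes rows are keyed by an integer revision ID, that splits are assigned deterministically by hashing that ID, and that each monthly run writes to its own partition under /wmf/data/research/datasets/. The function names, the split percentages, and the `snapshot=`/`split=` path layout are all hypothetical.

```python
import hashlib
from datetime import date

# Base directory taken from the task description.
BASE_PATH = "/wmf/data/research/datasets"


def assign_split(revision_id: int, test_pct: int = 10, eval_pct: int = 10) -> str:
    """Deterministically map a revision ID to 'train', 'test', or 'eval'.

    Hashing the ID (instead of sampling randomly) keeps a revision in the
    same split across reruns. The percentages are illustrative defaults.
    """
    digest = hashlib.sha256(str(revision_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + eval_pct:
        return "eval"
    return "train"


def output_path(dataset: str, snapshot: date, split: str) -> str:
    """Build a hypothetical partitioned output path for one monthly run."""
    return f"{BASE_PATH}/{dataset}/snapshot={snapshot:%Y-%m}/split={split}"
```

If the job were scheduled monthly, the DAG could pass its logical date as `snapshot`, so each run lands in a separate partition and earlier datasets stay reproducible.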

Details

Due Date: Nov 30 2023, 5:00 AM

	Title	Reference	Author	Source Branch	Dest Branch
	Add spark job for generating revertrisk multilingual datasets	repos/research/revertrisk-datasets!1	mnz	mnz/multilingual	main


Related Objects

		Status	Subtype	Assigned	Task
		Open		fkaelin	T341817 Standardize research pipelines - Dataset generation
		Resolved		MunizaA	T342915 Generate training/evaluation datasets using airflow

Event Timeline

fkaelin reassigned this task from fkaelin to MunizaA.Jul 27 2023, 5:50 PM

fkaelin created this task.

fkaelin moved this task from Backlog to Staged on the Research board.

fkaelin triaged this task as High priority.Jul 27 2023, 6:20 PM

fkaelin set Due Date to Sep 8 2023, 4:00 AM.

fkaelin mentioned this in T343063: Multilingual revert risk pipeline.Jul 29 2023, 3:07 AM

fkaelin mentioned this in T341817: Standardize research pipelines - Dataset generation.Jul 29 2023, 5:12 AM

mnz opened https://gitlab.wikimedia.org/repos/research/revertrisk-datasets/-/merge_requests/1

Draft: Add spark job for generating revertrisk multilingual datasets

mnz merged https://gitlab.wikimedia.org/repos/research/revertrisk-datasets/-/merge_requests/1

Add spark job for generating revertrisk multilingual datasets

Maintenance_bot removed a project: Patch-For-Review.Sep 1 2023, 3:10 PM

The first dataset generation pipeline (https://gitlab.wikimedia.org/repos/research/revertrisk-datasets) is complete, including the corresponding airflow dag (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/research/dags/revert_risk_multilingual_dataset_dag.py). For ML training datasets, this is a good starting point to collect more experience to answer open questions in the description.

Research engineering is starting work on two related projects:

T343065 is a pipeline that creates a dataset used to populate a dashboard
A pipeline to create embedding datasets for use in projects based on vector similarity search

This has led to additional questions around code organization and code re-use beyond the use case of machine learning pipelines; the answers to these questions will be useful for this task as well. Pushing back the due date, as we will revisit this task after progress has been made on the related work mentioned above.

fkaelin lowered the priority of this task from High to Medium.Oct 6 2023, 2:41 AM

fkaelin changed Due Date from Sep 8 2023, 4:00 AM to Nov 30 2023, 5:00 AM.

@MunizaA / @fkaelin: Hi, the Due Date set for this open task passed a while ago.
Could you please either update or reset the Due Date (by clicking Edit Task), or set the status of this task to resolved in case this task is done? Thanks!

fkaelin closed this task as Resolved.Tue, Apr 16, 2:54 PM
