
Data infrastructure for research datasets
Closed, Resolved (Public)

Description

Event Timeline

Weekly update:

  • transform the code into a Python package for library use
  • WIP: deploy Airflow DAGs

updates:

  • added the research-transform Python package for reusable, research-focused code
  • added WIP knowledge gap code
  • added instructions for interactive development, either from a Jupyter notebook or from IPython/VS Code

updates:

  • added a generic Airflow DAG template for Spark jobs
  • added a configurable Airflow DAG to download media from Swift
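
For illustration, a generic Spark job template can be thought of as a function from a small job config to a spark-submit command line. The sketch below is not the code in the research repository: SparkJobConfig, build_spark_submit_args, and all example values are hypothetical, and it uses only the Python standard library, so wiring the result into an Airflow operator is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SparkJobConfig:
    """Hypothetical per-job settings that a generic DAG template could accept."""

    name: str
    application: str  # path to the PySpark entry point
    master: str = "yarn"
    conf: Dict[str, str] = field(default_factory=dict)
    application_args: List[str] = field(default_factory=list)


def build_spark_submit_args(job: SparkJobConfig) -> List[str]:
    """Turn a job config into a spark-submit argument list."""
    args = ["spark-submit", "--master", job.master, "--name", job.name]
    # Sort the Spark properties so the generated command is deterministic.
    for key, value in sorted(job.conf.items()):
        args += ["--conf", f"{key}={value}"]
    # The application path comes last, followed by its own arguments.
    args.append(job.application)
    args += job.application_args
    return args
```

With a template of this shape, each new job only has to supply a config object (name, entry point, Spark properties, arguments) rather than its own DAG boilerplate.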

leila renamed this task from "Wikimedia research code repository" to "Implement research-dataset library". Apr 4 2023, 7:24 PM
leila triaged this task as High priority.
leila moved this task from In Progress to FY2022-23-Research-April-June on the Research board.
fkaelin renamed this task from "Implement research-dataset library" to "Data infrastructure for research datasets". Apr 15 2023, 2:56 AM
fkaelin updated the task description.

Updates

  • Research now has a dedicated HDFS data directory: /wmf/data/research
  • Moved datasets: pageviews_daily (used by knowledge_gaps), HTML enterprise dumps, wikidiff
  • Next up: article quality, content gaps

Updates

This work is done; resolving the task.