- provide a "home" for research datasets in the data engineering infrastructure
- datasets generated via scheduled airflow jobs
- library to access datasets using spark in https://gitlab.wikimedia.org/repos/research/research-common
Description
Event Timeline
updates:
- created wikimedia gitlab repo for ml related research code https://gitlab.wikimedia.org/FabianKaelin/research-ml
- added research-transform python package for reusable research focused code
- added wip knowledge gap code
- instructions for interactive development, either from a jupyter notebook or from ipython/VS Code
updates:
- generic spark job airflow dag template
- configurable airflow dag to download media from swift
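To illustrate the "generic spark job" idea, here is a minimal sketch of the configurable part of such a template: a small config object rendered into a `spark-submit` command line. It is not the actual DAG template. `SparkJobConfig`, `spark_submit_command`, and the example job values are hypothetical names chosen for illustration, and the Airflow side (for example passing the command to an operator) is left out so the snippet stays dependency-free.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class SparkJobConfig:
    """Hypothetical per-job settings a generic DAG template could accept."""
    name: str
    application: str                      # path to the job's entry point
    application_args: list = field(default_factory=list)
    conf: dict = field(default_factory=dict)
    master: str = "yarn"                  # assumption: jobs run on YARN
    deploy_mode: str = "cluster"


def spark_submit_command(job: SparkJobConfig) -> str:
    """Render the config as a shell-safe spark-submit command string."""
    cmd = [
        "spark-submit",
        "--master", job.master,
        "--deploy-mode", job.deploy_mode,
        "--name", job.name,
    ]
    # Sort the conf entries so the rendered command is deterministic.
    for key, value in sorted(job.conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(job.application)
    cmd += [str(arg) for arg in job.application_args]
    return shlex.join(cmd)


# Example values are made up for illustration.
example_job = SparkJobConfig(
    name="example_job",
    application="example_job.py",
    application_args=["--output", "/tmp/example output"],
    conf={"spark.executor.memory": "4g"},
)
example_command = spark_submit_command(example_job)
```

Keeping the job description as plain data means one template can serve many jobs, with each job supplying only its own config.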
Updates
- Research now has a dedicated hdfs data directory: /wmf/data/research
- Moved datasets: pageviews_daily (used by knowledge_gaps), html enterprise dumps, wikidiff
- Next up: article quality, content gaps
Updates
- moved article features to the research data directory, populated by an incremental airflow dag
- the article features are available as an external hive table, research.article_features; the article quality scores are stored in research.article_quality_scores
- a library wrapper to access datasets generated by research via spark is available in https://gitlab.wikimedia.org/repos/research/research-common/-/blob/main/research_common/datasets.py
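As a rough sketch of what a dataset wrapper of this kind could look like (not the actual contents of research_common/datasets.py): a small registry maps dataset names to Hive tables and a helper loads them through a Spark session. `HiveDataset`, `DATASETS`, and `load` are hypothetical names; only the two table names come from the update above. The `spark` argument is assumed to be a `pyspark.sql.SparkSession`, whose `table()` method returns a DataFrame for a named table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiveDataset:
    """Hypothetical description of a research dataset backed by a Hive table."""
    name: str
    table: str


# Table names taken from the update above; the registry itself is illustrative.
DATASETS = {
    "article_features": HiveDataset(
        "article_features", "research.article_features"
    ),
    "article_quality_scores": HiveDataset(
        "article_quality_scores", "research.article_quality_scores"
    ),
}


def load(spark, name: str):
    """Return the dataset as a DataFrame via spark.table().

    `spark` is assumed to be a SparkSession (anything with a .table() method works).
    """
    try:
        dataset = DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(DATASETS))
        raise KeyError(f"unknown dataset {name!r}; known datasets: {known}") from None
    return spark.table(dataset.table)
```

With a live session, usage would be along the lines of `load(spark, "article_features")`. Centralising the table names in one place means callers do not need to hard-code them.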