
Share code between Research & ML teams
Closed, ResolvedPublic

Description

Goal: Make Research's reusable ML tooling easy to consume in ml-pipelines, reducing duplication and easing migrations—without heavy frameworks.

Deliverables:

  • Shared code project in ml-pipelines (name TBD)
    • Own deps + CI (tests incl. Spark snapshot, lint, build, publish)
    • Publish a wheel to GitLab PyPI for ML projects and Research
    • Seed from research-datasets (research_transformation, etc.); replace in-tree copies
    • Pin offline workload deps (e.g., Spark/Iceberg); notebook-friendly usage
    • Utilities only; no framework base classes
  • Command API (opt-in)
    • Generate typed dataclasses from entry-point args; use in Airflow tasks
    • Better defaults, type safety, IDE hints; easier usage in Jupyter
    • Pilot in one project before broader adoption
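The entry-point-to-dataclass idea behind the Command API can be sketched in a few lines. The following is a hypothetical illustration only — the `command` helper, the `train` entry point, and the generated `TrainArgs` name are invented for this example and are not the actual research-datasets implementation:

```python
import dataclasses
import inspect


def command(func):
    """Generate a typed dataclass from a function's signature (illustrative sketch).

    Each parameter becomes a dataclass field, preserving type annotations
    and defaults, which gives type safety and IDE hints for free.
    """
    fields = []
    for name, param in inspect.signature(func).parameters.items():
        ann = param.annotation if param.annotation is not inspect.Parameter.empty else str
        if param.default is not inspect.Parameter.empty:
            fields.append((name, ann, dataclasses.field(default=param.default)))
        else:
            fields.append((name, ann))
    cls = dataclasses.make_dataclass(func.__name__.title() + "Args", fields)
    # Attach a run() method that calls the original entry point with the field values.
    cls.run = lambda self: func(**dataclasses.asdict(self))
    return cls


# Hypothetical entry point, e.g. invoked from an Airflow task or a notebook.
def train(model_name: str, epochs: int = 3):
    return f"training {model_name} for {epochs} epochs"


TrainArgs = command(train)
args = TrainArgs(model_name="revert-risk")
print(args.run())  # -> training revert-risk for 3 epochs
```

In a notebook or an Airflow task, constructing `TrainArgs(...)` instead of passing raw CLI strings is what yields the "better defaults, type safety, IDE hints" mentioned above.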

Affected repos:

  • Migrate shared code from research-datasets → new shared project in ml-pipelines
  • Update dependent projects to import the published wheel
  • Minimal DAG updates in airflow-dags for pilot if needed

Out of Scope

  • Migrating specific pipelines (revert-risk, add-a-link, inference, etc.)
  • Mandating Command API across all pipelines

Acceptance

  • Shared project created; CI green; wheel published; brief docs
  • At least one ML project consumes the package
  • Command API used in one pilot

Details

Due Date
Dec 30 2025, 12:00 AM

Event Timeline

Thanks @fkaelin for creating this task. We discussed this with @isarantopoulos and there are a few things we should figure out before we start this migration:

  • Ownership of the migrated code
  • What pipelines from Research does ML need? Do we need to migrate everything?
  • What are the dependencies?
  • How do we work together for this migration (roles and responsibilities, workflows, etc.)?

@isarantopoulos please let me know if I am missing something. Happy to create a separate task for the above.

Miriam set Due Date to Sep 29 2025, 11:00 PM.Jul 28 2025, 9:53 AM
Miriam moved this task from Staged to In Progress on the Research board.
Miriam triaged this task as Medium priority.Aug 26 2025, 11:00 AM
Miriam changed Due Date from Sep 29 2025, 11:00 PM to Dec 30 2025, 12:00 AM.Oct 16 2025, 12:58 PM

Weekly updates

  • The suggested contributions are described in this doc
  • Ongoing discussion regarding ML dev tooling (with notebook support). Prototype repo is wmfing, replacing the archived research-commons repo.

Very cool!

The 'Command API' sounds like a very useful thing outside of just ML and Research jobs. As you build it, perhaps it would make sense to consult with Data Engineering @amastilovic @mforns @xcollazo etc. to see if there might be a way to build it for general use?

Weekly updates

  • Started implementation of common_utils as a shared project in ml-pipelines. Initial focus is on CI integration to publish a wheel, and on adding it as a dependency to another project.

Updates:

  • The previously unused common_utils module has been revamped into a shared Python project. A GitLab CI job can publish a versioned wheel to the package registry. Other projects in ml-pipelines can depend on the published wheel, or on a local path for development. This allows projects to depend on different versions of common_utils if needed. The shared code from the research-datasets repo was moved over unchanged, including the unit tests.
  • The add-a-link project in ml-pipelines now depends on the new common_utils project. Previously that shared code had been copied into the add-a-link project directly; the duplicate copy has been removed.
  • There is a new development Blubber variant that defines a base Docker image for development. This image can be used to run tests locally or to start a local Jupyter server. Documented in the README.
  • A development dependency on wmfing was added to facilitate development of Spark-based pipelines, with example usage for add-a-link documented here.
  • After closer inspection, it did not make sense to adopt the Command API from research-datasets in ml-pipelines. Because ml-pipelines is a mono-repo with many independent sub-projects, a unified Command API (which automatically creates dataclasses for all "entry-point" methods) would introduce undesired cross-project dependencies; the added complexity is not worth the benefit.
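For context, the CI publish step described above could look roughly like the following GitLab CI job. The stage name, Python image version, and tag-based release rule are illustrative assumptions, not the actual ml-pipelines configuration; the registry URL and token variables are GitLab's standard predefined CI variables:

```yaml
# Hypothetical sketch of a GitLab CI job that builds a versioned
# common_utils wheel and publishes it to the project's package registry.
publish-wheel:
  stage: publish
  image: python:3.11
  rules:
    - if: $CI_COMMIT_TAG          # publish only on tagged releases
  script:
    - pip install build twine
    - python -m build --wheel
    # CI_JOB_TOKEN authenticates against the project's own PyPI package registry
    - python -m twine upload
        --repository-url "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/pypi"
        --username gitlab-ci-token
        --password "${CI_JOB_TOKEN}"
        dist/*
```

Consuming projects can then point pip at the same registry URL as an extra index, or use a local path dependency during development as noted above.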

With this, the task can be marked as resolved.