
Share code between Research & ML teams
Open, Medium, Public

Description

Goal: Make Research's reusable ML tooling easy to consume in ml-pipelines, reducing duplication and easing migrations—without heavy frameworks.

Deliverables:

  • Shared code project in ml-pipelines (name TBD)
    • Own deps + CI (tests incl. Spark snapshot, lint, build, publish)
    • Publish a wheel to GitLab PyPI for ML projects and Research
    • Seed from research-datasets (research_transformation, etc.); replace in-tree copies
    • Pin offline workload deps (e.g., Spark/Iceberg); notebook-friendly usage
    • Utilities only; no framework base classes
  • Command API (opt-in)
    • Generate typed dataclasses from entry-point args; use in Airflow tasks
    • Better defaults, type safety, IDE hints; easier usage in Jupyter
    • Pilot in one project before broader adoption
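As a rough illustration of the Command API idea above, the sketch below derives a dataclass from an `argparse` parser and turns an instance back into an argument list that an Airflow task could pass to the entry point. It is only one possible approach, not the planned design: the parser, its flags, and the names `build_parser`, `dataclass_from_parser`, `to_argv`, and `ExampleJobArgs` are all hypothetical, and the code reads `parser._actions`, a private `argparse` attribute that could change between Python versions.

```python
import argparse
from dataclasses import field, fields, make_dataclass


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical entry-point parser standing in for a real pipeline CLI."""
    parser = argparse.ArgumentParser(prog="example_job")
    parser.add_argument("--snapshot", type=str, required=True)
    parser.add_argument("--wiki-db", type=str, default="enwiki")
    parser.add_argument("--max-rows", type=int, default=1000)
    return parser


def dataclass_from_parser(name: str, parser: argparse.ArgumentParser) -> type:
    """Build a dataclass whose fields mirror the parser's optional flags."""
    required, optional = [], []
    # Assumption: _actions is private to argparse; there is no public equivalent.
    for action in parser._actions:
        if action.dest == "help" or not action.option_strings:
            continue  # skip --help and positional arguments in this sketch
        arg_type = action.type if isinstance(action.type, type) else str
        meta = {"flag": action.option_strings[-1]}
        if action.required:
            required.append((action.dest, arg_type, field(metadata=meta)))
        else:
            optional.append(
                (action.dest, arg_type, field(default=action.default, metadata=meta))
            )
    # Fields without defaults must come before fields with defaults.
    return make_dataclass(name, required + optional)


def to_argv(args) -> list[str]:
    """Serialize a generated dataclass instance back into CLI arguments."""
    argv = []
    for f in fields(args):
        argv += [f.metadata["flag"], str(getattr(args, f.name))]
    return argv


ExampleJobArgs = dataclass_from_parser("ExampleJobArgs", build_parser())
job_args = ExampleJobArgs(snapshot="2025-06")  # defaults fill the other fields
print(to_argv(job_args))
```

Note that a class built at runtime with `make_dataclass` is invisible to static type checkers and IDEs, and dataclasses do not validate field types at runtime. Getting the IDE hints and type safety listed above would likely require emitting the dataclass as source code instead.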

Affected repos:

  • Migrate shared code from research-datasets → new shared project in ml-pipelines
  • Update dependent projects to import the published wheel
  • Minimal DAG updates in airflow-dags for pilot if needed

Out of Scope

  • Migrating specific pipelines (revert-risk, add-a-link, inference, etc.)
  • Mandating Command API across all pipelines

Acceptance

  • Shared project created; CI green; wheel published; brief docs
  • At least one ML project consumes the package
  • Command API used in one pilot

Details

Due Date
Dec 30 2025, 12:00 AM

Event Timeline

Thanks @fkaelin for creating this task. We discussed this with @isarantopoulos and there are a few things we should figure out before we start this migration:

  • Ownership of the migrated code
  • Which Research pipelines does ML need? Do we need to migrate everything?
  • What are the dependencies?
  • How do we work together on this migration (roles and responsibilities, workflows, etc.)?

@isarantopoulos please let me know if I am missing something. Happy to create a separate task for the above.

Miriam set Due Date to Mon, Sep 29, 11:00 PM. (Jul 28 2025, 9:53 AM)
Miriam moved this task from Staged to In Progress on the Research board.
Miriam triaged this task as Medium priority. (Aug 26 2025, 11:00 AM)
Miriam changed Due Date from Mon, Sep 29, 11:00 PM to Dec 30 2025, 12:00 AM. (Thu, Oct 16, 12:58 PM)

Weekly updates

  • The suggested contributions are described in this doc
  • Ongoing discussion regarding ML dev tooling (with notebook support). Prototype repo is wmfing, replacing the archived research-commons repo.

Very cool!

The 'Command API' sounds like a very useful thing outside of just ML and Research jobs. As you build it, perhaps it would make sense to consult with Data Engineering @amastilovic @mforns @xcollazo etc. to see if there might be a way to build it for general use?